So far, we have learned several techniques for predicting the value of a dependent variable from its independent features. In this lesson, we look at an algorithm based on support vectors that can perform both linear and nonlinear regression.
Support Vector Regression (SVR) is a supervised learning model. In the previous lessons, we learned that the goal of linear regression is to minimize the error between the predictions and the data. The goal of SVR is different: it tries to keep the errors within a chosen threshold, commonly written as epsilon. In SVR, we fit as many instances as possible inside the margin (the band between the two boundary lines around the fitted function) while limiting margin violations. The performance of an SVR model depends on its hyperparameters, such as the kernel, the regularization parameter C, and the margin width epsilon.
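The idea of tolerating errors up to a threshold can be illustrated with the epsilon-insensitive loss that is commonly associated with SVR. The snippet below is a minimal sketch, not part of the lesson's implementation: the helper name `epsilon_insensitive_loss` and the sample values are assumptions chosen purely for illustration.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon band, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


# Hypothetical targets and predictions, used only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 3.0, 3.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

Predictions that land within 0.1 of the target contribute no loss at all, while larger misses are penalized only by the amount they exceed the threshold.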
This section will walk you through a step-wise Python implementation of the prediction process that we just discussed.
First, let us import some essential Python libraries.
    # Importing the libraries
    import numpy as np # for array operations
    import pandas as pd # for working with DataFrames
    import requests, io # for HTTP requests and I/O commands
    import matplotlib.pyplot as plt # for data visualization
    %matplotlib inline

    # scikit-learn modules
    from sklearn.model_selection import train_test_split # for splitting the data
    from sklearn.metrics import mean_squared_error # for calculating the cost function
    from sklearn.preprocessing import StandardScaler # for scaling the data
    from sklearn.svm import SVR # for building the model
For this problem, we will use a CSV dataset of temperature and pressure logs, fetched from the URL given in the code that follows. We download the file with requests, read it with the read_csv() function from the pandas library, and store it as a pandas DataFrame object.
    # Importing the dataset from the url of the data set
    url = "https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/pressure.csv"
    data = requests.get(url).content

    # Reading the data
    dataset = pd.read_csv(io.StringIO(data.decode('utf-8')), index_col = 'Unnamed: 0')
    dataset.head()

    dataset.describe()
After loading the dataset, the independent variable ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the feature (Temperature) and the target variable (Pressure).
    x = dataset.iloc[:, [0]].values # Temperature values
    y = dataset.iloc[:, [1]].values # Pressure values
The data is scaled using the StandardScaler class of scikit-learn, which standardizes each column to zero mean and unit variance.
    # Feature scaling
    sc_x = StandardScaler()
    x = sc_x.fit_transform(x)
    sc_y = StandardScaler()
    y = sc_y.fit_transform(y)
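To see what standardization does to a column of values, here is a small sketch on a few made-up readings. The numbers and variable names are hypothetical and are not taken from the pressure dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings, one feature per column (StandardScaler expects 2-D input)
readings = np.array([[0.0], [20.0], [40.0], [60.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(readings)

print(scaled.ravel())          # standardized values
print(scaler.mean_[0])         # mean learned from the data
print(scaler.scale_[0])        # standard deviation learned from the data

# inverse_transform maps standardized values back to the original units
restored = scaler.inverse_transform(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, and the fitted scaler keeps the statistics needed to undo the transformation later.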
We use the train_test_split() function of scikit-learn to split the data into a training set and a test set. We will use 20% of the available data as the test set and the remaining 80% as the training set.
    # Splitting the dataset into training and testing set (80/20)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)
After splitting the data into training and test sets, we fit the Support Vector Regression model to the training data using the SVR class from scikit-learn.
    # Initializing the SVR model with an RBF kernel
    model = SVR(kernel = 'rbf')

    # Fitting the SVR model to the data
    model.fit(x_train, y_train.ravel())
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
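The printed representation lists the hyperparameters the model was built with, including `C` and `epsilon`. As a rough illustration of how `epsilon` influences the fit, the sketch below trains SVR models on synthetic data (a noisy sine curve, an assumption made for this example and unrelated to the pressure dataset) and counts the support vectors each one keeps.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic, hypothetical data: a noisy sine curve (not the pressure dataset)
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(-3, 3, size=80)).reshape(-1, 1)
y_demo = np.sin(x_demo).ravel() + rng.normal(scale=0.1, size=80)

support_counts = {}
for eps in (0.01, 0.1, 0.5):
    demo_model = SVR(kernel="rbf", C=1.0, epsilon=eps)
    demo_model.fit(x_demo, y_demo)
    # support_ holds the indices of the training points kept as support vectors
    support_counts[eps] = len(demo_model.support_)

print(support_counts)
```

A wider band typically leaves more points inside the tolerance and so tends to need fewer support vectors, though the exact counts depend on the data and the other hyperparameters.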
Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model with the use of a cost function called the Root Mean Square Error (RMSE).
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$$
where,
$y_i$ is the actual target value,
$\hat{y}_{i}$ is the predicted target value, and
$n$ is the total number of data points.
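The RMSE measures the absolute fit of the model to the data. In other words, it indicates how close the model's predicted values are to the actual data points, in the same units as the target. A lower RMSE indicates a better fit, which makes it a useful measure of the accuracy of the model's predictions. Because the target was standardized earlier, the RMSE computed here is expressed in standard deviations of pressure rather than in the original pressure units.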
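As a quick check of the formula, the RMSE can be computed by hand with NumPy and compared against scikit-learn's mean_squared_error. The values below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# Squared errors are 1, 0, 1, 4, so the mean squared error is 6 / 4 = 1.5
rmse_manual = np.sqrt(np.mean((y_actual - y_hat) ** 2))

# Same quantity via scikit-learn: square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_actual, y_hat))

print(rmse_manual, rmse_sklearn)
```

Both routes give the square root of 1.5, roughly 1.225.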
    # Predicting the target values of the test set
    y_pred = model.predict(x_test)

    # RMSE (Root Mean Square Error)
    rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
    print("\nRMSE: ", rmse)
RMSE: 1.005
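Since the target was passed through a scaler before training, this RMSE is in standardized units. One possible way to report the error in the original units is to undo the scaling with inverse_transform before computing the metric. The sketch below assumes synthetic stand-in data (the arrays `x_raw` and `y_raw` are hypothetical, not the pressure dataset) and otherwise mirrors the steps of the lesson.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical stand-in data: a noisy quadratic trend
rng = np.random.default_rng(0)
x_raw = np.linspace(0, 100, 60).reshape(-1, 1)
y_raw = 0.05 * x_raw ** 2 + rng.normal(scale=5.0, size=(60, 1))

# Same preprocessing steps as in the lesson
sc_x, sc_y = StandardScaler(), StandardScaler()
x_scaled = sc_x.fit_transform(x_raw)
y_scaled = sc_y.fit_transform(y_raw)
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y_scaled, test_size=0.2, random_state=28
)

model = SVR(kernel="rbf")
model.fit(x_train, y_train.ravel())
y_pred_scaled = model.predict(x_test)

# RMSE in standardized units
rmse_scaled = np.sqrt(mean_squared_error(y_test.ravel(), y_pred_scaled))

# inverse_transform expects 2-D input, so reshape the 1-D predictions
y_pred_orig = sc_y.inverse_transform(y_pred_scaled.reshape(-1, 1))
y_test_orig = sc_y.inverse_transform(y_test)

# RMSE in the original units of the target
rmse_orig = np.sqrt(mean_squared_error(y_test_orig, y_pred_orig))
print(rmse_scaled, rmse_orig)
```

Because standardization is a linear transformation, the RMSE in original units equals the standardized RMSE multiplied by the standard deviation stored in `sc_y.scale_`.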
The final code for the implementation of Support Vector Regression in Python is as follows.
    # Importing the libraries
    import numpy as np # for array operations
    import pandas as pd # for working with DataFrames
    import requests, io # for HTTP requests and I/O commands
    import matplotlib.pyplot as plt # for data visualization
    %matplotlib inline

    # scikit-learn modules
    from sklearn.model_selection import train_test_split # for splitting the data
    from sklearn.metrics import mean_squared_error # for calculating the cost function
    from sklearn.preprocessing import StandardScaler # for scaling the data
    from sklearn.svm import SVR # for building the model

    # Importing the dataset from the url of the data set
    url = "https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/pressure.csv"
    data = requests.get(url).content

    # Reading the data
    dataset = pd.read_csv(io.StringIO(data.decode('utf-8')), index_col = 'Unnamed: 0')

    x = dataset.iloc[:, [0]].values # Temperature values
    y = dataset.iloc[:, [1]].values # Pressure values

    # Feature scaling
    sc_x = StandardScaler()
    x = sc_x.fit_transform(x)
    sc_y = StandardScaler()
    y = sc_y.fit_transform(y)

    # Splitting the dataset into training and testing set (80/20)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

    # Initializing the SVR model with an RBF kernel
    model = SVR(kernel = 'rbf')

    # Fitting the SVR model to the data
    model.fit(x_train, y_train.ravel())

    # Predicting the target values of the test set
    y_pred = model.predict(x_test)

    # RMSE (Root Mean Square Error)
    rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
    print("\nRMSE: ", rmse)
In this lesson, we learned about Support Vector Regression and its implementation in Python.
This marks the end of the Regression section of this course. We will now move on to the concept of classification and implement different kinds of classification algorithms in machine learning using Python.