Support Vector Regression


So far, we have learned about various techniques that predict the value of a dependent variable from its independent features. In this lesson, we will look at an algorithm based on support vectors that can be used to perform both linear and non-linear regression.


What is Support Vector Regression?

Support Vector Regression (SVR) is a supervised learning model that can be used to perform both linear and non-linear regressions. In the previous lessons, we learned that the goal of linear regression is to minimize the error between the predictions and the data. The goal of Support Vector Regression, in contrast, is to make sure that the errors do not exceed a set threshold, $\epsilon$: the model fits as many instances as possible between the boundary lines while limiting margin violations. The following concepts and hyperparameters determine the performance of an SVR model.

  • Kernel: The function used to map lower-dimensional data into a higher-dimensional space where a linear fit becomes possible.
  • Hyperplane: The line (or surface, in higher dimensions) fitted to the data. In Support Vector Regression, the hyperplane is used to predict the continuous target value.
  • Boundary lines: The two lines drawn at a distance of $\epsilon$ on either side of the hyperplane. The best fit is the hyperplane with the maximum number of points inside its boundary lines; the formulation after this list makes this precise.
  • Support Vectors: The data points that lie closest to the boundary lines, on or outside them. These points determine the position of the hyperplane.
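
Formally, for a linear kernel, SVR finds the flattest function whose deviations from the targets are at most $\epsilon$, allowing violations through slack variables. A standard formulation of this $\epsilon$-insensitive objective (which scikit-learn's SVR optimizes in kernelized form) is:

$$\min_{w,\, b}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\left(\xi_{i} + \xi_{i}^{*}\right)$$

subject to

$$y_{i} - (w^{T}x_{i} + b) \leq \epsilon + \xi_{i},\qquad (w^{T}x_{i} + b) - y_{i} \leq \epsilon + \xi_{i}^{*},\qquad \xi_{i},\,\xi_{i}^{*} \geq 0,$$

where $C$ controls the trade-off between the flatness of the function and the amount by which deviations larger than $\epsilon$ are tolerated.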

Support Vector Regression in Python


This section will walk you through a step-wise Python implementation of the prediction process that we just discussed.

1. Importing necessary libraries

First, let us import some essential Python libraries.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.preprocessing import StandardScaler # for scaling the data
from sklearn.svm import SVR # for building the model

2. Importing the data set

For this problem, we will load a CSV data set consisting of temperature and pressure logs, fetched from the URL shown in the code below. We will read the data using the read_csv() function from the pandas library and store it as a pandas DataFrame object.

# Importing the dataset from the url of the data set
url = "https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/pressure.csv"
data = requests.get(url).content

# Reading the data
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')), index_col = 'Unnamed: 0')
dataset.head()
dataset.describe()

3. Separating the features and the target variable

After loading the dataset, the independent variable ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the feature (Temperature) and the target variable (Pressure).

x = dataset.iloc[:, [0]].values # Temperature values
y = dataset.iloc[:, [1]].values # Pressure values

4. Feature Scaling

The data is scaled using the StandardScaler() class from scikit-learn, which standardizes the values to zero mean and unit variance.

# Feature scaling
sc_x = StandardScaler()
x = sc_x.fit_transform(x)

sc_y = StandardScaler()
y = sc_y.fit_transform(y)
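
Note that because both $x$ and $y$ have been standardized, the model's predictions (and any error metrics computed from them) will be in scaled units. If you need predictions back in the original pressure units, the fitted scaler can invert the transformation. A minimal sketch (the values below are illustrative placeholders, not real model output):

# Mapping scaled predictions back to the original pressure units
# (illustrative values; in practice these would come from model.predict())
y_pred_scaled = np.array([[0.5], [-1.2]])
y_pred_original = sc_y.inverse_transform(y_pred_scaled)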

5. Splitting the data into a train set and a test set

We use the train_test_split() function from scikit-learn to split the data into a train set and a test set. We will use 20% of the available data as the testing set and the remaining data as the training set.

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

6. Fitting the model to the training set

After splitting the data into training and testing sets, the Support Vector Regression model is fitted to the training data using the SVR() class from scikit-learn. Note that y_train is flattened with ravel() because SVR expects a one-dimensional target array.

# Initializing the SVR model with the RBF kernel
model = SVR(kernel = 'rbf')

# Fitting the SVR model to the data
model.fit(x_train, y_train.ravel())
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
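
If you would like to see which training points the model kept as support vectors, the fitted SVR object exposes them directly, for example:

# Inspecting the support vectors of the fitted model
print(model.support_vectors_.shape) # (number of support vectors, number of features)
print(model.support_) # indices of the support vectors within x_train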

7. Calculating the loss after training

Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model, using the Root Mean Square Error (RMSE) as the cost function.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}$$

where,
$y_i$ is the actual target value, 
$\hat{y_{i}}$ is the predicted target value, and
$n$ is the total number of data points.

The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the actual data points are to the model's predicted values. A low RMSE indicates a better fit. Keep in mind that because the target was standardized, this RMSE is in scaled units rather than in the original pressure units.
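
For example, if the actual targets are $y = (1, 2, 3)$ and the predictions are $\hat{y} = (2, 2, 4)$, then

$$RMSE = \sqrt{\frac{(1-2)^{2} + (2-2)^{2} + (3-4)^{2}}{3}} = \sqrt{\frac{2}{3}} \approx 0.816$$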

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
print("\nRMSE: ", rmse)
RMSE: 1.005
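
Since matplotlib was imported for visualization, we can also plot the fitted curve against the data. A minimal sketch, plotted on the scaled values since that is what the model was trained on:

# Visualizing the SVR fit on the full (scaled) data
plt.scatter(x, y, color = 'blue', label = 'Actual data')
order = x[:, 0].argsort() # sort by temperature so the curve plots smoothly
plt.plot(x[order], model.predict(x)[order], color = 'red', label = 'SVR prediction')
plt.xlabel('Temperature (scaled)')
plt.ylabel('Pressure (scaled)')
plt.legend()
plt.show()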

Putting it all together

The final code for the implementation of Support Vector Regression in Python is as follows.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.preprocessing import StandardScaler # for scaling the data
from sklearn.svm import SVR # for building the model

# Importing the dataset from the url of the data set
url = "https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/pressure.csv"
data = requests.get(url).content

# Reading the data
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')), index_col = 'Unnamed: 0')

x = dataset.iloc[:, [0]].values #Temperature values
y = dataset.iloc[:, [1]].values #Pressure values

# Feature Scaling
sc_x = StandardScaler()
x = sc_x.fit_transform(x)

sc_y = StandardScaler()
y = sc_y.fit_transform(y)

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

# Initializing the SVR model with the RBF kernel
model = SVR(kernel = 'rbf')

# Fitting the SVR model to the data
model.fit(x_train, y_train.ravel())

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
print("\nRMSE: ", rmse)

In this lesson, we learned about Support Vector Regression along with its implementation in Python.


This marks the end of the Regression section of this course. We will now move on to the concept of classification and implement different kinds of classification algorithms in machine learning using Python.
