Polynomial Regression


In the previous lessons, we discussed problems that could be modeled using a straight line. These models are called ‘linear’ regression models and work well on data that shows a linear relationship.

However, not all problems in practice follow a linear pattern. In this chapter, we will discuss Polynomial Regression, which fits a curve to data points that have a non-linear relationship between them.


What is Polynomial Regression?

Polynomial regression models a non-linear relationship between an independent variable $(x)$ and a dependent variable $(y)$ using an $n^{th}$ degree polynomial of $x$.

If $x$ is the independent variable and $y$ is the dependent variable, the Polynomial Regression model is represented as,

$$y = w_1 x + w_2 x^2 + w_3 x^3+ \dots + w_n x^n + b$$

where,
$y$ is the dependent variable,
$x$ is the independent variable,
$w_1, w_2, w_3, \dots, w_n$ are the weights,
$b$ is the bias or the intercept, and
$n$ is a positive integer.
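
To make the equation concrete, here is a minimal sketch that evaluates it with NumPy. The function name `polynomial_model` and the example weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def polynomial_model(x, weights, bias):
    """Evaluate y = w_1*x + w_2*x^2 + ... + w_n*x^n + b for each value in x."""
    x = np.asarray(x, dtype=float)
    # Exponents 1..n, one per weight
    powers = np.arange(1, len(weights) + 1)
    # Shape (len(x), n): column j holds x raised to the power j + 1
    features = x[:, None] ** powers
    return features @ np.asarray(weights, dtype=float) + bias

# Hypothetical example: y = 2x - 5x^2 + 1, i.e. w_1 = 2, w_2 = -5, b = 1
y_example = polynomial_model([0.0, 1.0, 2.0], weights=[2.0, -5.0], bias=1.0)
print(y_example)  # values: 1, -2, -15
```

Here $n$ is simply the number of weights passed in, so the same function covers any degree.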

Polynomial regression can be thought of as a special case of the Multiple Linear Regression model in which the independent variables are the powers $x, x^2, \dots, x^n$ of a single independent variable. The model is non-linear in $x$ but linear in the weights. Therefore, it is also sometimes called Polynomial Linear Regression.

(Note: The training process of a Polynomial Regression model is the same as that of a Linear Regression model, applied to the polynomial features of $x$.)
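
The note above can be illustrated with a short sketch: we build the columns $1, x, x^2$ by hand and solve an ordinary least-squares problem with `np.linalg.lstsq`. The data, the seed, and the names `x_demo`, `y_demo`, and `X_design` are assumptions made for this example, not part of the lesson's dataset.

```python
import numpy as np

# Hypothetical data generated from y = 1.5x - 2x^2 + 4 plus a little noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, 50)
y_demo = 1.5 * x_demo - 2.0 * x_demo**2 + 4.0 + rng.normal(0, 0.1, 50)

# Design matrix with one column per term: 1 (for the bias), x, and x^2
X_design = np.column_stack([np.ones_like(x_demo), x_demo, x_demo**2])

# Ordinary least squares, exactly as in multiple linear regression
coeffs, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
b_hat, w1_hat, w2_hat = coeffs

print("b:", b_hat, "w_1:", w1_hat, "w_2:", w2_hat)
```

The estimated values should land close to the bias and weights used to generate the data, even though the relationship between $x$ and $y$ is a curve.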


Polynomial Regression in Python

We have discussed why Polynomial Regression is used. We will now go through a step-by-step Python implementation of the algorithm.

1. Importing necessary libraries

First, let us import some essential Python libraries.

# Importing necessary libraries
import numpy as np # for array operations
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.linear_model import LinearRegression # for building the model
from sklearn.metrics import mean_squared_error # for calculating the cost function

# Importing libraries for polynomial feature transformation
from sklearn.preprocessing import PolynomialFeatures

# For creating a pipeline
from sklearn.pipeline import Pipeline

2. Creating the dataset

For this lesson, we will be creating a dummy data set that follows a curvilinear relationship using NumPy.

# Seeding the NumPy random number generator
np.random.seed(20)

# Creating a dummy dataset with curvilinear relationship using NumPy
x = 20 * np.random.normal(0, 1, 40)
y = 5*(-x**2) + np.random.normal(-80, 80, 40)

Let us plot the data set and see the relationship between the variables.

# Plotting the dataset
plt.figure(figsize = (10, 5))
plt.scatter(x, y, s = 15)
plt.xlabel('Predictor')
plt.ylabel('Target') 
plt.show()
Polynomial Regression - Figure 1

3. Using a Linear Regression model on the data

We have seen that the data does not follow a linear relationship. However, let us still try and fit the data using Linear Regression and see how the model performs.

# Initializing and training the Linear Regression model
model_lr = LinearRegression()
model_lr.fit(x.reshape(-1, 1), y.reshape(-1, 1))

# Predicting the values from the model
y_pred_lr = model_lr.predict(x.reshape(-1, 1))

# Plotting the predictions
plt.figure(figsize=(10, 5))
plt.scatter(x, y, s = 15)
plt.plot(x, y_pred_lr, color='r', label='Linear Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()
Polynomial Regression - Figure 2

Let us also calculate the RMSE loss for the Linear Regression model.

rmse = round(float(np.sqrt(mean_squared_error(y, y_pred_lr))), 3)
print("\nRMSE for Linear Regression: ", rmse)
RMSE for Linear Regression: 3374.525
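
For reference, RMSE is the square root of the mean of the squared differences between the actual and predicted values. The sketch below computes it by hand on a tiny, made-up pair of arrays (`y_true_demo` and `y_pred_demo` are hypothetical) and checks that it matches the scikit-learn based calculation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
squared_errors = (y_true_demo - y_pred_demo) ** 2

# RMSE computed manually: sqrt(mean of squared errors) = sqrt(5 / 3)
rmse_manual = np.sqrt(squared_errors.mean())

# The same quantity using scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same units as the target, it can be compared directly with the spread of the target values.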

Hence, we can clearly see that a Linear Regression model does not give the desired results on data with a non-linear relationship. The straight line fails to model the relationship between the variables, and hence the loss is large.

4. Using a Polynomial Regression model on the data

Let us now see how a Polynomial Regression model performs for this dataset.

The PolynomialFeatures class from scikit-learn converts a single independent feature, $x$, into polynomial features up to the $n^{th}$ degree (i.e., $x^0, x^1, x^2, \dots, x^n$, where the constant $x^0 = 1$ column is included by default) so that we can use the Linear Regression model on the transformed data.
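
To see what this transformation produces, here is a small sketch on two made-up input values. The names `x_small` and `x_expanded` are hypothetical, and the degree of 3 is chosen only to keep the output short.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical samples of a single feature, as a column vector
x_small = np.array([2.0, 3.0]).reshape(-1, 1)

# Expand each sample into the columns x^0, x^1, x^2, x^3
poly = PolynomialFeatures(degree=3)
x_expanded = poly.fit_transform(x_small)

print(x_expanded)
# Row for x = 2 holds 1, 2, 4, 8
# Row for x = 3 holds 1, 3, 9, 27
```

Each row now has four columns, and a Linear Regression model fitted on these columns learns one weight per power of $x$.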

# Creating pipeline and fitting it on data
input_features = [('polynomial', PolynomialFeatures(degree = 4)), ('model', LinearRegression())]
model_poly = Pipeline(input_features)
model_poly.fit(x.reshape(-1,1), y.reshape(-1,1))
Pipeline(memory=None, steps=[('polynomial', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

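Here, we set degree to 4, so the model uses polynomial features up to $x^4$. Since this dummy data was generated from a quadratic function, a degree of 2 would already be enough to capture the pattern; the extra $x^3$ and $x^4$ terms should simply receive weights close to zero.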
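
The effect of the degree can be checked with a rough sketch that refits the pipeline for a few degrees on the same dummy dataset and records the training RMSE for each. The step name 'regression' and the dictionary `rmse_by_degree` are labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreating the dummy dataset from this lesson
np.random.seed(20)
x = 20 * np.random.normal(0, 1, 40)
y = 5 * (-x**2) + np.random.normal(-80, 80, 40)
X = x.reshape(-1, 1)

# Fitting one pipeline per degree and storing its training RMSE
rmse_by_degree = {}
for degree in (1, 2, 4):
    pipeline = Pipeline([
        ('polynomial', PolynomialFeatures(degree=degree)),
        ('regression', LinearRegression()),
    ])
    pipeline.fit(X, y)
    rmse_by_degree[degree] = np.sqrt(mean_squared_error(y, pipeline.predict(X)))

for degree, rmse_value in rmse_by_degree.items():
    print(f"degree {degree}: training RMSE = {rmse_value:.3f}")
```

The large drop is expected between degree 1 and degree 2; going from degree 2 to degree 4 can only lower the training error slightly, because the higher-degree model contains the lower-degree one as a special case.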

Now let us predict the target values and visualize the results by plotting the data points and the predictions to see how well the model performed.

# Predicting the values
poly_pred = model_poly.predict(x.reshape(-1,1))

# Sorting the predictions by the predictor so the curve is drawn from left to right
order = np.argsort(x)
x_poly, y_pred_poly = x[order], poly_pred[order]

# Plotting the predictions
plt.figure(figsize=(10,6))
plt.scatter(x, y, s=15)
plt.plot(x_poly, y_pred_poly, color='g', label='Polynomial Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()
Polynomial Regression - Figure 3

Let us also calculate the RMSE loss for the Polynomial Regression model.

# RMSE (Root Mean Square Error)
rmse = round(float(np.sqrt(mean_squared_error(y, poly_pred))), 3)
print("\nRMSE for Polynomial Regression: ", rmse)
RMSE for Polynomial Regression: 78.021

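By comparison, we can see that the Polynomial Regression model follows the curvature of the data closely. As a result, its RMSE (78.021) is far lower than that of the Linear Regression model (3374.525) and is close to the standard deviation of 80 of the noise we added when creating the dataset. Note that both errors are measured on the same data the models were trained on.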
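
Another way to check the fit is to look at the weights the model has learned. The sketch below rebuilds the degree-4 pipeline on the same dummy dataset and reads the weights through `named_steps`; the step name 'regression' and the variable `weights` are chosen for this example. Since the data was generated as $y = -5x^2$ plus noise, the weight on $x^2$ should come out close to $-5$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreating the dummy dataset from this lesson
np.random.seed(20)
x = 20 * np.random.normal(0, 1, 40)
y = 5 * (-x**2) + np.random.normal(-80, 80, 40)

# Fitting a degree-4 polynomial regression pipeline
model_check = Pipeline([
    ('polynomial', PolynomialFeatures(degree=4)),
    ('regression', LinearRegression()),
])
model_check.fit(x.reshape(-1, 1), y)

# coef_ has one entry per polynomial feature: x^0, x^1, x^2, x^3, x^4
weights = model_check.named_steps['regression'].coef_
intercept = model_check.named_steps['regression'].intercept_

print("Weight on x^2:", weights[2])
print("Intercept:", intercept)
```

The entry for $x^0$ carries no information here because LinearRegression fits the intercept separately.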


Putting it all together

The final code for the implementation of Polynomial Regression in Python is as follows.

# Importing necessary libraries
import numpy as np # for array operations
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.linear_model import LinearRegression # for building the model
from sklearn.metrics import mean_squared_error # for calculating the cost function

# Importing libraries for polynomial feature transformation
from sklearn.preprocessing import PolynomialFeatures

# For creating a pipeline
from sklearn.pipeline import Pipeline

# Seeding the NumPy random number generator
np.random.seed(20)

# Creating a dummy dataset with curvilinear relationship using NumPy
x = 20 * np.random.normal(0, 1, 40)
y = 5*(-x**2) + np.random.normal(-80, 80, 40)

# Creating pipeline and fitting it on data
input_features = [('polynomial', PolynomialFeatures(degree = 4)), ('model', LinearRegression())]
model_poly = Pipeline(input_features)
model_poly.fit(x.reshape(-1,1), y.reshape(-1,1))

# Predicting the values
poly_pred = model_poly.predict(x.reshape(-1,1))

# Sorting the predictions by the predictor so the curve is drawn from left to right
order = np.argsort(x)
x_poly, y_pred_poly = x[order], poly_pred[order]

# Plotting the predictions
plt.figure(figsize=(10,6))
plt.scatter(x, y, s=15)
plt.plot(x_poly, y_pred_poly, color='g', label='Polynomial Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()

# RMSE (Root Mean Square Error)
rmse = round(float(np.sqrt(mean_squared_error(y, poly_pred))), 3)
print("\nRMSE for Polynomial Regression: ", rmse)

In this lesson, we learned about polynomial regression along with its implementation in Python. Let us now head on to the next lesson in this course to discuss some of the other kinds of regression algorithms used in Machine Learning.
