In the previous lessons, we discussed problems that could be modeled using a straight line. These models were called 'linear' regression models and worked well on data distributions that showed a linear relationship.
However, in practice, not all problems follow a linear pattern. In this chapter, we will discuss Polynomial Regression, which fits a curve to data points that have a non-linear relationship between them.
Polynomial regression is an approach to modeling the non-linear relationship between an independent variable $(x)$ and a dependent variable $(y)$ using an $n^{th}$-degree polynomial of $x$.
If $x$ is the independent variable and $y$ is the dependent variable, the Polynomial Regression model is represented as
$$y = w_1 x + w_2 x^2 + w_3 x^3+ \dots + w_n x^n + b$$
where,
$y$ is the dependent variable,
$x$ is the independent variable,
$w_1, w_2, w_3, \dots, w_n$ are the weights,
$b$ is the bias or the intercept, and
$n$ is a positive integer.
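For example, with $n = 2$ the model reduces to
$$y = w_1 x + w_2 x^2 + b$$
which fits a parabola to the data instead of a straight line.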
Polynomial Regression can be thought of as a special case of the Multiple Linear Regression model, where the features are the powers $x, x^2, \dots, x^n$ of a single independent variable. Therefore, it is also sometimes called Polynomial Linear Regression.
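To see why, define a new feature for each power of $x$, say $z_i = x^i$. The model can then be rewritten as
$$y = w_1 z_1 + w_2 z_2 + \dots + w_n z_n + b$$
which is exactly a Multiple Linear Regression model in the features $z_1, z_2, \dots, z_n$. In other words, the model is non-linear in $x$ but still linear in the weights.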
(Note: Because the model is linear in its weights, the training process of a Polynomial Regression model is the same as that of a Linear Regression model.)
We have discussed why Polynomial Regression is used. We will now go through a step-wise Python implementation of the algorithm.
First, let us import some essential Python libraries.
# Importing necessary libraries
import numpy as np               # for array operations
import matplotlib.pyplot as plt  # for data visualization
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.linear_model import LinearRegression  # for building the model
from sklearn.metrics import mean_squared_error     # for calculating the cost function

# Importing libraries for polynomial feature transformation
from sklearn.preprocessing import PolynomialFeatures

# For creating a pipeline
from sklearn.pipeline import Pipeline
For this lesson, we will create a dummy dataset with a curvilinear relationship using NumPy. The underlying relationship is quadratic ($y \approx -5x^2$) with added Gaussian noise.
# Seeding the NumPy random number generator
np.random.seed(20)

# Creating a dummy dataset with a curvilinear relationship using NumPy
x = 20 * np.random.normal(0, 1, 40)
y = 5 * (-x**2) + np.random.normal(-80, 80, 40)
Let us plot the data set and see the relationship between the variables.
# Plotting the dataset
plt.figure(figsize=(10, 5))
plt.scatter(x, y, s=15)
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.show()
From the plot, we can see that the data does not follow a linear relationship. However, let us still try to fit a Linear Regression model and see how it performs.
# Initializing and training the Linear Regression model
model_lr = LinearRegression()
model_lr.fit(x.reshape(-1, 1), y.reshape(-1, 1))

# Predicting the values from the model
y_pred_lr = model_lr.predict(x.reshape(-1, 1))

# Plotting the predictions
plt.figure(figsize=(10, 5))
plt.scatter(x, y, s=15)
plt.plot(x, y_pred_lr, color='r', label='Linear Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()
Let us also calculate the RMSE loss for the Linear Regression model.
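As a reminder, the RMSE (Root Mean Square Error) is the square root of the mean of the squared differences between the actual and predicted values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$
where $m$ is the number of data points and $\hat{y}_i$ is the model's prediction for $y_i$.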
rmse = float(format(np.sqrt(mean_squared_error(y, y_pred_lr)), '.3f'))
print("\nRMSE for Linear Regression: ", rmse)
RMSE for Linear Regression: 3374.525
Hence, we can clearly see that using a Linear Regression model on this kind of data does not give the desired results: the straight line fails to model the relationship between the variables, and hence the loss is huge.
Let us now see how a Polynomial Regression model performs for this dataset.
The PolynomialFeatures() transformer from scikit-learn converts a single independent feature, $x$, into polynomial features up to the $n^{th}$ degree (i.e., $x^1, x^2, \dots, x^n$) so that we can use the Linear Regression model on the transformed data.
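To make this concrete, here is a small illustration of the transformation for degree=3. The input values in x_demo are hypothetical, not part of the lesson's dataset; note that, by default, PolynomialFeatures also prepends a bias column of ones.

# A small illustration (hypothetical input values) of PolynomialFeatures
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[2.0], [3.0]])    # two samples of a single feature
poly = PolynomialFeatures(degree=3)  # include_bias=True by default
print(poly.fit_transform(x_demo))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]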
# Creating the pipeline and fitting it on the data
steps = [('polynomial', PolynomialFeatures(degree=4)),
         ('model', LinearRegression())]
model_poly = Pipeline(steps)
model_poly.fit(x.reshape(-1, 1), y.reshape(-1, 1))
Pipeline(memory=None, steps=[('polynomial', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
Here, we set degree=4 since we want polynomial features up to the 4th degree. The Pipeline chains the two steps: calling fit() first transforms the input with PolynomialFeatures and then trains LinearRegression on the transformed features, and predict() applies the same transformation before predicting.
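The degree is a hyperparameter we choose. As a minimal sketch (assuming the x, y, and imports defined above), one simple way to explore this choice is to compare the training RMSE for a few candidate degrees:

# A minimal sketch: comparing training RMSE across candidate degrees
for degree in [1, 2, 3, 4, 5]:
    candidate = Pipeline([('polynomial', PolynomialFeatures(degree=degree)),
                          ('model', LinearRegression())])
    candidate.fit(x.reshape(-1, 1), y.reshape(-1, 1))
    pred = candidate.predict(x.reshape(-1, 1))
    print(degree, round(float(np.sqrt(mean_squared_error(y, pred))), 3))

Keep in mind that the training RMSE can only decrease as the degree grows, so very high degrees risk overfitting; in practice, the degree is chosen by validating on held-out data.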
Now, let us predict the target values and visualize the results by plotting the data points together with the predictions to see how well the model performed. Note that we first sort the predictions by the predictor value so that matplotlib draws the curve as a single smooth line rather than connecting the points in their original, unsorted order.
# Predicting the values
poly_pred = model_poly.predict(x.reshape(-1, 1))

# Sorting the predicted values with respect to the predictor
sorted_zip = sorted(zip(x, poly_pred))
# * is used to unzip the sorted, zipped list
x_poly, y_pred_poly = zip(*sorted_zip)

# Plotting the predictions
plt.figure(figsize=(10, 6))
plt.scatter(x, y, s=15)
plt.plot(x_poly, y_pred_poly, color='g', label='Polynomial Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()
Let us also calculate the RMSE loss for the Polynomial Regression model.
# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y, poly_pred)), '.3f'))
print("\nRMSE for Polynomial Regression: ", rmse)
RMSE for Polynomial Regression: 78.021
By comparison, we can see that the Polynomial Regression model fits the data pattern well, and its error is far lower than that of the Linear Regression model.
The final code for the implementation of Polynomial Regression in Python is as follows.
# Importing necessary libraries
import numpy as np               # for array operations
import matplotlib.pyplot as plt  # for data visualization
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.linear_model import LinearRegression  # for building the model
from sklearn.metrics import mean_squared_error     # for calculating the cost function

# Importing libraries for polynomial feature transformation
from sklearn.preprocessing import PolynomialFeatures

# For creating a pipeline
from sklearn.pipeline import Pipeline

# Seeding the NumPy random number generator
np.random.seed(20)

# Creating a dummy dataset with a curvilinear relationship using NumPy
x = 20 * np.random.normal(0, 1, 40)
y = 5 * (-x**2) + np.random.normal(-80, 80, 40)

# Creating the pipeline and fitting it on the data
steps = [('polynomial', PolynomialFeatures(degree=4)),
         ('model', LinearRegression())]
model_poly = Pipeline(steps)
model_poly.fit(x.reshape(-1, 1), y.reshape(-1, 1))

# Predicting the values
poly_pred = model_poly.predict(x.reshape(-1, 1))

# Sorting the predicted values with respect to the predictor
sorted_zip = sorted(zip(x, poly_pred))
x_poly, y_pred_poly = zip(*sorted_zip)

# Plotting the predictions
plt.figure(figsize=(10, 6))
plt.scatter(x, y, s=15)
plt.plot(x_poly, y_pred_poly, color='g', label='Polynomial Regression')
plt.xlabel('Predictor')
plt.ylabel('Target')
plt.legend()
plt.show()

# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y, poly_pred)), '.3f'))
print("\nRMSE for Polynomial Regression: ", rmse)
In this lesson, we learned about Polynomial Regression along with its implementation in Python. Let us now move on to the next lesson in this course to discuss some of the other kinds of regression algorithms used in Machine Learning.