In the previous lesson, we learned about Simple Linear Regression where we modeled the relationship between a target variable and an independent variable.
In practice, however, most regression problems involve more than one independent variable influencing the value of the dependent variable. In this lesson, we will discuss how to solve such problems using Multiple Linear Regression.
What is Multiple Linear Regression?
Multiple Linear Regression is a linear regression algorithm used in datasets containing a single dependent variable and multiple independent variables. It is also sometimes referred to as linear regression with multiple variables.
If $x_1, x_2, x_3, \dots, x_n$ is the set of independent variables, the Multiple Linear Regression algorithm models the value of the dependent variable ($y$) as,
$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
where,
$y$ is the dependent variable,
$ x_1, x_2, \dots, x_n$ are the independent variables,
$ w_1, w_2, \dots, w_n$ are the weights,
$b$ is the bias or the intercept, and
$n$ is a positive integer.
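In vector form, this is just the dot product of the weights with the features, plus the bias. Here is a minimal sketch with made-up numbers (three features), purely to illustrate the computation:

import numpy as np

# Made-up weights, bias, and one input sample with n = 3 features
w = np.array([2.0, -1.5, 0.5])   # w1, w2, w3
b = 4.0                          # bias (intercept)
x = np.array([1.0, 2.0, 3.0])    # x1, x2, x3

# y = w1*x1 + w2*x2 + ... + wn*xn + b
y = np.dot(w, x) + b
print(y)  # 2.0 - 3.0 + 1.5 + 4.0 = 4.5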
For example, consider the problem of pricing a house. The selling price of a house can depend on a wide range of factors, such as its location, the area it covers, the number of rooms, and the year it was built.
| House No. | Location | Area (in sq. feet) | Number of Rooms | Built Year | Price (in USD) |
| --- | --- | --- | --- | --- | --- |
| 1 | Kathmandu, Nepal | 100,000 | 5 | 2018 | 300,000 |
| 2 | Bhaktapur, Nepal | 80,000 | 4 | 2018 | 250,000 |
| 3 | New York, USA | 50,000 | 3 | 2019 | 100,000 |
| 4 | Kathmandu, Nepal | 120,000 | 6 | 2020 | ? |
In such problems, more than one variable must be considered to predict the value of the dependent variable. This is a typical setting in which Multiple Linear Regression is used.
(Note: The training process of a Multiple Linear Regression model is the same as that of a Simple Linear Regression model.)
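As a rough sketch of what that note means in code, assuming (as in many introductions to Simple Linear Regression) that training is done with gradient descent on the mean squared error: the only real change in the multiple-variable case is that the weight update becomes a vector operation. This is an illustrative sketch, not the exact procedure scikit-learn uses internally.

import numpy as np

def gradient_descent_step(X, y, w, b, lr=0.01):
    """One gradient-descent update for Multiple Linear Regression
    on the mean-squared-error loss.

    X: (m, n) feature matrix, y: (m,) targets,
    w: (n,) weights, b: scalar bias, lr: learning rate."""
    m = X.shape[0]
    y_pred = X @ w + b                  # predictions for all m samples
    error = y_pred - y
    grad_w = (2.0 / m) * (X.T @ error)  # gradient of the loss w.r.t. each weight
    grad_b = (2.0 / m) * error.sum()    # gradient of the loss w.r.t. the bias
    return w - lr * grad_w, b - lr * grad_b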
Multiple Linear Regression in Python
We have already discussed the concept of Multiple Linear Regression and its applications. We will now go through a step-by-step Python implementation of the algorithm.
1. Importing necessary libraries
First, let us import some essential Python libraries.
# Importing necessary libraries
import numpy as np               # for array operations
import matplotlib.pyplot as plt  # for visualizing data
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.datasets import load_boston               # for loading the dataset
from sklearn.model_selection import train_test_split   # for splitting the data
from sklearn.linear_model import LinearRegression      # for building the model
from sklearn.metrics import mean_squared_error         # for calculating the cost function
2. Importing the dataset
For this implementation example, we will be importing a sample dataset from scikit-learn, called the Boston housing prices dataset.
# Loading the dataset
dataset = load_boston()

# Getting the features (x) and target (y)
x = dataset.data
y = dataset.target

print("Total number of samples in the dataset: {}".format(x.shape[0]))
Total number of samples in the dataset: 506
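If you want more detail about what each feature represents, the loaded dataset also ships with a full text description in its DESCR attribute, which can be printed directly:

# Printing the dataset's built-in description (feature meanings, source, etc.)
print(dataset.DESCR)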
We can take a closer look at the data by converting it into a pandas DataFrame and using the head() function to display the first five rows.
# Importing pandas for working with DataFrames
import pandas as pd

# Creating a pandas DataFrame from the loaded dataset
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['TARGET'] = dataset.target

# Printing the first five rows of the DataFrame
df.head()
As we can see, there are 13 features ($ x_1, x_2, \dots, x_{13}$) in the dataset and a single target variable ($y$).
3. Splitting the dataset into a train set and a test set
We will use the train_test_split() function of scikit-learn to split the available data into a train set and a test set. We will use 20% of the available data as the testing set and the remaining data as the training set.
# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=28)
If you are confused about why we are splitting the data, please make sure to go through 'Introduction to Supervised Machine Learning'.
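As a quick sanity check, printing the shapes of the resulting arrays confirms the 80/20 split (506 samples divided into 404 for training and 102 for testing):

# Sanity check: shapes of the train and test splits
print("x_train:", x_train.shape)  # (404, 13)
print("x_test: ", x_test.shape)   # (102, 13)
print("y_train:", y_train.shape)  # (404,)
print("y_test: ", y_test.shape)   # (102,)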
4. Fitting the model to the training dataset
After splitting the data, let us initialize a Linear Regression model and fit it to the training data. This is done with the help of the LinearRegression class of scikit-learn.
# Initializing the Linear Regression model
model = LinearRegression()

# Fitting the Multiple Linear Regression model to the data
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
We have trained (fitted) the model in just two lines of code!
5. Summarizing the model
The goal of model training is to determine the values of the coefficients (weights) and the intercept (bias) that produce the hyperplane that best fits the data distribution. Let us print the values of these variables from the fitted model.
# x-coefficients
print("\nCoefficients:\n", model.coef_)

# Intercept
print("\nIntercept:\n", model.intercept_)
Coefficients:
 [-9.41693929e-02  4.02843274e-02  4.38808541e-02  2.45921683e+00
 -1.66514077e+01  4.55748564e+00 -3.02324498e-03 -1.27668975e+00
  2.80805954e-01 -1.16199877e-02 -1.01204495e+00  1.00501337e-02
 -4.83886151e-01]

Intercept:
 30.72849196987436
Since we have 13 features in the training dataset, there are 13 different coefficients (weights), i.e., $w_1, w_2, \dots, w_{13}$.
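To see which weight belongs to which feature, we can pair each coefficient with its feature name:

# Pairing each feature name with its learned weight
for name, weight in zip(dataset.feature_names, model.coef_):
    print("{:>8}: {: .5f}".format(name, weight))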
6. Calculating the loss after training
Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model, using a cost function called the Root Mean Square Error (RMSE).
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where,
$y_i$ is the actual target value,
$\hat{y_{i}}$ is the predicted target value, and
$n$ is the total number of data points.
The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the actual data points are to the model's predicted values. A low RMSE value indicates a better fit, making RMSE a good measure of the accuracy of the model's predictions.
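To connect the formula to the code, here is how RMSE could also be computed directly with NumPy; it should give the same value as the scikit-learn-based calculation below:

# Computing RMSE manually from its formula
def rmse_manual(y_true, y_predicted):
    return np.sqrt(np.mean((y_true - y_predicted) ** 2))

Once y_pred has been computed in the next step, rmse_manual(y_test, y_pred) should match the reported RMSE.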
# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error) as the cost function
rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
print("\nRMSE:\n", rmse)
RMSE: 5.494
An RMSE value of 5.494 indicates that there is some loss in the model. This is quite normal, since we are trying to model the relationship between 13 different features and the target variable, and a linear model may not fit all the data points exactly.
That said, various methods exist to further reduce the model's loss, but we will not discuss them in this lesson.
7. Visualizing the results
Let us now visualize the test set results by plotting the actual target values against the predicted target values to see how well the model fits.
# Plotting the actual target values in the test set vs the predicted target values
plt.scatter(y_test, y_pred)
plt.xlabel('Test data')
plt.ylabel('Predicted Y')
Although the predicted target values and the actual target values are not exactly the same, the above graph looks roughly linear, and our Multiple Linear Regression model seems to be performing reasonably well.
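One optional tweak, if you want to make this comparison easier to read, is adding a y = x reference line; points lying on the line correspond to perfect predictions:

# Optional: scatter plot with a y = x reference line
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # perfect-prediction line
plt.xlabel('Test data')
plt.ylabel('Predicted Y')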
Putting it all together
The final code for the implementation of Multiple Linear Regression in Python is as follows.
# Importing necessary libraries
import numpy as np               # for array operations
import matplotlib.pyplot as plt  # for visualizing data
%matplotlib inline

# scikit-learn for model building and validation
from sklearn.datasets import load_boston               # for loading the dataset
from sklearn.model_selection import train_test_split   # for splitting the data
from sklearn.linear_model import LinearRegression      # for building the model
from sklearn.metrics import mean_squared_error         # for calculating the cost function

# Loading the dataset
dataset = load_boston()

# Getting the features (x) and target (y)
x = dataset.data
y = dataset.target

print("Total number of samples in the dataset: {}".format(x.shape[0]))

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=28)

# Initializing the Linear Regression model
model = LinearRegression()

# Fitting the Multiple Linear Regression model to the data
model.fit(x_train, y_train)

# x-coefficients
print("\nCoefficients:\n", model.coef_)

# Intercept
print("\nIntercept:\n", model.intercept_)

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error) as the cost function
rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
print("\nRMSE:\n", rmse)

# Plotting the actual target values in the test set vs the predicted target values
plt.scatter(y_test, y_pred)
plt.xlabel('Test data')
plt.ylabel('Predicted Y')
In this lesson, we discussed the basics of Multiple Linear Regression along with its implementation in Python. In the next lesson, we will discuss Polynomial Regression.