
Supervised Machine Learning with Python (Course VI)

Random Forest Regression

In the previous lesson, we discussed Decision Trees and their implementation in Python. We also mentioned that the downside of Decision Trees is their tendency to overfit, as they are highly sensitive to small changes in the data.

In this lesson, we are going to learn about Random Forests, which are essentially collections of many decision trees. Random forests, or random decision forests, are an ensemble learning method: they combine multiple learners to obtain better predictive performance than any of the constituent learners could achieve alone. They are mostly used for solving classification and regression problems.

Random forests are therefore among the most powerful Machine Learning algorithms and are commonly used for various tasks and kinds of data. A random forest operates by constructing a multitude of decision trees at training time and outputting the mode of the trees' predicted classes (classification) or the mean of their predictions (regression).
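
As a rough illustration of these two aggregation rules, the sketch below combines the outputs of five trees for a single sample. The vote and prediction values are made up for the example and do not come from any trained model.

```python
import numpy as np

# Hypothetical outputs of five trees for one sample (made-up values)
class_votes = np.array([1, 0, 1, 1, 2])                       # predicted class labels
value_preds = np.array([510.0, 495.0, 530.0, 505.0, 520.0])   # predicted numeric values

# Classification: take the most frequent class among the trees (the mode)
forest_class = np.bincount(class_votes).argmax()

# Regression: take the average of the tree predictions (the mean)
forest_value = value_preds.mean()

print(forest_class, forest_value)
```

Here three of the five trees vote for class 1, so the mode is 1, while the regression output is simply the arithmetic mean of the five values.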

What is Random Forest Regression?

Random Forest Regression algorithms are a class of Machine Learning algorithms that combine multiple random decision trees, each trained on a subset of the data. Using multiple trees gives the algorithm stability and reduces variance. Random forest regression is a commonly used model because it works well on large data sets and on most kinds of data.

The algorithm creates each tree from a different sample of the input data. At each node, a different sample of features is considered for splitting, and the trees are built independently of one another, so they can be trained in parallel. The predictions of the individual trees are then averaged to produce a single result, which is the prediction of the Random Forest.
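
The procedure described above can be sketched by hand with individual decision trees. This is a simplified illustration, not scikit-learn's internal implementation: the synthetic data from make_regression, the choice of 10 trees, and max_features=2 are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real data set (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 10
trees = []

for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the data has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features=2 makes every split consider a random subset of 2 features
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Each tree predicts on its own; the forest prediction is the mean over trees
all_preds = np.stack([t.predict(X[:5]) for t in trees])  # shape (n_trees, 5)
forest_pred = all_preds.mean(axis=0)

print(forest_pred)
```

Because no tree depends on any other, the loop body could run in any order or concurrently, which is what makes the method easy to parallelize.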

Random Forest Regression in Python

This section will walk you through a step-wise Python implementation of the Random Forest prediction process that we just discussed.

1. Importing necessary libraries

First, let us import some essential Python libraries.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.ensemble import RandomForestRegressor # for building the model
2. Importing the dataset

The dataset consists of data on petrol consumption (in millions of gallons) for 48 US states. This value is modelled from several features: the petrol tax (in cents), the average income (in dollars), paved highways (in miles), and the proportion of the population with a driver’s license. We will download the data with requests, load it using the read_csv() function from the pandas module, and store it as a pandas DataFrame object.

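# Importing the dataset from the url of the data set
# (url must hold the address of the CSV file; it is not defined in this lesson)
data = requests.get(url).content
# Decoding the downloaded bytes and reading them into a DataFrame
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))
dataset.head()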
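
To see how downloaded bytes become a DataFrame without depending on a network call, here is a minimal sketch. The byte string stands in for the response content, and its three rows are made-up placeholder values, not rows from the actual petrol data set.

```python
import io
import pandas as pd

# Placeholder bytes standing in for requests.get(url).content (made-up values)
data = (
    b"Petrol_tax,Average_income,Petrol_Consumption\n"
    b"9.0,3500,540\n"
    b"8.0,4000,520\n"
    b"7.5,4200,560\n"
)

# Decode the bytes to text and wrap the text in a file-like object for read_csv
dataset = pd.read_csv(io.StringIO(data.decode("utf-8")))

print(dataset.head())
print(dataset.shape)
```

read_csv() accepts any file-like object, which is why wrapping the decoded text in io.StringIO is enough here.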
3. Separating the features and the target variable

After loading the dataset, the independent variables (x) and the dependent variable (y) need to be separated. Our concern is to model the relationships between the features (Petrol_tax, Average_income, etc.) and the target variable (Petrol_Consumption) in the dataset.

x = dataset.drop('Petrol_Consumption', axis = 1) # Features
y = dataset['Petrol_Consumption']  # Target
4. Splitting the data into a train set and a test set

We use the train_test_split() function of scikit-learn to split the data into a train set and a test set. We will use 20% of the available data as the testing set and the remaining 80% as the training set.

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)
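
To make the 80/20 split concrete, the sketch below applies the same call to placeholder arrays with 48 rows, the number of states in the data set. The array contents and the four feature columns are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 48 rows (contents are arbitrary, not the real data)
x = np.arange(48 * 4).reshape(48, 4)  # 4 made-up feature columns
y = np.arange(48)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=28
)

print(x_train.shape, x_test.shape)
```

Since 20% of 48 is 9.6, scikit-learn rounds the test set up to 10 rows, which leaves 38 rows for training.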
5. Fitting the model to the training dataset

After splitting the data, let us initialize a Random Forest Regression model and fit it to the training data. This is done with the help of the RandomForestRegressor class of scikit-learn.

# Initializing the Random Forest Regression model with 10 decision trees
model = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Fitting the Random Forest Regression model to the data
model.fit(x_train, y_train) 
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=0, verbose=0, warm_start=False)
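
Once a forest is fitted, its individual trees are available through the estimators_ attribute. The sketch below, which uses synthetic data from make_regression as a stand-in for the petrol data set, checks that the forest's prediction matches the mean of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real data set (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)

# estimators_ holds the fitted decision trees that make up the forest
tree_preds = np.stack([tree.predict(X) for tree in model.estimators_])
manual_mean = tree_preds.mean(axis=0)

print(len(model.estimators_))
print(np.allclose(manual_mean, model.predict(X)))
```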
6. Calculating the loss after training

Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model with the use of a cost function called the Root Mean Square Error (RMSE).

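$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where,
$y_i$ is the actual target value,
$\hat{y}_i$ is the predicted target value, and
$n$ is the total number of data points.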

The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the model’s predicted values are to the actual data points, expressed in the same units as the target variable. A lower RMSE indicates a better fit, which makes it a useful measure of the accuracy of the model’s predictions.
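
As a small worked example, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it by hand and with scikit-learn on three made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# By hand: errors are 1, 0, -2; squares are 1, 0, 4; their mean is 5/3
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# With scikit-learn: take the square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Both routes give the square root of 5/3, roughly 1.291.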

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)), '.3f'))
print("\nRMSE: ", rmse)
RMSE: 96.389

As we can see, the value of the error metric has decreased significantly, and this model performed considerably better than the single decision tree regression model that we studied in the previous lesson.
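
One way to run this kind of comparison yourself is to fit both models on the same split and compute the RMSE of each. The sketch below uses synthetic data from make_regression as an assumption, so its numbers will differ from the petrol data set, and which model scores better can vary from one data set to another.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the real data set (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=4, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=28
)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit each model on the same training set and score it on the same test set
scores = {}
for name, m in models.items():
    m.fit(X_train, y_train)
    scores[name] = float(np.sqrt(mean_squared_error(y_test, m.predict(X_test))))
    print(name, round(scores[name], 3))
```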

Putting it all together

The final code for the implementation of Random Forest Regression in Python is as follows.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.ensemble import RandomForestRegressor # for building the model

# Importing the dataset from the url of the data set (url must be set to the CSV address)
data = requests.get(url).content
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))

x = dataset.drop('Petrol_Consumption', axis = 1) # Features
y = dataset['Petrol_Consumption']  # Target

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

# Initializing the Random Forest Regression model with 10 decision trees
model = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Fitting the Random Forest Regression model to the data
model.fit(x_train, y_train)

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = float(format(np.sqrt(mean_squared_error(y_test, y_pred)),'.3f'))
print("\nRMSE:\n",rmse)

Hence, Random Forest Regression is a powerful Machine Learning algorithm that does not require a lot of parameter tuning and is capable of capturing a broader picture of the data. However, training the model on larger data sets can become time-consuming and require substantial computational power.
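
Because the trees are independent, one common way to reduce training time is to build them on several CPU cores through the n_jobs parameter of RandomForestRegressor. The sketch below is an illustration on synthetic data; the data, the 50 trees, and the choice of 2 workers are assumptions, and the actual speed-up depends on the machine.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for a larger data set (assumption for illustration)
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

# Same forest, built with one worker and with two workers
serial = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=1).fit(X, y)
parallel = RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=2).fit(X, y)

# With a fixed random_state, both forests should give matching predictions
same = np.allclose(serial.predict(X), parallel.predict(X))
print(same)
```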

In the upcoming final lesson of the regression section, we will be discussing a new class of regression algorithm which is based on Support Vector Machines.