Decision Tree Regression



So far, we have discussed regression algorithms that fit the distribution of the data by adjusting a function's parameters (weights and biases).

In this lesson, you will be introduced to a different kind of Machine Learning algorithm, called decision tree regression.


What is a Decision Tree?

A decision tree is one of the most frequently used Machine Learning algorithms for solving regression as well as classification problems. As the name suggests, the algorithm uses a tree-like model of decisions to predict either a target value (regression) or a target class (classification). Before diving into how decision trees work, let us first get familiar with their basic terminology:

[Figure: structure of a decision tree, showing the root node, decision nodes, and leaf/terminal nodes]
  • Root Node: The topmost node of the tree, representing the entire set of data points.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Decision Node: A node that is split into further sub-nodes.
  • Leaf / Terminal Node: A node that does not split any further. Leaf nodes hold the final output of the tree.
  • Branch / Sub-Tree: A subsection of the entire tree.
  • Parent and Child Node: A node that is divided into sub-nodes is called the parent node, and its sub-nodes are called child nodes. In the figure above, the decision node is the parent of the terminal nodes (its children).
  • Pruning: Removing the sub-nodes of a decision node. Pruning is often used to prevent a decision tree from overfitting.

How does a Decision Tree work?

Splitting starts at the root node and proceeds down a branched tree until it reaches a leaf (terminal) node, which contains the prediction, i.e., the final outcome of the algorithm. Decision trees are usually constructed top-down, choosing at each step the variable that best splits the set of items. Each sub-tree of the decision tree model can be represented as a binary tree, where a decision node splits into two child nodes based on a condition.
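
To make this concrete, here is a minimal sketch of the prediction path through a regression tree. The feature names come from the dataset used later in this lesson, but the split thresholds and leaf values are purely hypothetical:

# Hypothetical hand-written regression tree (illustrative only):
# the thresholds and leaf values below are made up, not learned from data
def predict_one(row):
    if row["Petrol_tax"] <= 7.5:           # root (decision) node
        if row["Average_income"] <= 4300:  # decision node
            return 620.0                   # leaf: mean target of training points that land here
        return 550.0                       # leaf
    return 480.0                           # leaf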

Decision trees in which the target variable takes continuous values (typically real numbers) are called regression trees; these are the focus of this lesson. If the target variable takes a discrete set of values, the trees are called classification trees.
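
In a regression tree, each leaf predicts the mean target value of the training points that reach it, and a common way to choose the "best" split is to minimize the variance (equivalently, the squared error) of the target within the resulting sub-nodes. The helper below is a simplified sketch of this idea for a single feature; it is illustrative only, not scikit-learn's actual implementation:

# Simplified sketch of variance-reduction splitting for one feature
# (illustrative only; real implementations are more efficient and general)
import numpy as np

def best_threshold(feature, target):
    best_t, best_cost = None, np.inf
    for t in np.unique(feature)[:-1]:  # candidate thresholds (exclude the maximum)
        left, right = target[feature <= t], target[feature > t]
        # cost = total within-node squared error (lower is better)
        cost = len(left) * left.var() + len(right) * right.var()
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t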


Decision Tree Regression in Python

We will now go through a step-by-step Python implementation of the decision tree regression algorithm that we just discussed.

1. Importing necessary libraries

First, let us import some essential Python libraries.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.tree import DecisionTreeRegressor # for building the model

2. Importing the data set

For this problem, we will be loading a CSV dataset through an HTTP request (you can also download the dataset from here).

The dataset consists of data on petrol consumption (in millions of gallons) for 48 US states, along with several features: the petrol tax (in cents), the average income (in dollars), the length of paved highways (in miles), and the proportion of the population with a driver's licence. We will load the dataset using the read_csv() function from the pandas module and store it as a pandas DataFrame object.

# Importing the dataset from the url of the dataset
url = "https://drive.google.com/u/0/uc?id=1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_&export=download"
data = requests.get(url).content

# Reading the data
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))
dataset.head()
[Output: the first five rows of the petrol consumption dataset]
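
Before modelling, it can also help to take a quick look at the structure and summary statistics of the data. This step is optional:

# Optional: inspect the structure and summary statistics of the dataset
dataset.info()       # column names, dtypes, and non-null counts
dataset.describe()   # basic statistics for each numeric column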

3. Separating the features and the target variable

After loading the dataset, the features ($x$) and the dependent variable ($y$) need to be separated. Our aim is to model the relationship between the features (Petrol_tax, Average_income, etc.) and the target variable (Petrol_Consumption) in the dataset.

x = dataset.drop('Petrol_Consumption', axis = 1) # Features
y = dataset['Petrol_Consumption']  # Target
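
As a quick sanity check, the feature matrix should now have 48 rows and 4 columns, and the target should have 48 values:

# Quick sanity check on the shapes
print(x.shape, y.shape)  # expected: (48, 4) (48,)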

4. Splitting the data into a train set and a test set

We use the train_test_split() function of scikit-learn for splitting the data into a train set and a test set. We will be using 20% of the available data as the testing set and the remaining data as the training set.

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

5. Fitting the model to the training dataset

After splitting the data, let us initialize a decision tree regression model and fit it to the training data. This is done with the help of the DecisionTreeRegressor class of scikit-learn.

# Initializing the Decision Tree Regression model
model = DecisionTreeRegressor(random_state = 0)

# Fitting the Decision Tree Regression model to the data
model.fit(x_train, y_train) 
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=0, splitter='best')
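
By default, the tree is grown until every leaf is pure, which can lead to overfitting. If needed, the tree's growth can be constrained with hyperparameters such as max_depth and min_samples_leaf; the values below are illustrative, not tuned:

# Optional: a shallower tree to reduce overfitting (illustrative values)
pruned_model = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0)
pruned_model.fit(x_train, y_train)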

6. Calculating the loss after training

Let us now calculate the loss between the actual target values in the testing set and the values predicted by the model, using a cost function called the Root Mean Square Error (RMSE).

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}$$

where,
$y_i$ is the actual target value, 
$\hat{y_{i}}$ is the predicted target value, and
$n$ is the total number of data points.

The RMSE measures the absolute fit of the model to the data. In other words, it indicates how close the actual data points are to the model's predicted values. A lower RMSE indicates a better fit, making it a good measure of the accuracy of the model's predictions.

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
print("\nRMSE: ", rmse)
RMSE: 133.351
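
To connect the code back to the formula above, the same value can also be computed directly with NumPy:

# Equivalent manual computation of the RMSE following the formula above
rmse_manual = np.sqrt(np.mean((y_test.to_numpy() - y_pred) ** 2))
print("RMSE (manual):", round(rmse_manual, 3))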

7. Visualizing the decision tree

After building and evaluating the model, we can also view the tree structure of the model using the WebGraphviz tool. We export the fitted tree to a 'tree_structure.dot' file in the local working directory, then copy the contents of that file into the input area of the WebGraphviz tool, which generates a visualization of our decision tree.

from sklearn.tree import export_graphviz  

# export the decision tree model to a tree_structure.dot file 
# paste the contents of the file to webgraphviz.com
export_graphviz(model, out_file='tree_structure.dot',
                feature_names=['Petrol_tax', 'Average_income', 'Paved_Highways', 'Population_Driver_licence(%)'])
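
Alternatively, newer versions of scikit-learn (0.21 and later) can draw the tree directly with matplotlib, without any external tool:

# Alternative: plot the tree directly (requires scikit-learn >= 0.21)
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=list(x.columns), filled=True)
plt.show()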

This is a sample sub-branch of what our Decision Tree looks like.

[Figure: a sample sub-branch of the fitted decision tree]

Putting it all together

The final code for the implementation of Decision Tree Regression in Python is as follows.

# Importing the libraries
import numpy as np # for array operations
import pandas as pd # for working with DataFrames
import requests, io # for HTTP requests and I/O commands
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split # for splitting the data
from sklearn.metrics import mean_squared_error # for calculating the cost function
from sklearn.tree import DecisionTreeRegressor # for building the model

# Importing the dataset from the url of the data set
url = "https://drive.google.com/u/0/uc?id=1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_&export=download"
data = requests.get(url).content

# Reading the data
dataset = pd.read_csv(io.StringIO(data.decode('utf-8')))
dataset.head()

x = dataset.drop('Petrol_Consumption', axis = 1) # Features
y = dataset['Petrol_Consumption']  # Target

# Splitting the dataset into training and testing set (80/20)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 28)

# Initializing the Decision Tree Regression model
model = DecisionTreeRegressor(random_state = 0)

# Fitting the Decision Tree Regression model to the data
model.fit(x_train, y_train) 

# Predicting the target values of the test set
y_pred = model.predict(x_test)

# RMSE (Root Mean Square Error)
rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 3)
print("\nRMSE:", rmse)

# Visualizing the decision tree structure
from sklearn.tree import export_graphviz  

# export the decision tree model to a tree_structure.dot file 
# paste the contents of the file to webgraphviz.com
export_graphviz(model, out_file='tree_structure.dot',
                feature_names=['Petrol_tax', 'Average_income', 'Paved_Highways', 'Population_Driver_licence(%)'])

In this lesson, we discussed how decision tree regression works and implemented it in Python.


Decision trees, however, have a tendency to overfit and can generalize poorly. In the next lesson, we will discuss the Random Forest Regression algorithm, which combines a collection of decision trees and typically performs better than a single decision tree.
