In the previous lessons, we discussed classification algorithms that classify data using a specific set of rules or functions that model the data distribution. In this lesson, we will implement classification using one of the most widely used algorithms in machine learning: the Decision Tree Classifier.
A decision tree is a supervised machine learning algorithm that can be used for both regression and classification problems. The algorithm is based on a decision support tool that uses a tree-like model of decisions and their possible consequences.
You can imagine a decision tree as a flowchart-like structure described by a few standard terms: the root node is the topmost node from which the first split is made, decision nodes test a condition on a feature and branch into further sub-trees, and leaf (terminal) nodes hold the final predicted classes.
The construction of a decision tree classifier usually works top-down: at each step, the attribute that best splits the current set of samples is chosen. The 'best split' is determined by a pre-defined splitting criterion, and there are multiple methods for calculating it. We will discuss a few of them in this lesson.
Each sub-tree of the decision tree model can be represented as a binary tree, where a decision node splits into two child nodes based on a condition. Decision tree classifiers have a target variable with a discrete set of values, and each terminal node represents a predicted class.
The accuracy of a decision tree depends on the splits made, and the choice of splitting criterion can make a large difference. Let us take a look at some commonly used splitting criteria for a decision tree classifier.
Entropy is a measure of the randomness of the samples in a given split. If the samples are completely homogeneous, the entropy is zero; if the samples are divided equally between the classes (for example, a 50-50 split between two classes), the entropy is one.
As the tree grows deeper, the subsets become more homogeneous and the randomness decreases.
Mathematically, the entropy of a single variable with $n$ distinct classes is calculated as:
$$ \text{Entropy} = - \sum_{i=1}^{n} p_i \log_2 p_i $$
Example problem: Find the entropy of the variable $X$, $$X = \{True, True, False, True, False, False, True, True, True \}$$
Solution:
Total number of samples = 9
Number of 'True' instances = 6, and,
Number of 'False' instances = 3
Now, calculating the entropy of variable $X$,
$$ \text{Entropy}(X) = - \sum_{i=1}^{n} p_i \log_2 p_i $$
$$ = -[(p_1 \log_2 p_1) + (p_2 \log_2 p_2)] $$
$$ = -\left[\left(\dfrac{6}{9}\log_2\dfrac{6}{9}\right) + \left(\dfrac{3}{9}\log_2\dfrac{3}{9}\right)\right] $$
$$ \approx 0.918 $$
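We can reproduce this calculation in Python. The helper below is only an illustrative sketch (it is not part of the scikit-learn implementation that follows): it counts the class frequencies with NumPy and applies the entropy formula directly.

```python
import numpy as np

def entropy(labels):
    """Entropy of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return -np.sum(probabilities * np.log2(probabilities))

X = [True, True, False, True, False, False, True, True, True]
print(round(entropy(X), 3))  # 0.918
```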
Entropy for two or more variables (the weighted entropy of a split) is calculated as:
$$ \text{Entropy} = \sum_{n \in X} P(n) E(n)$$
Example problem: Find the entropy of the following data, i.e. $\text{E(Gender, SUV Purchase)}$
| Gender | SUV Purchase = Yes | SUV Purchase = No | Total |
| --- | --- | --- | --- |
| Male | 3 | 4 | 7 |
| Female | 1 | 2 | 3 |
| Total | 4 | 6 | 10 |
Solution:
Here, Entropy is calculated as,
$$ \text{E(Gender, SUV Purchase)} = \sum_{n \in X} P(n) E(n)$$
$$ = P(\text{Male}) \times E(3,4) + P(\text{Female}) \times E(1,2) $$
$$ = \dfrac{7}{10} \times 0.985 + \dfrac{3}{10} \times 0.918 $$
$$ \approx 0.965 $$
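Continuing the illustrative sketch from above (reusing the `entropy` helper), the weighted entropy of the split on Gender can be computed by weighting each group's entropy by its share of the samples; the counts below are taken from the table.

```python
# Class labels per gender, taken from the table above
groups = {
    'Male':   [1] * 3 + [0] * 4,   # 3 purchases, 4 non-purchases
    'Female': [1] * 1 + [0] * 2,   # 1 purchase, 2 non-purchases
}

total = sum(len(labels) for labels in groups.values())

# Weighted sum of the entropies of the two groups
weighted_entropy = sum(
    (len(labels) / total) * entropy(labels) for labels in groups.values()
)
print(round(weighted_entropy, 3))  # 0.965
```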
Information Gain is another popular splitting criterion used in decision trees: it measures the decrease in entropy that results from splitting the dataset on an attribute. Since it has an inverse relation with entropy, the more homogeneous the resulting subsets, the higher the information gain.
For a set of instances $S$, if $A$ is an attribute, $S_{v}$ is the subset of $S$ for which $A = v$, and $Values(A)$ is the set of all possible values of $A$, then the information gain is given by:
$$ \text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in Values(A)} \dfrac{|S_{v}|}{|S|} \times \text{Entropy}(S_{v}) $$
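Continuing the same sketch, the information gain from splitting on Gender is the entropy of the whole dataset minus the weighted entropy computed above. For the SUV purchase data, the gain turns out to be very small, which suggests Gender is a weak splitting attribute here.

```python
# Entropy of the full dataset: 4 purchases and 6 non-purchases out of 10
parent = [1] * 4 + [0] * 6
parent_entropy = entropy(parent)                      # roughly 0.971

# Information gain of splitting on Gender
information_gain = parent_entropy - weighted_entropy
print(round(information_gain, 3))                     # roughly 0.006
```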
Gini impurity measures the probability that a randomly chosen sample would be misclassified if it were labelled according to the class distribution of the split. It is a simple and popular splitting method for decision trees. When selecting which feature to place at the root (or at any node) of the tree, the Gini impurity of each candidate feature is calculated and compared; since it measures misclassification, the feature with the lowest Gini impurity is preferred.
Mathematically, Gini impurity is calculated as,
$$ \text{Gini Impurity} = 1 - \sum_{i=1}^{n} p_{i}^{2} $$
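Following the same pattern as the entropy helper, a minimal sketch for Gini impurity looks like this (again purely illustrative; scikit-learn computes this internally when `criterion='gini'` is used):

```python
def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1 - np.sum(probabilities ** 2)

# Gini impurity of the variable X from the entropy example
X = [True, True, False, True, False, False, True, True, True]
print(round(gini_impurity(X), 3))  # 0.444
```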
Now that we know the basic idea of decision trees, let us discuss a step-wise Python implementation of the algorithm.
Before we begin to build the model, let us import some essential Python libraries for mathematical calculations, data loading, preprocessing, and model development and prediction.
```python
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions
```
For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset consists of data related to breast cancer patients and their diagnosis (malignant or benign).
```python
# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

df.head()
```
print("Total samples in our dataset is: {}".format(df.shape[0]))
Total samples in our dataset is: 569
```python
# Describe the dataset
df.describe()
```
After loading the dataset, the independent variables ($x$) and the dependent variable ($y$) need to be separated. Our goal is to find the relationship between the features and the target variable in the above dataset.
For this implementation example, we will only be using the ‘mean perimeter’ and ‘mean texture’ features, but you can certainly use all of them.
```python
# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']
```
After separating the independent variables ($x$) and the dependent variable ($y$), these values are split into train and test sets to train and evaluate the model. We use the train_test_split() function of scikit-learn to split the available data in an 80-20 ratio: twenty percent of the available data is used as the test set and the remaining data as the train set.
```python
# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)
```
After splitting the data into training and test sets, the Decision Tree Classifier model is fitted to the training data using the DecisionTreeClassifier() class from scikit-learn.
```python
# Fitting Decision Tree Classifier to the Training set
model = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
model.fit(x_train, y_train)
```
DecisionTreeClassifier(random_state=0)
Finally, the model is tested on the test data to get the predictions.
```python
# Predicting the results
y_pred = model.predict(x_test)
```
Let us now evaluate the model using a confusion matrix and calculate its classification accuracy. The confusion matrix summarizes the performance of the model by comparing the predicted classes with the actual classes. Other metrics, such as precision, recall, and F1-score, are given by the classification_report() function of scikit-learn.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations; it tells us how reliable the positive predictions are. Recall is the ratio of correctly predicted positive observations to all observations that actually belong to the positive class. The F1 score is the harmonic mean of precision and recall and is often used as a metric in place of accuracy for imbalanced datasets.
```python
# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Decision Tree Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))
```
```
Confusion Matrix
[[28 11]
 [ 8 67]]

Classification Report
              precision    recall  f1-score   support

           0       0.78      0.72      0.75        39
           1       0.86      0.89      0.88        75

    accuracy                           0.83       114
   macro avg       0.82      0.81      0.81       114
weighted avg       0.83      0.83      0.83       114

Decision Tree Classification Accuracy of the model: 83.33%
```
Hence, the model is working quite well with an accuracy of 83.33%.
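As a quick sanity check, the precision, recall, and F1-score reported for class 1 can be recomputed by hand from the confusion matrix (treating class 1 as the positive class; rows of the matrix are actual classes and columns are predictions):

```python
# From the confusion matrix: [[28 11]
#                             [ 8 67]]
tp, fp, fn = 67, 11, 8                               # counts for class 1

precision = tp / (tp + fp)                           # 67 / 78 ≈ 0.86
recall = tp / (tp + fn)                              # 67 / 75 ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.88

print(round(precision, 2), round(recall, 2), round(f1, 2))
```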
We will now plot the decision boundary of the model on test data.
```python
# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Decision Tree Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
```
Hence, the plot shows the distinction between the two classes as classified by the Decision Tree Classification algorithm in Python.
The final code for the implementation of Decision Tree Classification in Python is as follows.
```python
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)

# Fitting Decision Tree Classifier to the Training set
model = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
model.fit(x_train, y_train)

# Predicting the results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Decision Tree Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Decision Tree Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
```
In this lesson, we discussed Decision Tree Classifier along with its implementation in Python.
The Decision Tree Classifier is an effective model, but it often overfits as the tree grows deeper. The problem of overfitting can be reduced by tuning parameters such as the maximum depth and the minimum samples per leaf, or by using ensemble algorithms like Random Forest, which we will discuss in the next chapter.
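As an illustrative sketch (the parameter values below are arbitrary, not tuned), a pruned tree can be obtained by constraining the growing process, reusing the training and test sets from the implementation above:

```python
# Illustrative only: limit depth and leaf size to reduce overfitting
pruned_model = DecisionTreeClassifier(criterion = 'gini',
                                      max_depth = 3,
                                      min_samples_leaf = 5,
                                      random_state = 0)
pruned_model.fit(x_train, y_train)

pruned_accuracy = accuracy_score(y_test, pruned_model.predict(x_test))
print('Pruned Decision Tree accuracy: {:.2f}%'.format(pruned_accuracy * 100))
```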