Naive Bayes Classifier


The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It is often used for high-dimensional data sets because it is simple and fast to train.

In this lesson, we will focus on an intuitive explanation of how Naive Bayes classifiers work, followed by an implementation in Python using a real-world dataset.

Bayes Theorem

Bayes' theorem describes the probability of an event occurring, based on prior knowledge of conditions that might be related to the event. It is stated mathematically as the following equation:


$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$

where,
– $A$ and $B$ are events and $P(B) \neq 0$,
– $P(A | B)$ is the conditional probability of event $A$ occurring given that $B$ is true, also known as the posterior probability,
– $P(B | A)$ is the conditional probability of event $B$ occurring given that $A$ is true, also known as the likelihood, and
– $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ on their own, without reference to each other; $P(A)$ is known as the prior probability and $P(B)$ as the evidence (or marginal probability).
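To make the formula concrete, here is a small worked example. The numbers are hypothetical and chosen only for illustration: $A$ is "a patient has a condition" and $B$ is "the test is positive".

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_a = 0.01               # P(A): prior probability of the condition
p_b_given_a = 0.95       # P(B | A): likelihood of a positive test given the condition
p_b_given_not_a = 0.05   # P(B | not A): probability of a false positive

# P(B): total probability of a positive test (law of total probability)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print("P(B)     = {:.4f}".format(p_b))
print("P(A | B) = {:.4f}".format(p_a_given_b))
```

With these made-up inputs the posterior comes out at roughly 16%, much lower than the 95% likelihood, because the prior $P(A)$ is small.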

How does Naive Bayes Classifier work?

Writing the output label $y$ as $L$ and the features $x_1, x_2, \dots, x_n$ as $F$, Bayes' theorem can be written as,

$$P(y | (x_1, x_2, \dots, x_n)) = P(L | F) = \frac{P(F | L) \, P(L)}{P(F)}$$

where,
– $P(L | F)$ is the probability of an output class ($y$) given the provided data or features $(x_1, x_2, \dots, x_n)$, known as the posterior probability,
– $P(F | L)$ is the probability of features $(x_1, x_2, \dots, x_n)$ given an output class ($y$), known as the likelihood, and
– $P(L)$ is the probability of observing the class $L$ before looking at the features, known as the class prior probability, and $P(F)$ is the probability of observing the features $F$ regardless of the class, known as the predictor prior probability (or evidence).

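The Naive Bayes Classifier gets the name “Naive” from its assumption that, within a class, every feature is independent of every other feature: once the class is known, the value of one feature tells us nothing about the value of another. Under this assumption, the likelihood factorizes into a product of per-feature terms,

$$P(F | L) = \prod_{i=1}^{n} P(x_i | L)$$

The classifier then calculates the probability of each class in the following simple steps: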

Step 1: Calculate the prior probabilities, i.e., the class prior probability ($P(L)$) for each class label and the predictor prior probability ($P(F)$).

Step 2: Find the likelihood of each feature for each class, i.e., $P(x_i | L)$, and combine them to get $P(F | L)$.

Step 3: Use Bayes Theorem to find the posterior probability, i.e., $P(L | F)$.

Step 4: Assign the input to the class with the highest posterior probability.
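The four steps can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements it: the toy data is made up, the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, and each feature is assumed to follow a Gaussian distribution within each class.

```python
import numpy as np

# Toy data, made up for illustration: two features, two classes.
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
              [3.0, 4.2], [3.2, 3.8], [2.9, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature mean and variance."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])          # Step 1: P(L)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A tiny constant keeps the variance strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X_new, classes, priors, means, variances):
    """Return the predicted class and the posterior P(L | F) for each row."""
    X_new = np.atleast_2d(X_new)
    diff = X_new[:, None, :] - means[None, :, :]
    # Step 2: log P(F | L) is the sum of per-feature Gaussian log densities.
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :],
        axis=2)
    # Step 3: add log P(L), then normalise; the normaliser plays the role of P(F).
    log_joint = log_likelihood + np.log(priors)[None, :]
    log_joint -= log_joint.max(axis=1, keepdims=True)  # for numerical stability
    posterior = np.exp(log_joint)
    posterior /= posterior.sum(axis=1, keepdims=True)
    # Step 4: choose the class with the highest posterior.
    return classes[posterior.argmax(axis=1)], posterior

params = fit_gaussian_nb(X, y)
labels, posterior = predict_gaussian_nb([[1.1, 2.0], [3.1, 4.1]], *params)
print(labels)
print(posterior.round(3))
```

Working in log space avoids multiplying many small probabilities together, which is a common design choice for Naive Bayes implementations.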

Naive Bayes Classifier in Python

Now that we know the basic idea of the Naive Bayes Classifier, we will discuss a step-wise Python implementation of the algorithm.

1. Importing necessary libraries

Before we begin to build our model, let us import some essential Python libraries for mathematical calculations, data loading, preprocessing, and model development and prediction.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain Python script
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions

2. Importing the dataset

For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset contains 30 numeric features for each of 569 breast cancer patients, along with their diagnosis (malignant or benign).

# Importing the dataset
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
print("Total samples in our dataset is: {}".format(df.shape[0]))

Total samples in our dataset is: 569

# Describing the dataset
df.describe()
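Before modelling, it can help to check how the target is encoded and how balanced the classes are. The following is a small sketch; the variable name `class_counts` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

# Number of rows and columns (30 features plus the target column)
print(df.shape)

# Label names, listed in the order of their integer codes
print(dataset.target_names)

# Number of samples per class
class_counts = df['target'].value_counts().sort_index()
print(class_counts)
```

In this dataset, the target value 0 stands for malignant and 1 for benign, which is worth keeping in mind when reading the confusion matrix later.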

3. Separating the features and target variable

After loading the data set, the independent variable ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the features and the target variable from the above dataset.

For this implementation example, we will only be using the ‘mean perimeter’ and ‘mean texture’ features but you can certainly use all of them.

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']

4. Splitting the dataset into training and test set

After separating the independent variables ($x$) and the dependent variable ($y$), the data is split into training and test sets so that the model can be trained on one part and evaluated on the other. We use the train_test_split() function of scikit-learn with test_size = 0.20, which holds out twenty percent of the available data as the test set and keeps the remaining eighty percent as the training set.

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25 )
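As a quick sanity check, the sizes of the resulting sets can be inspected. The sketch below also shows the optional `stratify` argument of `train_test_split()`, which the lesson does not use; it is included only as a variant that keeps the class proportions of the full dataset in both sets.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
x = df[['mean perimeter', 'mean texture']]
y = pd.Series(dataset.target)

# Same split as in the lesson
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25)
print("Train:", x_train.shape, "Test:", x_test.shape)

# Variant (not used in the lesson): preserve class proportions in both sets
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.20, random_state=25, stratify=y)
print("Share of class 1 overall:          {:.3f}".format(y.mean()))
print("Share of class 1 in stratified test: {:.3f}".format(y_test_s.mean()))
```

With 569 samples, a 20% test size gives 114 test samples and 455 training samples.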

5. Fitting the model to the training set

For building the model, we will be using the Gaussian Naive Bayes model. It is one of the simplest Naive Bayes classifiers: it assumes that, for each label, the values of every feature are drawn from a simple Gaussian distribution. After splitting the data into training and test sets, the model is fitted to the training data using the GaussianNB() class from scikit-learn.

# Fitting Naive Bayes to the Training set
model = GaussianNB()
model.fit(x_train, y_train)
GaussianNB()
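To see what fitting actually estimates, the learned parameters can be inspected. The sketch below rebuilds the same data and split so that it runs on its own, and compares the fitted attributes `class_prior_` and `theta_` with values computed by hand; the names `manual_priors` and `manual_means` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
x = df[['mean perimeter', 'mean texture']]
y = pd.Series(dataset.target)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25)

model = GaussianNB().fit(x_train, y_train)

print("Classes:", model.classes_)
print("Class priors P(L):", model.class_prior_)
print("Per-class feature means:\n", model.theta_)

# The same quantities computed directly from the training data
manual_priors = y_train.value_counts(normalize=True).sort_index().values
manual_means = x_train.groupby(y_train).mean().values

print("Priors match:", np.allclose(model.class_prior_, manual_priors))
print("Means match: ", np.allclose(model.theta_, manual_means))
```

In other words, training a Gaussian Naive Bayes model amounts to computing class frequencies and per-class feature statistics, which is why it is so fast.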

6. Predicting the test results

Finally, the trained model is used to predict the labels of the test set.

# Predicting the results
y_pred = model.predict(x_test)
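Because Naive Bayes is a probabilistic classifier, the posterior probabilities behind each prediction are also available through `predict_proba()`. The following sketch rebuilds the same data and split so that it runs on its own, and checks that the predicted label is simply the class with the highest posterior.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
x = df[['mean perimeter', 'mean texture']]
y = pd.Series(dataset.target)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25)

model = GaussianNB().fit(x_train, y_train)

# Posterior probability of each class for every test sample
proba = model.predict_proba(x_test)
print(proba[:5].round(3))

# The predicted label is the class with the highest posterior
y_pred = model.predict(x_test)
labels_from_proba = model.classes_[proba.argmax(axis=1)]
print("Same as predict():", np.array_equal(labels_from_proba, y_pred))
```

Each row of `proba` sums to 1, with one column per class in the order given by `model.classes_`.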

7. Evaluating the model

Let us now evaluate the model using the confusion matrix and calculate its classification accuracy. The confusion matrix counts, for every actual class, how many samples were assigned to each predicted class, so it shows where the model is right and where it confuses one class for another. Other metrics such as precision, recall and F1-score are given by the classification_report() function of scikit-learn.

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It tells us how trustworthy a positive prediction is. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F1-score is the harmonic mean of precision and recall and is often used as a metric in place of accuracy for imbalanced datasets.

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Gaussian NB Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))
Confusion Matrix
[[30  9]
 [ 5 70]]

Classification Report
              precision    recall  f1-score   support

           0       0.86      0.77      0.81        39
           1       0.89      0.93      0.91        75

    accuracy                           0.88       114
   macro avg       0.87      0.85      0.86       114
weighted avg       0.88      0.88      0.88       114

Gaussian NB Classification Accuracy of the model: 87.72%

Hence, the model is working quite well with an accuracy of 87.72%.
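The figures in the classification report can be reproduced by hand from the confusion matrix printed above. The sketch below treats class 1 as the positive class and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
matrix = np.array([[30, 9],
                   [5, 70]])

# With class 1 as the positive class
tn, fp = matrix[0]   # actual 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = matrix[1]   # actual 1: predicted 0 (false negative), predicted 1 (true positive)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = (tp + tn) / matrix.sum()

print("Precision: {:.2f}".format(precision))
print("Recall:    {:.2f}".format(recall))
print("F1-score:  {:.2f}".format(f1))
print("Accuracy:  {:.2f}%".format(accuracy * 100))
```

These values match the row for class 1 in the report (0.89, 0.93, 0.91) and the overall accuracy of 87.72%.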

8. Plotting the decision boundary

We will now plot the decision boundary of the model on test data.

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Naive Bayes (Test)")
plt.xlabel("mean perimeter")
plt.ylabel("mean texture")

Hence, the plot shows the distinction between the two classes as classified by the Naive Bayes Classification algorithm in Python.

Putting it all together

The final code for the implementation of Naive Bayes Classification in Python is as follows.

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain Python script
%matplotlib inline

# scikit-learn modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25 )

# Fitting Naive Bayes to the Training set
model = GaussianNB()
model.fit(x_train, y_train)

# Predicting the results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Gaussian NB Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plt.figure(figsize=(10,6))
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Naive Bayes (Test)")
plt.xlabel("mean perimeter")
plt.ylabel("mean texture")

In this lesson, we discussed Naive Bayes Classifier along with its implementation in Python.

We will now discuss some tree-based methods for classification such as the Decision Tree and Random Forest in the upcoming lessons.