In this lesson, we will build an intuitive understanding of the K-Nearest Neighbours (KNN) algorithm for classification, which is commonly used for both binary and multiclass classification problems. We will then go through a detailed implementation of the algorithm in Python using a real-world dataset.
How does a K-Nearest Neighbours (KNN) Classifier work?
The K-Nearest Neighbours (KNN) Classifier assumes that data points with similar characteristics lie close to each other and follow a similar pattern.
Thus, to find the class of a new data point, we can simply look at the classes of the neighbouring K data points. The classification is done by a plurality vote of its neighbours, with the object being assigned to the class most common among its K nearest neighbours.
Let us suppose a multi-class problem with three classes (Class 1, 2, and 3) as shown in the figure below.
Now, let’s say we have to predict the class of a new data point (indicated by the white star in the graph below).
To classify the new data point, the algorithm computes the distance from the new data point to every point in the training data and selects the K nearest neighbours, i.e., the K data points that are closest to the new data point. Here, K is set to 4.
Among the K neighbours, the class with the most data points is predicted as the class of the new data point. For the above example, Class 3 (blue) has the most data points (2) inside the boundary of the KNN. Hence, the data point is classified as Class 3.
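The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in the lesson: the function name knn_predict and the toy data points are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=4):
    """Predict the class of x_new by a plurality vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(x_train - x_new, axis=1)
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbours and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three classes in two dimensions
x_train = np.array([
    [1.0, 1.0], [1.5, 2.0],              # Class 1
    [5.0, 1.0], [6.0, 2.0],              # Class 2
    [3.0, 5.0], [3.5, 6.0], [4.0, 5.5],  # Class 3
])
y_train = np.array([1, 1, 2, 2, 3, 3, 3])

new_point = np.array([3.2, 4.0])
predicted_class = knn_predict(x_train, y_train, new_point, k=4)
print(predicted_class)  # 3
```

For this toy data, the four nearest neighbours of the new point are three Class 3 points and one Class 1 point, so the vote goes to Class 3.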
Choosing a larger K value results in smoother decision boundaries and less complex models, although a K that is too large can underfit. Conversely, choosing a smaller K value generally results in more complex models that tend to overfit the training data.
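One way to see this trade-off is to compare training and test accuracy for a few values of K. The sketch below uses a synthetic two-class dataset from scikit-learn's make_moons, which is an assumption made purely for illustration; the K values are arbitrary choices as well.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (illustrative only, not the lesson's dataset)
x, y = make_moons(n_samples=300, noise=0.3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

train_accuracy = {}
test_accuracy = {}
for k in [1, 15, 101]:
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # Accuracy on the data the model has seen versus data it has not seen
    train_accuracy[k] = model.score(x_train, y_train)
    test_accuracy[k] = model.score(x_test, y_test)
    print(f"K = {k:>3}: train = {train_accuracy[k]:.3f}, test = {test_accuracy[k]:.3f}")
```

With K = 1, every training point is its own nearest neighbour, so the training accuracy is perfect. That says little about how the model generalises, which is why the test accuracy is the figure to compare across K values.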
K-Nearest Neighbours (KNN) Classification in Python
This section will guide you through a step-wise Python implementation of the classification process that we just discussed.
1. Importing necessary libraries
Let us start by importing some of the necessary libraries and modules required.
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions
2. Importing the data set
For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset consists of data related to breast cancer patients and their diagnosis (malignant or benign).
# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
print("Total samples in our dataset is: {}".format(df.shape[0]))
Total samples in our dataset is: 569
df.describe()
3. Separating the features and target variables
After loading the data set, the independent variable ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the features and the target variable from the above dataset.
For this implementation example, we will only be using the ‘mean perimeter’ and ‘mean texture’ features but you can certainly use all of them.
# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']
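Since KNN relies on distances, features measured on different scales contribute unequally to the result. The lesson imports StandardScaler but does not apply it, so the following is only a sketch of how the two selected features could be standardised. It is not part of the original workflow, and the results reported later in this lesson were obtained without scaling.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
features = ['mean perimeter', 'mean texture']
x = df[features]

# The two features span different ranges before scaling
print(x.describe().loc[['min', 'max']])

# Standardise each feature to zero mean and unit variance.
# In a full workflow the scaler would be fitted on the training split only
# and then used to transform the test split.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

print(np.round(x_scaled.mean(axis=0), 6))
print(np.round(x_scaled.std(axis=0), 6))
```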
4. Splitting the dataset into training and test sets
After separating the independent variables ($x$) and the dependent variable ($y$), the data is split into training and test sets so that the KNN model can be trained on one part and evaluated on the other. We use the train_test_split() function of scikit-learn to split the available data 80-20: twenty percent of the data is used as the test set and the remaining eighty percent as the training set.
# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)
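As a quick sanity check, the sizes of the resulting sets can be inspected. The sketch below repeats the loading and splitting steps so that it runs on its own; printing the class proportions is an extra step added here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

x = df[['mean perimeter', 'mean texture']]
y = df['target']

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25
)

# 569 samples in total: 455 for training and 114 for testing
print(x_train.shape, x_test.shape)

# Share of each class in the two sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

If the class proportions differ noticeably between the two sets, passing stratify=y to train_test_split() is one option for keeping them similar. The lesson itself does not use it.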
5. Fitting the model to the training set
After splitting the data into training and test sets, the KNN classification model is fitted on the training data using the KNeighborsClassifier() class from scikit-learn.
# Fitting KNN Classifier to the Training set
model = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
model.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
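The model above uses the Minkowski metric with p = 2, which is the ordinary Euclidean distance. To make the role of p concrete, here is a small standalone sketch; the helper minkowski_distance is written for illustration and is not a scikit-learn function.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the p-th powers of the absolute coordinate differences, then take the p-th root
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

print(minkowski_distance(point_a, point_b, p=2))  # Euclidean distance: 5.0
print(minkowski_distance(point_a, point_b, p=1))  # Manhattan distance: 7.0
```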
6. Predicting the test results
Finally, the fitted model is used to predict the classes of the test data.
# Predicting the results
y_pred = model.predict(x_test)
7. Evaluating the model
Let us now evaluate the model using a confusion matrix and calculate its classification accuracy. The confusion matrix tabulates the actual classes against the predicted classes, showing how many test samples of each class were classified correctly and incorrectly. Other metrics such as precision, recall, and F1-score are given by the classification_report() function of scikit-learn.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It measures how reliable the model's positive predictions are. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F1 score is the harmonic mean of precision and recall and is often used as a metric in place of accuracy for imbalanced datasets.
# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('KNN Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))
Confusion Matrix
[[31  8]
 [ 3 72]]

Classification Report
              precision    recall  f1-score   support

           0       0.91      0.79      0.85        39
           1       0.90      0.96      0.93        75

    accuracy                           0.90       114
   macro avg       0.91      0.88      0.89       114
weighted avg       0.90      0.90      0.90       114

KNN Classification Accuracy of the model: 90.35%
Hence, the model performs well on the test set, with an accuracy of 90.35%.
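To connect the definitions of precision, recall, and F1 score to the reported numbers, the metrics for class 1 can be recomputed by hand from the confusion matrix shown above. This sketch treats class 1 as the positive class and assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: actual, columns: predicted)
matrix = np.array([[31, 8],
                   [3, 72]])

# With class 1 as the positive class, the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)                          # 72 / 80
recall = tp / (tp + fn)                             # 72 / 75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / matrix.sum()                 # 103 / 114

print(f"Precision: {precision:.2f}")     # 0.90
print(f"Recall:    {recall:.2f}")        # 0.96
print(f"F1 score:  {f1:.2f}")            # 0.93
print(f"Accuracy:  {accuracy:.2%}")      # 90.35%
```

These values agree with the row for class 1 in the classification report and with the overall accuracy.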
8. Plotting the decision boundary
We will now plot the decision boundary of the model on test data.
# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using KNN Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
Hence, the plot shows the distinction between the two classes as classified by the KNN Classification algorithm in Python. The decision boundary is not a straight line as in logistic regression, because KNN does not fit a linear function; each prediction depends only on the training points in the local neighbourhood of the point being classified.
Putting it all together
The final code for the implementation of K-Nearest Neighbours (KNN) Classification in Python is as follows.
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target variable
y = df['target']

# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)

# Fitting KNN Classifier to the Training set
model = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
model.fit(x_train, y_train)

# Predicting the results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('KNN Classification Accuracy of the model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using KNN Classification (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
In this lesson, we discussed K-Nearest Neighbours (KNN) Classifier for binary classification along with its implementation in Python. In the next lesson of the course, we will be learning about the Naive Bayes Classifier and its implementation.