Logistic regression is one of the most commonly used classification techniques in Machine Learning. It is a mathematical model that uses a logistic function to model the relationship between a binary set of classes and the available features.
In simple words, a Logistic Regression algorithm is used to find the probability of a data point falling into a specific class. The keyword here is 'probability', since we will be trying to find how likely a data point is to belong to one class versus the other. So, the output of a Logistic Regression algorithm is always between 0 and 1.
In this lesson we will be discussing how the algorithm works, followed by its implementation in Python using a real-world dataset.
[latexpage]A logistic function or logistic curve is a common "S" shape (sigmoid curve), with the equation given as,
$$f(x) = \frac{L}{1 + e^{-k(x-x_0)}}$$
where,
$e$ is the natural logarithm base (also known as Euler's number),
$x_0$ is the x-value of the sigmoid's midpoint,
$L$ is the curve's maximum value, and
$k$ is the logistic growth rate or steepness of the curve.
A standard logistic function, also known as the sigmoid function, has the values $L=1$, $k=1$ and $x_0=0$. Hence, the function is given by,
$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} = \frac{1}{2} + \frac{1}{2} \tanh\left(\frac{x}{2}\right)$$
One thing to note is that the logistic curve above is asymptotic: its output approaches 0 as $x \to -\infty$ and 1 as $x \to \infty$, but it can never be exactly 0 or 1.
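The formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `logistic` and `sigmoid` are chosen here for illustration and are not part of any library.

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic curve f(x) = L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))

def sigmoid(x):
    """Standard logistic function: L = 1, k = 1, x0 = 0."""
    return logistic(x)

for x in (-6, -2, 0, 2, 6):
    s = sigmoid(x)
    # Equivalent form from the identity 1/2 + 1/2 * tanh(x / 2)
    tanh_form = 0.5 + 0.5 * math.tanh(x / 2)
    print(f"x={x:+d}  sigmoid={s:.6f}  tanh form={tanh_form:.6f}")
```

The two columns should agree, and the values move towards 0 and 1 at the extremes while staying strictly between them.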
To perform logistic regression, the logistic function is simply applied to the linear function that we had studied before and an output probability is obtained between 0 and 1.
Let us assume that variable $y$ is a linear function of a single explanatory variable $x$. A linear function $f(x)$ can be used to express $y$ as,
$$y = f(x) = w_0 + w_{1}x \quad \dots (i)$$
where,
the linear function in equation ($i$) is also called the logit (the log-odds of the predicted probability),
$y$ is the dependent variable,
$x$ is the independent (explanatory) variable,
$w_1$ is the weight or the slope, and
$w_0$ is the bias or intercept.
The logistic regression function $p(x)$ is then given by the sigmoid of the linear function as,
$$p(x) = \frac{1}{1 + e^{-f(x)}} = \frac{1}{1 + e^{-(w_0 + w_{1}x)}}$$
Logistic regression is used to determine the best estimates for the values of $w_0$ and $w_1$ such that the value of function $p(x)$ represents the predicted probability of the output.
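To make the relationship between the linear function and $p(x)$ concrete, here is a small sketch with NumPy. The weights `w0 = -4.0` and `w1 = 2.0` are hypothetical values picked for illustration, not estimates from any dataset. It also checks that taking the log-odds of $p(x)$ recovers the linear function, which is why that function is called the logit.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration
w0, w1 = -4.0, 2.0

def f(x):
    """Linear function (the logit): w0 + w1 * x."""
    return w0 + w1 * x

def p(x):
    """Logistic regression function: sigmoid of the linear function."""
    return 1.0 / (1.0 + np.exp(-f(x)))

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
probs = p(x)

# The log-odds log(p / (1 - p)) should equal the linear function f(x)
log_odds = np.log(probs / (1.0 - probs))

print(np.round(probs, 3))
print(np.allclose(log_odds, f(x)))
```

With these weights the probability crosses 0.5 at $x = 2$, the point where $w_0 + w_1 x = 0$.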
In a binary classification problem with two classes 'A' and 'B', if the output probability is greater than or equal to 0.5, the predicted class is 'A', and if it is less than 0.5, the predicted class is 'B'.
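The thresholding rule can be sketched in a couple of lines. The probabilities below are made-up values used only to show how the 0.5 cut-off assigns the labels.

```python
import numpy as np

# Hypothetical predicted probabilities of belonging to class 'A'
probs = np.array([0.10, 0.49, 0.50, 0.51, 0.93])

# Class 'A' when the probability is at least 0.5, otherwise class 'B'
labels = np.where(probs >= 0.5, "A", "B")
print(labels.tolist())  # ['B', 'B', 'A', 'A', 'A']
```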
Now that we know the basic idea of Logistic Regression, we will discuss a step-wise Python implementation of the algorithm.
Before we begin to develop the logistic regression model, let us import some essential Python libraries for mathematical calculations, data loading, preprocessing, and model development and prediction.
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions
For this problem, we will be loading the Breast Cancer dataset from scikit-learn. The dataset consists of data related to breast cancer patients and their diagnosis (malignant or benign).
# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()
print("Total samples in our dataset is: {}".format(df.shape[0]))
Total samples in our dataset is: 569
# Summary statistics of the DataFrame
df.describe()
After loading the dataset, the independent variables ($x$) and the dependent variable ($y$) need to be separated. Our concern is to find the relationships between the features and the target variable in the dataset.
For this implementation example, we will only be using the 'mean perimeter' and 'mean texture' features but you can certainly use all of them.
# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']
After separating the independent variables ($x$) and the dependent variable ($y$), these values are split into train and test sets to train and evaluate the model. We use the train_test_split() function of scikit-learn to make an 80-20 split: twenty percent of the available data is held out as the test set and the remaining data is used as the train set.
# Splitting the dataset into the training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)
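To illustrate what an 80-20 split means for the 569 samples in this dataset, here is a rough sketch of a shuffled split using NumPy. It is not scikit-learn's implementation and will not select the same rows as train_test_split(); the seed and variable names are assumptions made for this example. The test-set size is rounded up, which gives 114 test samples and 455 training samples.

```python
import math
import numpy as np

n_samples = 569      # number of rows in the breast cancer dataset
test_size = 0.20     # fraction held out for testing

# Seed picked arbitrarily for this sketch; it does not reproduce
# the rows chosen by scikit-learn with random_state = 25
rng = np.random.default_rng(25)
indices = rng.permutation(n_samples)   # shuffled row positions

# Round the test-set size up, then split the shuffled positions
n_test = math.ceil(test_size * n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))  # 455 114
```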
After splitting the data into training and test sets, the Logistic Regression model is fitted to the training data using the LogisticRegression() class from scikit-learn.
# Fitting Logistic Regression to the Training set
model = LogisticRegression(random_state = 0, solver='lbfgs')
model.fit(x_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=0, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
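To give a feel for what fitting involves, the sketch below estimates $w_0$ and $w_1$ on a small synthetic one-dimensional dataset using plain batch gradient descent on the log-loss. This is a deliberately simplified stand-in: scikit-learn's 'lbfgs' solver with an L2 penalty works differently, and the data, learning rate and iteration count here are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (hypothetical): class 0 centred at -2, class 1 at +2
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = 0.0, 0.0       # start from zero weights
learning_rate = 0.1     # assumed step size

for _ in range(2000):
    p = sigmoid(w0 + w1 * x)            # current predicted probabilities
    # Gradient of the mean log-loss with respect to w0 and w1
    grad_w0 = np.mean(p - y)
    grad_w1 = np.mean((p - y) * x)
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

# Classify with the 0.5 threshold and measure accuracy on the same data
pred = (sigmoid(w0 + w1 * x) >= 0.5).astype(int)
train_accuracy = np.mean(pred == y)
print(f"w0={w0:.3f}, w1={w1:.3f}, training accuracy={train_accuracy:.3f}")
```

The learned slope should come out positive, since larger values of $x$ are more likely to belong to class 1 in this synthetic data.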
Finally, the trained model is used to predict the classes of the samples in the test set.
# Predicting the results
y_pred = model.predict(x_test)
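The predict() method returns class labels, but the underlying probabilities are also available through predict_proba(). The following self-contained sketch repeats the steps above with NumPy arrays instead of a DataFrame (an assumption made to keep it short) and checks that thresholding the class 1 probability at 0.5 gives the same labels as predict().

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()

# Pick the same two features by name from the raw data array
names = list(dataset.feature_names)
cols = [names.index("mean perimeter"), names.index("mean texture")]
x = dataset.data[:, cols]
y = dataset.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=25
)

model = LogisticRegression(random_state=0, solver="lbfgs")
model.fit(x_train, y_train)

# One row per test sample; columns follow model.classes_ (here 0 and 1)
proba = model.predict_proba(x_test)
prob_class_1 = proba[:, 1]

# Thresholding at 0.5 should reproduce the labels from predict()
manual_labels = (prob_class_1 >= 0.5).astype(int)
y_pred = model.predict(x_test)

print(proba[:5].round(3))
print(np.array_equal(manual_labels, y_pred))
```

In this dataset, target 0 corresponds to malignant and target 1 to benign, so the second column is the predicted probability of a benign diagnosis.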
Let us now evaluate the model using a confusion matrix and calculate its classification accuracy. The confusion matrix tabulates the actual classes against the predicted classes, showing how many samples of each class were classified correctly and how many were not. Other metrics such as precision, recall and F1-score are given by the classification_report() function of scikit-learn.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It tells us how many of the model's positive predictions are actually correct. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The F1 score is the harmonic mean of precision and recall and is often used as a metric in place of accuracy for imbalanced datasets.
# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Logistic Regression Accuracy of the model: {:.2f}%'.format(accuracy*100))
Confusion Matrix
[[30  9]
 [ 4 71]]

Classification Report
              precision    recall  f1-score   support

           0       0.88      0.77      0.82        39
           1       0.89      0.95      0.92        75

    accuracy                           0.89       114
   macro avg       0.88      0.86      0.87       114
weighted avg       0.89      0.89      0.88       114

Logistic Regression Accuracy of the model: 88.60%
Hence, the model is working quite well, classifying 88.60% of the test samples correctly using only two features.
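The metrics in the report can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and treats class 1 as the positive class; in scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above
# rows: actual class (0, 1), columns: predicted class (0, 1)
cm = np.array([[30, 9],
               [4, 71]])

# With class 1 as the positive class
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 71 / 80
recall = tp / (tp + fn)      # 71 / 75
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = (tp + tn) / cm.sum()                       # 101 / 114

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

These values correspond to the class 1 row of the classification report and to the reported accuracy of 88.60%.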
We will now plot the decision boundary of the model on test data.
# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Logistic Regression (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
Hence, the plot shows the distinction between the two classes as classified by the Logistic Regression algorithm in Python. As we can see, since the Logistic Regression algorithm uses a linear function, the obtained boundary line is also linear.
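The reason the boundary is a straight line can be seen directly from the model: with two features, the predicted probability equals 0.5 exactly where $w_0 + w_1 x_1 + w_2 x_2 = 0$, which is the equation of a line. The sketch below uses hypothetical weights (not the ones fitted above) to verify this numerically.

```python
import numpy as np

# Hypothetical weights for a two-feature model, chosen for illustration
w0, w1, w2 = 12.0, -0.1, -0.1

def predict_proba(x1, x2):
    """Probability of class 1 for features x1 and x2."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

# Solve w0 + w1*x1 + w2*x2 = 0 for x2 to get points on the boundary
x1 = np.linspace(60.0, 110.0, 6)
x2_boundary = -(w0 + w1 * x1) / w2

on_boundary = predict_proba(x1, x2_boundary)
above_boundary = predict_proba(x1, x2_boundary + 5.0)

print(on_boundary)      # all values equal (or numerically close to) 0.5
print(above_boundary)   # all on one side of 0.5
```

Points on the line get a probability of 0.5, and moving off the line in either direction pushes the probability towards one class or the other.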
The final code for the implementation of Logistic Regression in Python is as follows.
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn modules
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# For plotting the classification results
from mlxtend.plotting import plot_decision_regions

# Importing the dataset
dataset = load_breast_cancer()

# Converting to pandas DataFrame
df = pd.DataFrame(dataset.data, columns = dataset.feature_names)
df['target'] = pd.Series(dataset.target)

print("Total samples in our dataset is: {}".format(df.shape[0]))

# Describe the dataset
df.describe()

# Selecting the features
features = ['mean perimeter', 'mean texture']
x = df[features]

# Target Variable
y = df['target']

# Splitting the dataset into training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 25)

# Fitting Logistic Regression to the Training set
model = LogisticRegression(random_state = 0, solver='lbfgs')
model.fit(x_train, y_train)

# Predicting the Test set results
y_pred = model.predict(x_test)

# Confusion matrix
print("Confusion Matrix")
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Classification Report
print("\nClassification Report")
report = classification_report(y_test, y_pred)
print(report)

# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Logistic Regression Accuracy of Scikit Model: {:.2f}%'.format(accuracy*100))

# Plotting the decision boundary
plot_decision_regions(x_test.values, y_test.values, clf = model, legend = 2)
plt.title("Decision boundary using Logistic Regression (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture")
In this lesson, we discussed Logistic Regression along with its implementation in Python. We will now move on to discuss other interesting classification algorithms in the upcoming lessons.