Coronavirus (COVID-19) Detection Using Machine Learning and Python – Full Tutorial

In this tutorial, we will be exploring how we can use Computer Vision and Machine Learning to detect Coronavirus (COVID-19) cases on chest X-ray images with Python.

More and more people are being diagnosed with the Coronavirus every day in different parts of the world. The inventory of testing kits is decreasing rapidly, and there is a constant need for new kits to be manufactured. How nice would it be if we could find a reliable testing mechanism to act as an alternative way of testing for the Coronavirus?

In this Machine Learning tutorial, we will be using chest X-rays to build a deep learning model capable of detecting the Coronavirus in a manner that is similar to how radiologists detect various lung diseases. Credit for this tutorial goes to the following kernel that we found on Kaggle: Covid-19 Detection from Lung X-rays.

Please note that this method of testing is made for educational purposes and is not at all recommended in practice. We hope that this tutorial will help data science aspirants as a starting point for their research.

Building a Convolutional Neural Network (CNN) to detect coronavirus

We will be building our own Convolutional Neural Network (CNN) architecture for detecting the Coronavirus (COVID-19) using Keras and TensorFlow. If you do not know what a CNN is, we would suggest you go through this free course before going through this tutorial: ‘Convolutional Neural Network Theoretical Course’.

1. Importing necessary libraries

First, let us import some essential Python libraries for building our model, pre-processing training images, and so on.

# Importing Keras libraries
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Sequential, Model
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Activation, Dropout, BatchNormalization
from keras.layers import AvgPool2D, MaxPool2D, Flatten, Dense
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.optimizers import RMSprop

# Importing TensorFlow
import tensorflow as tf

# Importing confusion matrix function from scikit-learn
from sklearn.metrics import confusion_matrix

# Importing common Python libraries
import os
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

2. Importing the dataset

For this tutorial, we will be using the COVID-19 chest X-ray images dataset available on Kaggle. You can download the dataset from here.

To follow along with this tutorial, please download the dataset and keep it in your working folder.

# The directory of the dataset
DATASET_DIR = "covid-19-x-ray-10000-images/dataset"

# Getting all the chest X-ray images of COVID-19 negative patients
normal_images = []
for img_path in glob.glob(DATASET_DIR + '/normal/*'):
    normal_images.append(mpimg.imread(img_path))

# Plotting chest X-ray image of a COVID-19 negative patient
fig = plt.figure()
plt.imshow(normal_images[0], cmap='gray')

# Getting all the chest X-ray images of COVID-19 positive patients
covid_images = []
for img_path in glob.glob(DATASET_DIR + '/covid/*'):
    covid_images.append(mpimg.imread(img_path))

# Plotting chest X-ray image of a COVID-19 positive patient
fig = plt.figure()
plt.imshow(covid_images[0], cmap='gray')

Now, let us see how many images in the dataset are of COVID-19 negative patients and how many images are of COVID-19 positive patients.

print(f"COVID-19 negative patients: {len(normal_images)}")
print(f"COVID-19 positive patients: {len(covid_images)}")
COVID-19 negative patients: 28 
COVID-19 positive patients: 70

So, we are working with a rather small dataset that is also quite imbalanced: there are 70 COVID-19 positive images but only 28 negative ones.
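
One common way to compensate for this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below computes such weights from the counts printed above. The helper name `balanced_class_weights` is hypothetical, and the mapping of class index 0 to the covid folder and 1 to the normal folder assumes the alphabetical ordering that Keras applies to directory names. The resulting dictionary could be passed to Keras through the `class_weight` argument when fitting.

```python
def balanced_class_weights(counts):
    """Return {class_index: weight} with weight = total / (n_classes * count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Image counts from the output above.
# Assumption: index 0 = covid, index 1 = normal (alphabetical folder order).
counts = {0: 70, 1: 28}

class_weights = balanced_class_weights(counts)
print(class_weights)
```

With these counts, the minority (negative) class receives the larger weight, so each of its images contributes more to the loss.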

3. Initializing the model

We will be building our own Convolutional Neural Network (CNN) model for this tutorial. First, let us define some parameters for the model.

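# Width, height and color channels of input image
IMG_W = 150
IMG_H = 150
CHANNELS = 3

# Shape of input image
INPUT_SHAPE = (IMG_W, IMG_H, CHANNELS)

# Number of classes to classify (negative and positive)
NB_CLASSES = 2

# Number of epochs and batch size
EPOCHS = 48      # matches the "Epoch 1/48" lines in the training log
BATCH_SIZE = 6   # 69 training images // 6 = 11 steps per epoch, as in the training log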

Next, let us initialize our model. The architecture consists of 5 hidden blocks (seven convolutional layers in total, as the model summary shows) followed by a small fully connected head. Note that this architecture was chosen by trial and error, and you can experiment with other architectures as well.

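# Creating a sequential model
model = Sequential()

# Hidden Layer 1
model.add(Conv2D(32, (3, 3), input_shape=INPUT_SHAPE))
model.add(Activation('relu'))  # the summary does not show the activation type; ReLU is assumed throughout
model.add(MaxPooling2D(pool_size=(2, 2)))

# Hidden Layer 2
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Hidden Layer 3
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(Conv2D(250, (3, 3)))
model.add(Activation('relu'))
# Hidden Layer 4
model.add(Conv2D(128, (3, 3)))
model.add(Activation('relu'))
model.add(AvgPool2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(AvgPool2D(pool_size=(2, 2)))

# Hidden Layer 5
model.add(Conv2D(256, (2, 2)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
# Fully Connected Layer
model.add(Flatten())
model.add(Dense(32))
model.add(Dropout(0.25))          # dropout rate is an assumption; the summary does not show it
model.add(Dense(1))
model.add(Activation('sigmoid'))  # assumed, to match the binary cross-entropy loss

# Compiling the model
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Looking at the model summary
model.summary()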
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 148, 148, 32)      896
activation_1 (Activation)    (None, 148, 148, 32)      0
max_pooling2d_1 (MaxPooling2 (None, 74, 74, 32)        0
conv2d_2 (Conv2D)            (None, 72, 72, 32)        9248
activation_2 (Activation)    (None, 72, 72, 32)        0
max_pooling2d_2 (MaxPooling2 (None, 36, 36, 32)        0
conv2d_3 (Conv2D)            (None, 34, 34, 64)        18496
activation_3 (Activation)    (None, 34, 34, 64)        0
conv2d_4 (Conv2D)            (None, 32, 32, 250)       144250
activation_4 (Activation)    (None, 32, 32, 250)       0
conv2d_5 (Conv2D)            (None, 30, 30, 128)       288128
activation_5 (Activation)    (None, 30, 30, 128)       0
average_pooling2d_1 (Average (None, 15, 15, 128)       0
conv2d_6 (Conv2D)            (None, 13, 13, 64)        73792
activation_6 (Activation)    (None, 13, 13, 64)        0
average_pooling2d_2 (Average (None, 6, 6, 64)          0
conv2d_7 (Conv2D)            (None, 5, 5, 256)         65792
activation_7 (Activation)    (None, 5, 5, 256)         0
max_pooling2d_3 (MaxPooling2 (None, 2, 2, 256)         0
flatten_1 (Flatten)          (None, 1024)              0
dense_1 (Dense)              (None, 32)                32800
dropout_1 (Dropout)          (None, 32)                0
dense_2 (Dense)              (None, 1)                 33
activation_8 (Activation)    (None, 1)                 0
=================================================================
Total params: 633,435
Trainable params: 633,435
Non-trainable params: 0

Great! We have initialized our CNN model for training.
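
The output shapes and parameter counts in the summary can be checked by hand. A convolution without padding and with stride 1 shrinks each spatial dimension by the kernel size minus one, 2x2 pooling halves it (rounding down), and a Conv2D layer has (k * k * c_in + 1) * c_out parameters. The sketch below replays that arithmetic for the layers listed in the summary. The helper names are hypothetical, and the kernel sizes are inferred from the output shapes in the summary rather than taken from the code.

```python
def conv_out(size, kernel):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping pooling (floor division)."""
    return size // pool

def conv_params(kernel, c_in, c_out):
    """Kernel weights plus one bias per output filter."""
    return (kernel * kernel * c_in + 1) * c_out

# (kernel size, filters, followed by 2x2 pooling?) for the seven Conv2D layers
layers = [(3, 32, True), (3, 32, True), (3, 64, False), (3, 250, False),
          (3, 128, True), (3, 64, True), (2, 256, True)]

size, channels, total = 150, 3, 0
for kernel, filters, pooled in layers:
    total += conv_params(kernel, channels, filters)
    size = conv_out(size, kernel)
    if pooled:
        size = pool_out(size)
    channels = filters

flat = size * size * channels   # number of inputs to the first Dense layer
total += (flat + 1) * 32        # dense_1: weights + biases
total += (32 + 1) * 1           # dense_2: weights + bias
print(size, flat, total)
```

The result agrees with the summary: a 2x2 final feature map, 1,024 flattened features, and 633,435 parameters in total.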

4. Training the model

As the dataset is very small, we will be using some image augmentation techniques (shearing, zooming, flipping, etc.) to make the dataset more varied. Then, we will be training the model on the augmented images as well as the real images.

# Initializing the training data generator
train_datagen = ImageDataGenerator(rescale=1./255,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True,
    validation_split = 0.3)

# Choosing the training directory for data generator
train_generator = train_datagen.flow_from_directory(
    DATASET_DIR,
    target_size = (IMG_H, IMG_W),
    batch_size = BATCH_SIZE,
    class_mode = 'binary',
    subset = 'training')

# Choosing the validation directory for data generator
validation_generator = train_datagen.flow_from_directory(
    DATASET_DIR,
    target_size = (IMG_H, IMG_W),
    batch_size = BATCH_SIZE,
    class_mode = 'binary',
    shuffle = False,
    subset = 'validation')

# Fitting the model
# (fit_generator is deprecated in recent TensorFlow/Keras releases, where model.fit accepts generators directly)
history = model.fit_generator(
    train_generator,
    steps_per_epoch = train_generator.samples // BATCH_SIZE,
    validation_data = validation_generator, 
    validation_steps = validation_generator.samples // BATCH_SIZE,
    epochs = EPOCHS)
Found 69 images belonging to 2 classes. 
Found 29 images belonging to 2 classes. 
Epoch 1/48 11/11 [==============================] - 5s 431ms/step - loss: 0.9489 - accuracy: 0.6667 - val_loss: 0.6989 - val_accuracy: 0.8750 
Epoch 2/48 11/11 [==============================] - 5s 449ms/step - loss: 0.6878 - accuracy: 0.6515 - val_loss: 0.4423 - val_accuracy: 0.7826 
Epoch 3/48 11/11 [==============================] - 5s 413ms/step - loss: 0.7247 - accuracy: 0.7333 - val_loss: 0.3371 - val_accuracy: 0.6522 
Epoch 4/48 11/11 [==============================] - 4s 392ms/step - loss: 0.6387 - accuracy: 0.7460 - val_loss: 0.3243 - val_accuracy: 0.6522 
Epoch 5/48 11/11 [==============================] - 4s 370ms/step - loss: 0.6985 - accuracy: 0.7143 - val_loss: 0.9496 - val_accuracy: 0.6522 
Epoch 6/48 11/11 [==============================] - 5s 435ms/step - loss: 0.6414 - accuracy: 0.6190 - val_loss: 0.7891 - val_accuracy: 0.8750 
Epoch 7/48 11/11 [==============================] - 5s 447ms/step - loss: 0.5649 - accuracy: 0.7302 - val_loss: 0.5828 - val_accuracy: 0.7826 
Epoch 8/48 11/11 [==============================] - 4s 401ms/step - loss: 0.7349 - accuracy: 0.6818 - val_loss: 0.5026 - val_accuracy: 0.6522 
Epoch 9/48 11/11 [==============================] - 4s 391ms/step - loss: 0.4979 - accuracy: 0.7778 - val_loss: 0.0903 - val_accuracy: 0.6522 
Epoch 10/48 11/11 [==============================] - 4s 369ms/step - loss: 0.6699 - accuracy: 0.6508 - val_loss: 1.2273 - val_accuracy: 0.6957 
Epoch 11/48 11/11 [==============================] - 4s 387ms/step - loss: 1.0450 - accuracy: 0.7937 - val_loss: 0.5215 - val_accuracy: 0.9167 
Epoch 12/48 11/11 [==============================] - 5s 412ms/step - loss: 0.4828 - accuracy: 0.8413 - val_loss: 0.1677 - val_accuracy: 0.8696 
...
Epoch 46/48 11/11 [==============================] - 4s 396ms/step - loss: 0.0094 - accuracy: 1.0000 - val_loss: 1.7849 - val_accuracy: 0.9167 
Epoch 47/48 11/11 [==============================] - 5s 455ms/step - loss: 0.1794 - accuracy: 0.9524 - val_loss: 1.5018e-06 - val_accuracy: 1.0000 
Epoch 48/48 11/11 [==============================] - 4s 383ms/step - loss: 0.2225 - accuracy: 0.9365 - val_loss: 2.9316e-05 - val_accuracy: 0.9565

We have trained our model for 48 epochs and our final validation accuracy is 0.9565.
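
The image counts and step counts in this log can be traced back to the settings. The sketch below is a rough reconstruction, assuming that Keras takes the validation subset separately for each class and rounds it down, and assuming a batch size of 6 (inferred from the log, since 69 // 6 gives the 11 steps shown per epoch). The helper `split_counts` is hypothetical.

```python
def split_counts(num_files, validation_split):
    """Per-class (train, validation) counts, assuming the validation share
    is int(validation_split * num_files) files of each class."""
    n_val = int(validation_split * num_files)
    return num_files - n_val, n_val

class_sizes = {"covid": 70, "normal": 28}
VALIDATION_SPLIT = 0.3
BATCH_SIZE = 6  # assumption inferred from the 11 steps per epoch in the log

train_total = sum(split_counts(n, VALIDATION_SPLIT)[0] for n in class_sizes.values())
val_total = sum(split_counts(n, VALIDATION_SPLIT)[1] for n in class_sizes.values())

steps_per_epoch = train_total // BATCH_SIZE
validation_steps = val_total // BATCH_SIZE
print(train_total, val_total, steps_per_epoch, validation_steps)
```

Since 29 // 6 is 4, each validation pass covers four batches rather than the whole validation set, which likely contributes to the jumps in val_accuracy from one epoch to the next.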

5. Visualizing the model results

Let us visualize the training and validation results of the model.

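# Plotting the training accuracy and validation accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Plotting the training loss and validation loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

Looking at the charts, we can see that the training loss trends downward and the training accuracy trends upward over the epochs. The validation curves are much noisier, as the training log also shows (for example, val_loss goes from 1.7849 at epoch 46 to 1.5018e-06 at epoch 47), which is to be expected with a validation set of only 29 images.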

6. Evaluating the model using confusion matrix

It is time for us to evaluate the model using a confusion matrix.

# Getting some predictions using the validation generator
pred = model.predict(validation_generator)
# Note: pred has shape (n, 1), so argmax along axis 1 returns 0 for every image
predicted_class_indices = np.argmax(pred, axis=1)
labels = (validation_generator.class_indices)
labels2 = dict((v,k) for k,v in labels.items())
predictions = [labels2[k] for k in predicted_class_indices]

# Creating the confusion matrix
# (predictions are passed first, so rows are predicted classes and columns are true classes)
cf = confusion_matrix(predicted_class_indices, validation_generator.classes)
cf
array([[21, 8], 
       [ 0, 0]])

Our model has predicted all 29 validation images as COVID-19 positive: 21 of them are actually COVID-19 positive (true positives) and 8 are actually COVID-19 negative (false positives).
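
A quick sanity check of the decoding step is worthwhile here. The model summary shows a single output unit, so `pred` has shape `(n, 1)`. The sketch below uses made-up probabilities (the values and the 0.5 threshold are illustrative assumptions, not outputs of the model above) to compare `np.argmax` on such an array with simple thresholding.

```python
import numpy as np

# Made-up single-column outputs, shaped like the predictions of a one-unit model
pred = np.array([[0.02], [0.97], [0.40], [0.85]])

# argmax over a single column can only ever return index 0
argmax_classes = np.argmax(pred, axis=1)

# Thresholding turns each probability into a 0/1 class label
threshold_classes = (pred.ravel() > 0.5).astype(int)

print(argmax_classes)
print(threshold_classes)
```

Because `flow_from_directory` assigns class indices in alphabetical order of the folder names, index 0 would correspond to the covid folder, which is consistent with a matrix in which every image is predicted positive.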

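All of our predictions come out as COVID-19 positive partly because of the decoding step: with a single output unit, np.argmax(pred, axis=1) always returns class index 0, so the model's output value needs to be thresholded (for example at 0.5) instead. In addition, our starting dataset is very imbalanced, so accuracy alone is not a reliable metric for judging the model. Metrics such as precision and recall are more informative, but covering them in depth would make this tutorial more complicated and harder to digest in a single go.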
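
As a small illustration of those metrics, the sketch below computes precision, recall, and specificity directly from the four counts in the confusion matrix above, treating COVID-19 positive as the positive class. The helper `precision_recall` is hypothetical and written out by hand only to make the formulas visible.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Counts read off the confusion matrix above
tp, fp, fn, tn = 21, 8, 0, 0

precision, recall = precision_recall(tp, fp, fn)
specificity = tn / (tn + fp) if (tn + fp) else 0.0
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

A recall of 1.0 looks perfect in isolation, but the specificity of 0 shows that not a single negative case is identified, which accuracy on an imbalanced dataset can easily hide.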

In Conclusion

There you have it! The goal of this tutorial was to give students a starting point for using Machine Learning to detect the Coronavirus (COVID-19). Again, this tutorial is just for educational purposes and is not aimed at building a production-ready coronavirus detection model.

