Machine Learning is a sub-field of Artificial Intelligence (AI) that enables computer systems to learn and improve at a wide range of tasks without being explicitly programmed. It has gained immense popularity over the last few decades for several reasons, including the rise in computational power, the growing volume of available data, and the discovery of new use cases.
At a high level, Machine Learning algorithms can be classified into three major categories based on the objective of learning:
Student Number | Number of class attendances | Home assignment marks | Number of completed projects | Examination Result (Pass/Fail) |
1 | 78 | 9 | 3 | Pass |
2 | 56 | 6 | 1 | Fail |
3 | 88 | 8 | 3 | Pass |
4 | 72 | 7 | 3 | Pass |
5 | 86 | 9 | 5 | Pass |
6 | 60 | 5 | 1 | ? |
Here, a good supervised learning model will predict that student number '6' will fail the examination, since this student shares many traits with student number '2', who failed the examination in the past.
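The intuition above can be sketched in plain Python as a one-nearest-neighbour lookup. This is only an illustration and not a technique the lesson prescribes: the helper names `squared_distance` and `predict_nearest`, and the choice of squared Euclidean distance on unscaled columns, are assumptions made for this example.

```python
# Labelled rows from the table:
# (class attendances, home assignment marks, completed projects) -> result
labelled_students = [
    ((78, 9, 3), "Pass"),
    ((56, 6, 1), "Fail"),
    ((88, 8, 3), "Pass"),
    ((72, 7, 3), "Pass"),
    ((86, 9, 5), "Pass"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_nearest(features, labelled_rows):
    """Return the label of the labelled row closest to `features`.

    Hypothetical helper for illustration only.
    """
    _, label = min(
        labelled_rows,
        key=lambda row: squared_distance(features, row[0]),
    )
    return label


student_6 = (60, 5, 1)
prediction = predict_nearest(student_6, labelled_students)
print(prediction)  # Fail (student 2 is the closest labelled row)
```

Because the attendance column has a much larger range than the other two, it dominates the distance here; a real model would usually scale the features first.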
In this course, we will be focusing on Supervised Learning and its two most widely used techniques: Regression and Classification.
Before we move on to learn about the various techniques and algorithms used for Machine Learning, let us learn about some of the basic concepts involved with Supervised Machine Learning.
In Supervised Machine Learning, the dataset generally contains two types of data variables:
As mentioned in the introductory section, with supervised learning we can predict whether a student will pass or fail an exam based on their number of class attendances, home assignment marks, and number of completed projects.
Here, the dependent variable is the examination result (pass/fail). Since it depends on the remaining columns, those columns are the independent variables.
Variable | Type of Variable |
Examination Result (Pass/Fail) | Dependent (Target) |
Number of class attendances | Independent (Feature) |
Home assignment marks | Independent (Feature) |
Number of completed projects | Independent (Feature) |
(Note that a dataset may contain two or more dependent variables; such cases are common in complex Machine Learning problems.)
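As a minimal sketch of this distinction, the student table can be separated into features and a target in plain Python. The variable names `rows`, `features`, and `target` are chosen here for illustration, and the layout (target in the last column) is an assumption of this example.

```python
# Each row: class attendances, home assignment marks,
# completed projects, examination result
rows = [
    (78, 9, 3, "Pass"),
    (56, 6, 1, "Fail"),
    (88, 8, 3, "Pass"),
    (72, 7, 3, "Pass"),
    (86, 9, 5, "Pass"),
]

# Independent variables (features): every column except the last one
features = [row[:-1] for row in rows]

# Dependent variable (target): the last column
target = [row[-1] for row in rows]

print(features[0])  # (78, 9, 3)
print(target[0])    # Pass
```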
When a model is being trained on a dataset, it tries to get the predicted value of the dependent variable as close as possible to the actual value of the dependent variable.
Consider the following set of actual target values and predicted target values for the examination results of 5 students:
Student Number | Number of class attendances | Home assignment marks | Number of completed projects | Actual Examination Result (Pass/Fail) | Predicted Examination Result (Pass/Fail) |
1 | 78 | 9 | 3 | Pass | Pass |
2 | 56 | 6 | 1 | Fail | Fail |
3 | 88 | 8 | 3 | Pass | Fail |
4 | 72 | 7 | 3 | Pass | Pass |
5 | 86 | 9 | 5 | Pass | Fail |
Here, the model correctly predicted the target value for students '1', '2' and '4', but it was incorrect for students '3' and '5'. The Error Percentage can thus be calculated as,
Error Percentage = (Number of incorrect predictions / Total number of predictions) * 100%
And, the Accuracy Percentage can be calculated as,
Accuracy Percentage = 100% - Error Percentage.
So, for the above dataset, the Error Percentage is obtained as,
Error Percentage = (2/5)*100% = 40%
And, the Accuracy Percentage is obtained as,
Accuracy Percentage = 100% - 40% = 60%
This is one of the simplest ways in which a Supervised Machine Learning model can be evaluated.
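The calculation above can be reproduced with a few lines of Python. The function name `error_percentage` is a hypothetical helper introduced for this sketch; the two lists are copied from the actual and predicted columns of the table.

```python
actual = ["Pass", "Fail", "Pass", "Pass", "Pass"]
predicted = ["Pass", "Fail", "Fail", "Pass", "Fail"]


def error_percentage(actual, predicted):
    """Percentage of predictions that differ from the actual values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    incorrect = sum(a != p for a, p in zip(actual, predicted))
    return 100 * incorrect / len(actual)


error = error_percentage(actual, predicted)
accuracy = 100 - error

print(error)     # 40.0
print(accuracy)  # 60.0
```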
When evaluating a model, it is important to know how the model performs on data it hasn't been trained on. Therefore, we keep a subset of data from the dataset as the test set for evaluation purposes.
As a general rule of thumb, the dataset is randomly split into a training set and a test set in a ratio of 70:30. This means that for a dataset containing 100 rows, the model trains on only 70 randomly chosen rows, and the remaining 30 rows are used to evaluate the accuracy of the model.
Evaluating the model on the training set gives the training set accuracy, and evaluating it on the test set gives the test set accuracy.
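A random 70:30 split can be sketched with only the Python standard library. The function `split_rows`, its `test_ratio` and `seed` parameters, and the stand-in dataset of 100 integers are all assumptions made for this example; libraries introduced later in a course typically provide their own splitting utilities.

```python
import random


def split_rows(rows, test_ratio=0.3, seed=0):
    """Shuffle a copy of `rows` and return (train_rows, test_rows).

    Hypothetical helper: `seed` makes the shuffle repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


dataset = list(range(100))  # stand-in for 100 rows of data
train_rows, test_rows = split_rows(dataset)

print(len(train_rows), len(test_rows))  # 70 30
```

Fixing the seed keeps the split identical between runs, which makes training and test accuracies comparable while experimenting.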
Although the bias-variance tradeoff sounds like a complex term, it is quite simple to understand.
When a Machine Learning model fails to perform well in predicting the labels of the training set, it is said to have high bias and low variance. This kind of performance indicates that the model is under-fitted and is inaccurate even on the training set.
On the other hand, when a Machine Learning model performs well in predicting the labels of the training set but poorly in predicting the labels of the test set, it is said to have low bias and high variance. This kind of performance indicates that the model is over-fitted: it is extremely accurate on the training set but inaccurate on the test set.
Under-fitting and over-fitting are two of the most common challenges that Machine Learning models are prone to. Therefore, to build the best performing model, there should be a tradeoff between the bias and the variance such that the model is accurate in predicting the labels of the training set as well as the test set. This is the bias-variance tradeoff.
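One rough way to apply these definitions is to compare the training set accuracy with the test set accuracy. The sketch below is illustrative only: the function `diagnose_fit` and its thresholds `min_accuracy` and `max_gap` are arbitrary assumptions for this example, not standard values, and sensible cut-offs depend on the problem.

```python
def diagnose_fit(train_accuracy, test_accuracy, min_accuracy=80.0, max_gap=10.0):
    """Classify a model from its train/test accuracy percentages.

    Hypothetical helper; both thresholds are illustrative assumptions.
    """
    if train_accuracy < min_accuracy:
        # Poor even on the data it was trained on
        return "under-fitted (high bias)"
    if train_accuracy - test_accuracy > max_gap:
        # Good on training data, much worse on unseen data
        return "over-fitted (high variance)"
    return "balanced"


print(diagnose_fit(60.0, 58.0))  # under-fitted (high bias)
print(diagnose_fit(99.0, 70.0))  # over-fitted (high variance)
print(diagnose_fit(90.0, 88.0))  # balanced
```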
Most Supervised Machine Learning models are trained and evaluated using the same basic process as shown in the diagram below.
The steps are as follows:
The whole process is usually repeated several times until satisfactory results are observed in the model evaluation stage.
Now, let us move on to the next lesson, where we will set up the libraries and tools required for building Supervised Machine Learning models using Python.