Machine Learning is a sub-field of Artificial Intelligence (AI) that enables computer systems to learn and improve at a wide range of tasks without being explicitly programmed. It has gained immense popularity over the last few decades for several reasons: the rise in computational power, the growing volume of generated data, and the discovery of new use cases.
At a high level, Machine Learning algorithms can be classified into three major categories based on the objective of learning:
- Supervised Machine Learning: The objective of supervised machine learning is to predict a label based on a set of features. For example, we can predict if a student will pass or fail an exam based on his/her number of class attendances, home assignment marks, and number of completed projects.
| Student Number | Number of class attendances | Home assignment marks | Number of completed projects | Examination Result (Pass/Fail) |
|---|---|---|---|---|
| 1 | 78 | 9 | 3 | Pass |
| 2 | 56 | 6 | 1 | Fail |
| 3 | 88 | 8 | 3 | Pass |
| 4 | 72 | 7 | 3 | Pass |
| 5 | 86 | 9 | 5 | Pass |
| 6 | 60 | 5 | 1 | ? |
Here, a good supervised learning model will predict that student number ‘6’ will fail the examination, since that student shares many traits with student number ‘2’, who failed the examination in the past.
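One way to make this intuition concrete is a 1-nearest-neighbour sketch: predict for student ‘6’ the label of the labeled student whose features are closest. This is just an illustration of the idea, not the specific algorithm this course will use later.

```python
import math

# Labeled examples from the table above:
# (class attendances, assignment marks, completed projects) -> result
training_data = [
    ((78, 9, 3), "Pass"),
    ((56, 6, 1), "Fail"),
    ((88, 8, 3), "Pass"),
    ((72, 7, 3), "Pass"),
    ((86, 9, 5), "Pass"),
]

def predict(features):
    """Return the label of the closest labeled example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda item: math.dist(features, item[0]))
    return nearest[1]

print(predict((60, 5, 1)))  # student 6 -> Fail
```

Student ‘6’ is closest to student ‘2’ in feature space, so the sketch predicts Fail, matching the intuition above.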
- Unsupervised Machine Learning: The objective of unsupervised learning is to draw inferences from a set of features without any labels. For example, with unsupervised learning, we can group students based on the number of projects they have completed: students with few completed projects are assigned to one group and students with many completed projects are assigned to another.
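A minimal sketch of this grouping idea, assuming we simply split the project counts around their overall mean (a stand-in for a real clustering algorithm such as k-means):

```python
project_counts = [3, 1, 3, 3, 5, 1]  # one value per student, no labels needed

# Split into two groups around the overall mean of the feature
threshold = sum(project_counts) / len(project_counts)  # 16/6 ≈ 2.67
low  = [c for c in project_counts if c <  threshold]   # "few projects" group
high = [c for c in project_counts if c >= threshold]   # "many projects" group
print(low, high)  # [1, 1] [3, 3, 3, 5]
```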
- Reinforcement Machine Learning: The objective of reinforcement learning is to train an agent (bot) to perform a specific task through an iterative trial-and-error process. For example, with reinforcement learning, we can train a chess-playing bot by making it play thousands of games beforehand. This allows the agent to learn which moves tend to lead to wins and to avoid moves that lead to losses.
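The trial-and-error idea can be sketched in miniature with a made-up two-move game: the agent does not know which move wins more often, but by repeatedly trying moves and updating a running value estimate from the rewards, it discovers the better move. The win probabilities and learning rate here are arbitrary illustrative choices.

```python
import random

random.seed(0)
# Two possible moves; move "b" wins more often (unknown to the agent)
win_prob = {"a": 0.2, "b": 0.8}
value = {"a": 0.0, "b": 0.0}  # the agent's running estimate of each move's worth

for _ in range(1000):                    # many practice games
    move = random.choice(list(value))    # try a move at random (exploration)
    reward = 1 if random.random() < win_prob[move] else 0
    value[move] += 0.1 * (reward - value[move])  # nudge estimate toward outcome

best = max(value, key=value.get)
print(best, value)
```

After enough trials, the estimates settle near each move's true win rate, so the agent reliably picks move "b".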
In this course, we will be focusing on Supervised Learning and its two most widely used techniques: Regression and Classification.
Basic concepts related to Supervised Machine Learning
Before we move on to learn about the various techniques and algorithms used for Machine Learning, let us learn about some of the basic concepts involved with Supervised Machine Learning.
1. Characteristics of a dataset in Supervised Machine learning
In Supervised Machine Learning, the dataset generally contains two types of data variables:
- Dependent variable (Target) – The dependent variable is the label or the target we want to predict.
- Independent variable (Feature) – The independent variable is the feature that the target is dependent on.
As mentioned in the introductory section, with supervised learning, we can predict if a student will pass or fail an exam based on his/her number of class attendances, home assignment marks and the number of completed projects.
Here, the dependent variable is the examination result (Pass/Fail); the remaining columns, which it depends on, are the independent variables.
| Variable | Type of Variable |
|---|---|
| Examination Result (Pass/Fail) | Dependent (Target) |
| Number of class attendances | Independent (Feature) |
| Home assignment marks | Independent (Feature) |
| Number of completed projects | Independent (Feature) |
(Note that a dataset may contain two or more dependent variables; such cases are common in complex Machine Learning problems.)
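In code, separating the dataset into features and target usually amounts to splitting off the target column. A minimal sketch with plain Python dictionaries (the column names are the ones from the table above):

```python
rows = [
    {"attendances": 78, "marks": 9, "projects": 3, "result": "Pass"},
    {"attendances": 56, "marks": 6, "projects": 1, "result": "Fail"},
]

target_column = "result"
y = [row[target_column] for row in rows]                  # dependent variable
X = [{k: v for k, v in row.items() if k != target_column} # independent variables
     for row in rows]
print(y)  # ['Pass', 'Fail']
```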
2. Evaluation of a Supervised Machine Learning model
When a model is being trained on a dataset, it is trying to get the predicted value of the dependent variable as close as possible to the actual value of the dependent variable.
Consider the following set of actual and predicted target values for the examination results of 5 students:
| Student Number | Number of class attendances | Home assignment marks | Number of completed projects | Actual Examination Result (Pass/Fail) | Predicted Examination Result (Pass/Fail) |
|---|---|---|---|---|---|
| 1 | 78 | 9 | 3 | Pass | Pass |
| 2 | 56 | 6 | 1 | Fail | Fail |
| 3 | 88 | 8 | 3 | Pass | Fail |
| 4 | 72 | 7 | 3 | Pass | Pass |
| 5 | 86 | 9 | 5 | Pass | Fail |
Here, the model has correctly predicted the target value for students ‘1’, ‘2’, and ‘4’, but was incorrect for students ‘3’ and ‘5’. The Error Percentage can thus be calculated as,
Error Percentage = (Number of incorrect predictions/Number of total predictions)*100%
And, the Accuracy Percentage can be calculated as,
Accuracy Percentage = 100% – Error Percentage.
So, for the above dataset, the Error Percentage is obtained as,
Error Percentage = (2/5)*100% = 40%
And, the Accuracy Percentage is obtained as,
Accuracy Percentage = 100% – 40% = 60%
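The same calculation can be sketched in a few lines of Python, comparing the actual and predicted columns from the table above:

```python
actual    = ["Pass", "Fail", "Pass", "Pass", "Pass"]
predicted = ["Pass", "Fail", "Fail", "Pass", "Fail"]

# Count the positions where the prediction disagrees with the actual label
incorrect = sum(a != p for a, p in zip(actual, predicted))

error_pct    = incorrect / len(actual) * 100  # (2/5)*100 = 40.0
accuracy_pct = 100 - error_pct                # 100 - 40 = 60.0
print(error_pct, accuracy_pct)  # 40.0 60.0
```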
This is one of the simplest ways in which a Supervised Machine Learning model can be evaluated.
3. Train/Test split
When evaluating a model, it is important to know how the model performs on data it hasn’t been trained on. Therefore, we keep a subset of data from the dataset as the test set for evaluation purposes.
As a general rule of thumb, the dataset is randomly split into training and test sets in a 70:30 ratio. This means that for a dataset containing 100 rows, the model will train on only 70 randomly chosen rows, and the remaining 30 rows will be used to evaluate the accuracy of the model.
Performing model evaluation on the training set will give the training set accuracy and performing model evaluation on the testing set will give the test set accuracy.
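A minimal sketch of a random 70:30 split using only the standard library (libraries such as scikit-learn provide a ready-made `train_test_split`, but the idea is just shuffle-then-slice):

```python
import random

random.seed(42)
rows = list(range(100))      # stand-in for 100 rows of data
random.shuffle(rows)         # randomise the order before splitting

split = int(len(rows) * 0.7) # 70:30 ratio
train, test = rows[:split], rows[split:]
print(len(train), len(test)) # 70 30
```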
4. The bias-variance tradeoff
Although the bias-variance tradeoff sounds like a complex term, it is quite simple to understand.
When a Machine Learning model fails to perform well even in predicting the labels of the training set, it is said to have high bias and low variance. This indicates that the model is under-fitted: it is inaccurate on the training set itself.
On the other hand, when a Machine Learning model performs well in predicting the labels of the training set but fails to perform well on the test set, it is said to have low bias and high variance. This indicates that the model is over-fitted: extremely accurate on the training set but inaccurate on the test set.
Under-fitting and over-fitting are two of the most common challenges that Machine Learning models are prone to. Therefore, to build the best-performing model, there must be a tradeoff between bias and variance such that the model predicts the labels of both the training set and the test set accurately. This is the bias-variance tradeoff.
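The two failure modes can be read directly off the training and test accuracies. The sketch below encodes that reading as a rough heuristic; the thresholds (`good=0.9`, `gap=0.1`) are arbitrary illustrative choices, not universal rules.

```python
def diagnose(train_acc, test_acc, good=0.9, gap=0.1):
    """Rough rule of thumb for reading a model's train/test accuracy."""
    if train_acc < good:                 # poor even on the training set
        return "under-fitted (high bias)"
    if train_acc - test_acc > gap:       # great on train, poor on test
        return "over-fitted (high variance)"
    return "good tradeoff"

print(diagnose(0.65, 0.63))  # under-fitted (high bias)
print(diagnose(0.99, 0.70))  # over-fitted (high variance)
print(diagnose(0.93, 0.90))  # good tradeoff
```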
5. The Supervised Machine Learning process
Most Supervised Machine Learning models are trained and evaluated using the same basic process as shown in the diagram below.
The steps are as follows:
- Data preparation: Data preparation is one of the most challenging and time-consuming tasks in any Machine Learning process. In this step, all necessary data is collected from various sources, pre-processed, and then split into training and test sets for further processing.
- Model Building: The actual model is built in this step using various kinds of Supervised Machine Learning algorithms.
- Model Training: The built model is trained by feeding it the training data iteratively. In each iteration, the model tries to become more accurate by decreasing its error. Training is stopped when a fixed number of iterations is reached or when other pre-defined stopping criteria are met.
- Model Evaluation: The trained model is evaluated against a test set for determining its performance and finding ways to improve it.
The whole process is typically iterated multiple times until satisfactory results are observed in the model evaluation stage.
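The steps above can be sketched end-to-end with the toy student dataset. This reuses the 1-nearest-neighbour idea from the introduction purely as a placeholder model; the extra rows beyond the original table are made up so that the 70:30 split is non-trivial.

```python
import math
import random

def prepare(rows, train_frac=0.7, seed=0):
    """Data preparation: shuffle, then split into training and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    split = int(len(rows) * train_frac)
    return rows[:split], rows[split:]

def predict(train, features):
    """Placeholder model: label of the closest training example."""
    return min(train, key=lambda r: math.dist(features, r[0]))[1]

def evaluate(train, test):
    """Model evaluation: accuracy percentage on the test set."""
    correct = sum(predict(train, f) == label for f, label in test)
    return correct / len(test) * 100

# (attendances, marks, projects) -> result; rows 7-10 are invented for illustration
data = [((78, 9, 3), "Pass"), ((56, 6, 1), "Fail"), ((88, 8, 3), "Pass"),
        ((72, 7, 3), "Pass"), ((86, 9, 5), "Pass"), ((60, 5, 1), "Fail"),
        ((90, 10, 4), "Pass"), ((50, 4, 0), "Fail"), ((80, 8, 2), "Pass"),
        ((58, 5, 2), "Fail")]

train, test = prepare(data)
print(f"Test accuracy: {evaluate(train, test):.0f}%")
```

If the test accuracy is unsatisfactory, the cycle repeats: prepare the data differently, build a better model, retrain, and re-evaluate.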
Now, let us move on to the next lesson, where we will set up the necessary libraries and tools for building Supervised Machine Learning models in Python.