The goal of a regression problem in machine learning is to find a function that accurately predicts a continuous output from the input data. Similarly, a classification problem involves finding a function that accurately assigns inputs to the correct classes. The accuracy of the model is determined by how well it predicts the output values for given input values. Here, we will discuss one such metric used to iteratively calibrate a model, known as the cost function.
How does a Machine Learning model learn?
Before answering how a model learns, it is important to know what a model actually learns. This varies from model to model. In simple terms, the objective of a model is to learn a function $f(x)$ that can predict the value of an output variable $y$ from the input variable $x$.
Consider a regression model for predicting house prices based on the size of the house. This is an example of simple linear regression with a single input variable (size) and an output variable (price). The function is written as,
$$y = wx + b$$
where
$y$ is the dependent variable representing the house price,
$x$ is the independent variable representing the size of the house,
$w$ is the weight, and
$b$ is the bias.
Here, the values $w$ and $b$ are the parameters that the model needs to learn in order to predict the value of $y$ for a given value of $x$.
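As a rough sketch of this prediction step (the values of $w$ and $b$ below are made up purely for illustration, not learned from data), the linear model can be written in a few lines of Python:

```python
import numpy as np

# Hypothetical, hand-picked parameters purely for illustration:
# price (in $1000s) = w * size (in square feet) + b
w, b = 0.15, 50.0

sizes = np.array([1000, 1500, 2000])   # input variable x (house sizes)
predicted_prices = w * sizes + b       # y = w*x + b

print(predicted_prices)                # [200. 275. 350.]
```

In practice, the model starts from arbitrary values of $w$ and $b$ and refines them during training.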
Now, getting back to how a model learns its parameters: in basic terms, a model learns by experience. The process can be compared to the learning of a toddler who is unaware of most of the things around her. She gradually learns the right way of doing things through trial and error, with her parents reinforcing common sense and good behavior by praising her when she does something well and correcting her when she does not.
Consider the phase when the toddler is learning how to walk. She typically starts by trying to stand up on her own, but she is bound to fall on her first attempt because she does not yet know how to balance on her feet. The next time she tries, she has already learned that she will fall if she tries the same way as before, so she takes support from a nearby wall to avoid falling. Through this gradual process, the toddler learns to balance her body, stand still, and eventually walk. The learning problem here is to find the balance that minimizes falling, which is analogous to what minimizing a cost function does for a model.
What are Cost Functions?
A cost function in machine learning is a metric that measures the performance of a model. It is computed from the difference, or distance, between the predicted values and the actual values, and it is evaluated repeatedly during training to compare the model's predictions against the true values of $y$. It gives an estimate of how well the model is performing.
It is also known as the loss function or the error metric of the model. The lower the value of the cost function, the better the model. The cost function quantifies how far the predicted values are from the actual values, and the training procedure adjusts the parameters to reduce it. The goal of a machine learning or deep learning model is therefore to find, through an iterative process, the set of parameters that minimizes the cost function until it cannot be reduced further.
Types of cost functions
Let us now have a closer look at some of the common types of cost functions used in machine learning.
1. Distance-based error
Distance-based error is the fundamental idea underlying many other cost functions. For a given input, suppose the actual output is $y$. The model starts with randomly initialized parameters $w$ and $b$ for the function $y = wx + b$, and its predicted output is $y'$. The distance-based error is then calculated as,
$$Error = y - y'$$
where
$y$ is the actual value,
$y'$ is the predicted value from the model.
This equation forms the basis for the cost functions used in regression problems. However, the raw distance-based error can be positive or negative, so individual errors may cancel each other out when summed. We will therefore discuss other cost functions that overcome this limitation.
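The cancellation problem is easy to see in a minimal NumPy sketch (the values are illustrative, not taken from any dataset):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # actual values (illustrative)
y_pred = np.array([2.5, 6.0, 7.0])   # predicted values (illustrative)

errors = y_true - y_pred             # per-sample distance-based error
print(errors)                        # [ 0.5 -1.   0. ]
print(errors.sum())                  # -0.5: positive and negative errors partially cancel
```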
2. Mean Squared Error (MSE)
Mean squared error is one of the simplest and most commonly used cost functions in machine learning. It is closely related to the sum of squared errors: it sums the squared errors and then averages them. Squaring the error difference eliminates negative values, overcoming the limitation of the plain distance-based error. The mean squared error is also known as the L2 loss and is calculated as,
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - y'_i)^2$$
where
$y_i$ is the actual value of the output for the $i$-th observation,
$y'_i$ is the predicted value of the output for the $i$-th observation, and
$N$ is the total number of observations taken.
For data containing outliers or noise, squaring further magnifies the large errors, which can cause the overall cost to blow up. The Mean Absolute Error, discussed below, helps mitigate this problem.
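A minimal NumPy sketch of the MSE (with illustrative values only) could look like this:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """MSE: the average of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values only
print(mean_squared_error([3.0, 5.0, 7.0], [2.5, 6.0, 7.0]))  # ~0.417
```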
3. Root Mean Squared Error (RMSE)
The root mean squared error is simply the square root of the mean squared error discussed above. It is calculated as,
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - y'_i)^2}$$
where
$y_i$ is the actual value of the output for the $i$-th observation,
$y'_i$ is the predicted value of the output for the $i$-th observation, and
$N$ is the total number of observations taken.
Because of the square root, RMSE is expressed in the same units as the output variable. It is considered a good measure of a model's performance when we want to estimate how far, on average, an observed value deviates from the model's prediction, roughly analogous to a standard deviation ($\sigma$).
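A minimal sketch of the RMSE, reusing the same illustrative values as above:

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """RMSE: the square root of the MSE, expressed in the same units as y."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative values only
print(root_mean_squared_error([3.0, 5.0, 7.0], [2.5, 6.0, 7.0]))  # ~0.645
```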
4. Mean Absolute Error (MAE)
Mean absolute error is similar to the mean squared error, but it takes the absolute value of the difference between the actual and predicted values, which also avoids negative errors. Because the errors are not squared, it addresses the outlier sensitivity of the MSE. The mean absolute error is also known as the L1 loss and is calculated as,
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - y'_i|$$
where
$y_i$ is the actual value of the output for the $i$-th observation,
$y'_i$ is the predicted value of the output for the $i$-th observation, and
$N$ is the total number of observations taken.
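A minimal sketch of the MAE (illustrative values only):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE: the average of the absolute differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# Illustrative values only
print(mean_absolute_error([3.0, 5.0, 7.0], [2.5, 6.0, 7.0]))  # 0.5
```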
5. Cross Entropy function
The cross-entropy function is also known as the log loss function. It is used to measure the performance of a classification model whose output is a probability value between 0 and 1.
The function measures the distance between two probability distributions $p$ and $q$, where $p$ is the actual probability distribution and $q$ is the predicted probability distribution of the output from the model. The cross-entropy is calculated as,
$$H(p, q) = -\sum_{i=1}^{N} p(x_i) \log q(x_i)$$
where
$p(x_i)$ is the actual probability of class $i$,
$q(x_i)$ is the predicted probability of class $i$, and
$N$ is the number of classes.
For example, assume a classification problem with 3 classes of fruit images: Orange, Apple, Mango. For each of the three possible classes, the trained classification model outputs a predicted probability. The predicted value of probability distribution from the model is $q$ = [0.5, 0.2, 0.3]. Here, the problem is a supervised learning problem and we know that the input is an Orange. Hence, the actual probability distribution for the problem is $p$ = [1, 0, 0].
The cross-entropy function measures the deviation between the two distributions; a larger deviation results in a higher cross-entropy. Hence, the cross-entropy for this prediction is calculated as,
$$H(p, q) = -\sum_{i=1}^{N} p(x_i) \log q(x_i) = -1 \cdot \log(0.5) \approx 0.693$$
The higher the cross-entropy, the worse the model's predictions; a perfect model has a log loss of 0. (The natural logarithm is used in the calculation above.)
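Here is a minimal NumPy sketch that reproduces the fruit example above (the small `eps` term is an assumption added only to guard against taking the log of zero):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between the true distribution p and the predicted distribution q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))   # eps guards against log(0)

p = [1.0, 0.0, 0.0]   # actual distribution: the image is an Orange
q = [0.5, 0.2, 0.3]   # model's predicted probabilities for Orange, Apple, Mango
print(cross_entropy(p, q))   # ~0.693 (natural log)
```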
6. Kullback-Leibler (KL) Divergence
The KL divergence is quite similar to cross-entropy: it measures the difference (or divergence) between two probability distributions $p$ and $q$, where $p$ is the actual distribution and $q$ is the predicted distribution from the model. The Kullback-Leibler divergence from $q$ to $p$ is calculated as,
$$D_{KL}(p \| q) = \sum_{i=1}^{N} p(x_i) \log\left(\frac{p(x_i)}{q(x_i)}\right)$$
where
$p(x_i)$ is the actual probability of outcome $i$,
$q(x_i)$ is the predicted probability of outcome $i$, and
$N$ is the number of possible outcomes.
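A minimal sketch of the KL divergence, reusing the fruit example from the cross-entropy section (again, `eps` is an assumption added only for numerical safety):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = [1.0, 0.0, 0.0]   # actual distribution (the image is an Orange)
q = [0.5, 0.2, 0.3]   # predicted distribution
print(kl_divergence(p, q))   # ~0.693; equals the cross-entropy here because p is one-hot
```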
7. Hinge loss
The hinge loss is a common cost function used in Support Vector Machines (SVMs) for classification. It is used with target labels encoded as $-1$ and $+1$, and it penalizes predictions that fall on the wrong side of, or too close to, the decision boundary. The hinge loss is calculated as,
$$L = \max(0, 1 - y \cdot h(y))$$
where
$y$ is the actual value of the output, encoded as $-1$ or $+1$, and
$h(y)$ is the classification score predicted by the model.
From the function, it can be seen that when $y \cdot h(y) \geq 1$, the loss is zero. When $y \cdot h(y) < 1$, the loss increases linearly as the prediction moves further onto the wrong side of the margin.
The figure below illustrates the hinge loss function for the actual value $y = 1$.
Similarly, the figure below illustrates the hinge loss function for the actual value $y = -1$.
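A minimal sketch of the hinge loss (the label/score pairs below are made up for illustration):

```python
import numpy as np

def hinge_loss(y_true, score):
    """Hinge loss for a label y in {-1, +1} and a raw classification score h(y)."""
    return np.maximum(0.0, 1.0 - y_true * score)

# Illustrative label/score pairs
print(hinge_loss(1, 2.3))    # 0.0 -> correct and beyond the margin, no penalty
print(hinge_loss(1, 0.4))    # 0.6 -> correct side but inside the margin
print(hinge_loss(-1, 0.8))   # 1.8 -> wrong side of the boundary, larger penalty
```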
Conclusion
Cost functions, also known as loss functions, are an essential part of training and building a robust model in data science. There are several types of cost functions used to train machine learning and deep learning models. In this article, we discussed some of the major cost functions, which are chosen based on the type of problem.
We hope you found this article insightful. If you are a beginner looking to learn data science, we have a detailed 3-month course specialization in data science. Start learning through TCR’s Data Science Courses!
Do you want to learn Python, Data Science, and Machine Learning while getting certified? Here are some best selling Datacamp courses that we recommend you enroll in:
- Introduction to Python (Free Course) - 1,000,000+ students already enrolled!
- Introduction to Data Science in Python - 400,000+ students already enrolled!
- Introduction to TensorFlow for Deep Learning with Python - 90,000+ students already enrolled!
- Data Science and Machine Learning Bootcamp with R - 70,000+ students already enrolled!