Before discussing how a neuron learns, let us first understand how a human learns to solve a simple mathematical problem.
Learning to solve a mathematical problem
Consider the equation below and find values of $x$ and $y$ for which it holds, i.e., LHS = RHS:
$$256x + 212y = 1492$$
A simple yet time-consuming approach is to guess values of $x$ and $y$ at random until the equation is satisfied. Although this can work, it is only random guessing and no learning happens. This is similar to the chapter ‘What are neurons?’, where we randomly initialized the weights of the neuron and happened to receive the prediction ‘Obese’.
Repeated random sampling of this kind is the basic idea behind Monte Carlo simulation.
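The random-guessing approach can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text above: the guesses are restricted to integers between 0 and 10, the number of tries is capped, and the function name `random_guess` is made up for this example.

```python
import random

def random_guess(max_tries=100_000, low=0, high=10, seed=0):
    """Guess integer pairs (x, y) at random until 256x + 212y == 1492."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for tries in range(1, max_tries + 1):
        x = rng.randint(low, high)  # assumed search range: low..high inclusive
        y = rng.randint(low, high)
        if 256 * x + 212 * y == 1492:
            return x, y, tries      # a solution and how many guesses it took
    return None                     # gave up without finding a solution

result = random_guess()
print(result)
```

Nothing learned from a failed guess is carried over to the next one, which is why this counts as guessing rather than learning.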
The second, more appropriate approach is to optimize the values of the variables progressively, as follows:
- Randomly choose initial values of $x$ and $y$.
- Substitute the values of $x$ and $y$ into the given equation and calculate the error, i.e., the difference between the LHS and RHS.
- If the error is positive (LHS > RHS), decrease the values of $x$ and $y$ alternately until LHS = RHS.
- If the error is negative (LHS < RHS), increase the values of $x$ and $y$ alternately until LHS = RHS.
In this approach, we learn to correct our values of $x$ and $y$ based on the error (the difference between LHS and RHS). In a realistic scenario, it may take some time to reach the solution, but we can check along the way whether the error is decreasing and converging to 0.
We can also change the amount by which we increase or decrease $x$ and $y$ based on the error. If the error is large, we can take larger steps, and if the error is small, we can take smaller steps.
Taking a large step means that we may decrease the error quickly, but we may also overshoot and never reach 0. Taking a small step means that we approach 0 gradually, but the training process is slow. Choosing the right step size is therefore important.
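The error-correcting approach, including the effect of the step size, can be sketched as follows. The starting guess of 0 for both variables, the rule "step = `step_scale` × error", and the names `error_of` and `correct_by_error` are assumptions chosen for illustration, not part of the method described above.

```python
def error_of(x, y):
    """Error of the current guess: LHS - RHS."""
    return 256 * x + 212 * y - 1492

def correct_by_error(step_scale, max_iters=200, tol=1e-9):
    """Alternately adjust x and y against the error until it is near 0."""
    x, y = 0.0, 0.0                       # arbitrary starting guess
    for i in range(max_iters):
        error = error_of(x, y)
        if abs(error) < tol:
            break                         # LHS and RHS agree closely enough
        step = step_scale * error         # larger error -> larger step
        if i % 2 == 0:
            x -= step                     # positive error lowers x, negative raises it
        else:
            y -= step                     # same rule, applied to y on alternate turns
    return x, y, error_of(x, y)

small_x, small_y, small_err = correct_by_error(step_scale=0.001)
big_x, big_y, big_err = correct_by_error(step_scale=0.01)
print("small step:", small_x, small_y, small_err)
print("large step:", big_x, big_y, big_err)
```

With the smaller step scale the error shrinks towards 0, while with the larger one every correction overshoots by more than it fixes and the error grows instead.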
Which approach does a neuron follow for supervised learning?
The neuron uses the second approach for learning, i.e., variable optimization.
Just replace $x$ and $y$ with $w_1$ and $w_2$ and you’ll get a clear picture. Here, bias $b$ is 0.
$$ x_1 w_1 + x_2 w_2 + b = y $$
$$ \rightarrow 256w_1 + 212w_2 + 0 = 1492$$
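As a quick check of this correspondence, the weighted sum can be written as a small function. The function name `neuron_output` is hypothetical, no activation function is applied (an assumption made to match the equation above), and the weights $(5, 1)$ are just one pair that happens to satisfy the equation.

```python
def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus the bias (no activation applied)."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Inputs 256 and 212 with weights w1 = 5, w2 = 1 and bias 0
print(neuron_output([256, 212], [5, 1]))  # prints 1492.0
```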
Loss function and Gradient Descent
In standard practice, we use a loss function, $J(\textbf{w}_n)$, to calculate the loss (error) at each training step of a neuron, i.e., the difference between the predicted value and the actual value present in the training dataset.
Then, we try to progressively minimize the loss of the neuron using an optimization algorithm for a finite number of steps. A popular way to optimize the neuron’s loss is by using gradient descent.
Here is how gradient descent works:
- Choose a random set of weights $\textbf{w}_0$ initially.
- Feed the input into the neuron along with the weights and bias and calculate the loss using the loss function $J(\textbf{w}_n)$.
- Update the weights assigned to each input using the gradient formula.
- Repeat steps 2 and 3 for a finite number of iterations.
At each iteration $n = 0, 1, 2, 3, \dots$, the weights of the neuron are updated using the following formula:
$$ \textbf{w}_{n+1} = \textbf{w}_n - \alpha\nabla J(\textbf{w}_n) \ \ \ \dots \text{eqn}(i) $$
where $\textbf{w}_{n+1}$ represents the weights at iteration $n+1$, $\textbf{w}_n$ represents the weights at iteration $n$, $\alpha$ represents the learning rate (usually very small in practice, such as 0.001) and $\nabla J(\textbf{w}_n)$ represents the gradient of the loss function $J(\textbf{w}_n)$.
Here, $\nabla J(\textbf{w}_n)$ is the vector of partial derivatives of the loss function $J(\textbf{w}_n)$ with respect to each of the weights $(w_1, w_2, w_3, \dots, w_i)$:
$$\nabla J(\textbf{w}_n) = \begin{pmatrix} \dfrac{\partial J(\textbf{w}_n)}{\partial w_1} \\ \\ \dfrac{\partial J(\textbf{w}_n)}{\partial w_2} \\ \\ \vdots \\ \\ \dfrac{\partial J(\textbf{w}_n)}{\partial w_i} \end{pmatrix} $$
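To make the gradient concrete, here is a minimal sketch for the running example. The text does not fix a particular loss function, so this assumes a half squared error, $J(\textbf{w}) = \frac{1}{2}(x_1 w_1 + x_2 w_2 + b - y)^2$, whose partial derivative with respect to $w_k$ is $(x_1 w_1 + x_2 w_2 + b - y)\,x_k$. The function names are hypothetical, and the finite-difference version is included only as a sanity check on the formula.

```python
X = [256.0, 212.0]  # inputs from the running example
Y = 1492.0          # target value
B = 0.0             # bias, fixed at 0 as in the text

def loss(w):
    """Assumed loss: half the squared difference between prediction and target."""
    pred = sum(x * wi for x, wi in zip(X, w)) + B
    return 0.5 * (pred - Y) ** 2

def gradient(w):
    """Partial derivatives of the assumed loss: (pred - Y) * x_k for each weight."""
    pred = sum(x * wi for x, wi in zip(X, w)) + B
    return [(pred - Y) * x for x in X]

def numeric_gradient(w, h=1e-4):
    """Central-difference approximation of each partial derivative."""
    grads = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += h
        down[k] -= h
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads

w0 = [1.0, 2.0]                 # an arbitrary set of weights
analytic = gradient(w0)
numeric = numeric_gradient(w0)
print(analytic)
print(numeric)
```

Both gradients are negative at this point, which says that increasing either weight would reduce the loss.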
Now, let us have a good look at $\text{eqn}(i)$ and understand what gradient descent means for a neuron.
At the start of the algorithm, we choose a random set of weights. Then, the weights for the next iteration are calculated by subtracting the gradient of the loss function, scaled by the learning rate $\alpha$, from the weights of the previous iteration.
The speed of descent is decided by $\alpha$. The greater the value of $\alpha$, the faster the descent, and vice versa, although too large a value can overshoot the minimum, just as a large step did in the equation example. We choose a finite number of iterations to run this entire process, and by the end of the training run, we hope to have minimized the loss by finding suitable values of the weights.
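The whole procedure can be sketched for the running example as below. This is an illustration under stated assumptions rather than a definitive implementation: the loss is taken to be half the squared error (the text does not specify one), the bias is held at 0, the initial weights are drawn from $[-1, 1]$, and `gradient_descent` and its parameters are hypothetical names. The learning rate is set well below the 0.001 mentioned above because the inputs 256 and 212 are large and unscaled, and with this assumed loss a value of 0.001 makes the updates overshoot.

```python
import random

X = [256.0, 212.0]  # inputs from the running example
Y = 1492.0          # target value
B = 0.0             # bias, fixed at 0 as in the text

def predict(w):
    """Weighted sum of the inputs plus the bias."""
    return sum(x * wi for x, wi in zip(X, w)) + B

def loss(w):
    """Assumed loss J(w): half the squared prediction error."""
    return 0.5 * (predict(w) - Y) ** 2

def gradient(w):
    """Gradient of the assumed loss with respect to each weight."""
    err = predict(w) - Y
    return [err * x for x in X]

def gradient_descent(alpha=5e-6, steps=200, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in X]  # step 1: random initial weights
    history = [loss(w)]                      # step 2: loss at the current weights
    for _ in range(steps):                   # step 4: repeat a finite number of times
        grad = gradient(w)
        # step 3: w_{n+1} = w_n - alpha * grad J(w_n)
        w = [wi - alpha * gi for wi, gi in zip(w, grad)]
        history.append(loss(w))
    return w, history

w, history = gradient_descent()
print("weights:", w)
print("prediction:", predict(w))
print("first loss:", history[0], "last loss:", history[-1])
```

The learned weights need not be $(5, 1)$: one equation in two unknowns has many solutions, and gradient descent settles on one that depends on the starting weights.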
This is how a neuron learns to make accurate predictions: by repeatedly adjusting its weights to reduce the loss. In the next lesson, we will finally see how a neuron acts as the backbone of deep learning by studying deep neural networks.