Simple Linear Regression is one of the simplest and most fundamental regression algorithms. Linear regression models the linear relationship between a dependent variable, $y$, and one or more independent variables, $x$, using a straight line.
In linear regression, the value of $y$ is calculated by multiplying each input variable $x$ by a constant, adding the products together, and then adding a constant offset. Mathematically, the relationship between the dependent variable, $y$, and the independent variable(s), $x$, can be represented as,
$$y = f(x)$$
where the value of $y$ is a function of $x$.
The case with a single independent variable is called Simple Linear Regression. With more than one independent variable, the process is called Multiple Linear Regression.
What is Simple Linear Regression?
Simple Linear Regression is a linear regression algorithm used on datasets containing a single dependent variable and a single independent variable. It is also sometimes referred to as linear regression with a single variable.
The relationship between the two variables is obtained by multiplying the independent variable by a constant value (known as the weight) and then adding a constant (known as the bias) to the product. Mathematically, the formula for simple linear regression can be represented as,
$$y = wx + b$$
where,
$y$ is the dependent variable,
$x$ is the independent variable,
$w$ is the weight, and
$b$ is the bias or the intercept.
If you are familiar with coordinate geometry, the equation for simple linear regression is the same as the equation of a straight line, i.e., $y = mx + c$, where $m$ represents the slope of the straight line and $c$ represents the $y$-intercept of the line. So, a simple linear regression model is nothing but a straight line in two dimensions.
The goal of a simple linear regression model in Machine Learning is to find the values of the weight, $w$, and the bias, $b$, that best predict the value of $y$ for a given value of $x$. This is also known as ‘fitting the model to the available data’.
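The formula $y = wx + b$ can be written as a one-line function. The sketch below is only an illustration: the function name `predict` and the parameter values are assumptions chosen for the example, not values from any fitted model.

```python
def predict(x, w, b):
    """Return the model output y = w * x + b for a single input x."""
    return w * x + b


# Hypothetical weight and bias, chosen only to illustrate the formula
w_example, b_example = 2.0, 1.0

y_example = predict(3.0, w_example, b_example)
print(y_example)  # 7.0
```

Fitting the model then amounts to choosing `w` and `b` so that `predict` returns values close to the observed $y$ for the observed $x$.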
The intuition behind Simple Linear Regression
For a more intuitive understanding of linear regression with a single variable, let us consider the following example.
The data provided below is of a company that has 5 employees. In the available data, we can observe the salary of the employees along with their respective years of experience.
| Years of Experience | Salary (in USD) |
|---|---|
| 5 | 100,000 |
| 4 | 80,000 |
| 3 | 60,000 |
| 2 | 40,000 |
| 1 | 20,000 |
Since the data above has only two columns, it can easily be plotted on a two-dimensional graph to better observe the relationship between the variables.
Now, suppose a new employee is about to join the company and their salary is to be determined from their years of experience. This is a typical regression problem.
Since the plotted data displays a linear relationship, we can solve this problem using Simple Linear Regression, where the dependent variable ($y$) is Salary (in USD) and the independent variable ($x$) is Years of Experience.
The equation of Simple Linear Regression is as follows,
$$y = f(x) = wx + b$$
Machine Learning is an iterative process where the data is fed to the model multiple times with the aim of improving the model. In the first iteration, we do not have any idea about what the value of the weight ($w$) and the value of the bias ($b$) should be. So, we randomly initialize the values of $w$ and $b$ to form a random straight line as shown in the graph below.
This is our starting point for the Simple Linear Regression algorithm.
Now, to calculate how well the random line fits the given data, we use a special kind of function known as the cost function. The cost function measures how badly the model is performing by comparing the actual target values with the values predicted by the model. This is also known as calculating the loss or error of the model.
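As a rough illustration of a cost function, the sketch below computes the mean squared error for two candidate lines on the salary data. The choice of mean squared error, the function name, and the starting guess of $w = 10000$, $b = 5000$ are assumptions made for this example.

```python
# Salary data from the table above
years = [1, 2, 3, 4, 5]
salaries = [20000, 40000, 60000, 80000, 100000]


def mean_squared_error_cost(w, b, xs, ys):
    """Average squared difference between actual and predicted values."""
    squared_errors = [(y - (w * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(xs)


# An arbitrary starting guess (assumed values for illustration)
cost_guess = mean_squared_error_cost(10000, 5000, years, salaries)

# The line that passes through every point in the table
cost_exact = mean_squared_error_cost(20000, 0, years, salaries)

print(cost_guess)  # 825000000.0
print(cost_exact)  # 0.0
```

The poorly fitting guess produces a large cost, while the line that matches the data has a cost of zero.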
In the graph below, the blue line indicates the loss between the actual data points and the predicted values.
Once the loss has been calculated, gradient descent is applied, which updates the values of the weight and the bias so that the loss between the actual values and the predicted values is reduced.
In the above graph, the Simple Linear Regression algorithm seems to fit the data perfectly in just one iteration of gradient descent.
In most real-world problems, the process of computing the loss (using the cost function) and performing gradient descent is repeated many times until the loss stops decreasing meaningfully. This whole process is collectively called model training.
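The training loop described above can be sketched in plain Python. This is a minimal illustration, not the procedure scikit-learn uses: it assumes mean squared error as the cost function, starts from $w = 0$ and $b = 0$ rather than random values so that the run is reproducible, and uses a learning rate and iteration count picked by hand for this small dataset.

```python
# Salary data from the table above
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salaries = [20000.0, 40000.0, 60000.0, 80000.0, 100000.0]


def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y = w * x + b by repeatedly stepping against the MSE gradient."""
    n = len(xs)
    w, b = 0.0, 0.0  # fixed starting point (assumption; the text uses random values)
    for _ in range(iterations):
        # Prediction error for every data point with the current w and b
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move both parameters a small step in the direction that lowers the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w_fit, b_fit = gradient_descent(years, salaries)
print(round(w_fit, 3), round(b_fit, 3))
```

With these settings the loop should settle very close to $w = 20000$ and $b = 0$. A learning rate that is too large would make the updates diverge instead, which is why it was chosen conservatively here.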
Predicting using the Simple Linear Regression model
Once the model is successfully trained, the best-fitting values of $w$ and $b$ are obtained. Using these values, we can then predict the target value (Salary) for any value of the feature (Years of Experience).
As an example, let us try predicting the salary for an employee with 6 years of experience, assuming the best-fitting values are $w = 20000$ and $b = 0$.
| Years of Experience | Salary (in USD) |
|---|---|
| 6 | ? |
| 5 | 100,000 |
| 4 | 80,000 |
| 3 | 60,000 |
| 2 | 40,000 |
| 1 | 20,000 |
The Simple Linear Regression formula is as follows,
$$y = wx + b = 20000x + 0$$
The above equation can be interpreted as: for each one-unit increase in $x$, the value of $y$ increases by 20,000. So, for an employee with 6 years of experience, the predicted salary would be 120,000 USD.
$$y = 20000 \times 6 + 0 = 120000$$
We have no way of knowing whether this prediction is correct, but it is consistent with the pattern in the available dataset. This is how we can predict using our trained model.
| Years of Experience | Salary (in USD) |
|---|---|
| 6 | 120,000 |
| 5 | 100,000 |
| 4 | 80,000 |
| 3 | 60,000 |
| 2 | 40,000 |
| 1 | 20,000 |
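The same arithmetic can be checked in a few lines of Python. The function name `predict_salary` is hypothetical; the values of `w` and `b` are the best-fitting values quoted above.

```python
# Best-fitting values quoted in the text
w, b = 20000, 0


def predict_salary(years_of_experience):
    """Apply y = w * x + b to a number of years of experience."""
    return w * years_of_experience + b


# Salary data from the table, as (years, salary) pairs
known_rows = [(1, 20000), (2, 40000), (3, 60000), (4, 80000), (5, 100000)]

# The line reproduces every known row exactly
matches_table = all(predict_salary(x) == y for x, y in known_rows)

print(matches_table)      # True
print(predict_salary(6))  # 120000
```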
Simple Linear Regression in Python
Now that we know the basic idea of Simple Linear Regression, here is a step-wise Python implementation of the algorithm.
1. Importing necessary libraries
First, let us import some essential Python libraries.
    # Importing necessary libraries
    import numpy as np               # for array operations
    import matplotlib.pyplot as plt  # for data visualization

    # Jupyter/IPython magic; omit this line when running as a plain Python script
    %matplotlib inline

    # scikit-learn for model building and validation
    from sklearn.linear_model import LinearRegression  # for building the model
    from sklearn.metrics import mean_squared_error     # for calculating the cost function
2. Creating the dataset
For the Salary determination problem, we will be manually creating a dataset using NumPy.
    # Creating a dummy dataset using NumPy
    # Years of experience (reshaped to a column, as scikit-learn expects 2-D features)
    x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))

    # Salary (in USD)
    y = np.array([20000, 40000, 60000, 80000, 100000])
3. Fitting the model to the data
We will now initialize a Linear Regression model and fit it to the training data. This is done with the help of the LinearRegression class of scikit-learn.
    # Initializing the Linear Regression model
    model = LinearRegression()

    # Fitting the Simple Linear Regression model to the data
    model.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
We have trained (fitted) the model in just two lines of code!
4. Summarizing the model
The goal of model training is to determine the values of the x-coefficient (weight) and the intercept (bias) that result in the straight line that best fits the data. Let us print the values of these variables from the fitted model.

    # x-coefficient (weight)
    print("\nCoefficients: \n", model.coef_)

    # Intercept (bias)
    print("\nIntercept: \n", model.intercept_)
    Coefficients:
     [20000.]

    Intercept:
     0.0
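For a single feature, the least-squares line also has a well-known closed-form solution, which gives a quick cross-check of the printed values. The sketch below is an independent calculation with NumPy, not a description of how scikit-learn computes its result. The variable names `w_closed` and `b_closed` are assumptions for this example.

```python
import numpy as np

# Same dataset as above, as flat float arrays
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([20000, 40000, 60000, 80000, 100000], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of products of deviations divided by sum of squared deviations in x
w_closed = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through the point (x_mean, y_mean)
b_closed = y_mean - w_closed * x_mean

print(w_closed, b_closed)  # 20000.0 0.0
```

Both numbers agree with the coefficient and intercept reported by the fitted model.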
5. Calculating the loss after training
Let us now calculate the loss between the actual target values and the values predicted by the model with the use of a cost function called the Root Mean Square Error (RMSE).
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2}}$$
where,
$y_i$ is the actual target value,
$\hat{y_{i}}$ is the predicted target value, and
$n$ is the total number of data points.
The RMSE of a model measures the absolute fit of the model to the data. In other words, it indicates how close the actual data points are to the model’s predicted values. A lower RMSE indicates a better fit, which makes it a useful measure of the accuracy of the model’s predictions.
    # Predicting the target values using the model
    y_pred = model.predict(x)

    # RMSE (Root Mean Square Error) as the cost function, rounded to 3 decimal places
    rmse = round(float(np.sqrt(mean_squared_error(y, y_pred))), 3)
    print("\nRMSE:\n", rmse)
    RMSE:
     0.0
Here, an RMSE of zero indicates that the line fits the training dataset perfectly.
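To see how the RMSE formula behaves when the fit is not perfect, it can be computed directly with NumPy. In the sketch below, the function name `rmse` and the imperfect predictions in `y_off` are hypothetical and serve only as an illustration: each prediction misses its target by exactly 1,000.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = [20000, 40000, 60000, 80000, 100000]

# Hypothetical imperfect predictions, each off by 1,000 in either direction
y_off = [21000, 39000, 61000, 79000, 101000]

print(rmse(y_true, y_true))  # 0.0
print(rmse(y_true, y_off))   # 1000.0
```

Because RMSE is expressed in the same unit as the target, a value of 1000.0 here reads as a typical error of about 1,000 USD.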
6. Visualizing the results
Let us now visualize the results by plotting the actual target values and the predicted target values to see how well the model is fitted.
    # Plotting the results over the data
    plt.figure(figsize=(10, 6))
    plt.scatter(x, y, color='r')             # actual data points
    plt.plot(x, y_pred, color='#20ad96')     # fitted regression line
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary (USD)')
    plt.show()
Here, the line looks to fit the data values perfectly.
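The fitted model can also be used on values it has not seen, such as the employee with 6 years of experience from the earlier example. The sketch below refits the model so that it runs on its own. The names `x_new` and `salary_pred` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same dataset as above
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([20000, 40000, 60000, 80000, 100000])

model = LinearRegression().fit(x, y)

# New input: scikit-learn expects a 2-D array of shape (n_samples, n_features)
x_new = np.array([[6]])
salary_pred = model.predict(x_new)

print(salary_pred)
```

The prediction should be approximately 120,000, matching the value computed by hand earlier.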
Putting it all together
The final code for the implementation of Simple Linear Regression in Python is as follows.
    # Importing necessary libraries
    import numpy as np               # for array operations
    import matplotlib.pyplot as plt  # for data visualization

    # Jupyter/IPython magic; omit this line when running as a plain Python script
    %matplotlib inline

    # scikit-learn for model building and validation
    from sklearn.linear_model import LinearRegression  # for building the model
    from sklearn.metrics import mean_squared_error     # for calculating the cost function

    # Creating dummy data using NumPy
    # Years of experience
    x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))

    # Salary (in USD)
    y = np.array([20000, 40000, 60000, 80000, 100000])

    # Initializing the Linear Regression model
    model = LinearRegression()

    # Fitting the Simple Linear Regression model to the data
    model.fit(x, y)

    # x-coefficient (weight)
    print("\nCoefficients: \n", model.coef_)

    # Intercept (bias)
    print("\nIntercept: \n", model.intercept_)

    # Predicting the target values using the model
    y_pred = model.predict(x)

    # RMSE (Root Mean Square Error) as the cost function, rounded to 3 decimal places
    rmse = round(float(np.sqrt(mean_squared_error(y, y_pred))), 3)
    print("\nRMSE:\n", rmse)

    # Plotting the results over the data
    plt.figure(figsize=(10, 6))
    plt.scatter(x, y, color='r')             # actual data points
    plt.plot(x, y_pred, color='#20ad96')     # fitted regression line
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary (USD)')
    plt.show()
We have now covered the fundamentals of Linear Regression with a single variable, one of the primary Machine Learning algorithms, and implemented it in Python. In the next lesson, we will move on to implementing Linear Regression for data with multiple variables.