Linear Regression from Scratch in Python

Aloha Churchill
7 min read · Jun 16, 2020

Implementing basic gradient descent in Python to find parameters, step by step!

What is Linear Regression?

Linear regression means creating a linear function to model the relationship between two variables¹. It is what Excel and Google Sheets use to create a “trendline” for a data set. This article covers some of the intuition and math behind linear regression and shows one way to implement it in Python using a technique called gradient descent, which is commonly used in machine learning.

Figure 1. Example of trendline in Google Sheets

The Basics of Linear Regression

So where do we begin to create a linear model that can be fitted to any data? Let’s start with something familiar: the equation for a line.

The goal of linear regression is to find m, the slope, and b, the y-intercept, in the familiar equation y = mx + b. Instead of m and b, we’ll call these constants theta1 and theta0 respectively; theta1 and theta0 are called parameters. In linear regression and gradient descent, the predictive linear model is called the hypothesis and is expressed as:

h(x) = theta1 · x + theta0

Although there are multiple ways to calculate the parameters, in this article, I’ll be covering how to use gradient descent to solve for the parameters. Gradient descent is a general method of solving for parameters and comes in handy when there are more than two parameters to solve for. To begin gradient descent, we will initialize the parameters to arbitrary values. In this example, I’ll set both of them equal to 0.

params = [0, 0]

Now, say you have a data set of size n with input data (x) and output data (y). In code, we will simulate input data by creating a random 1D NumPy array.

import numpy as np
x = np.random.rand(100, 1)

Likewise, we will simulate the output data by using arbitrary parameters to create a linear relationship between the inputs and outputs. We will also add some noise to this data to create a more realistic simulation.

# test constants for the line
TEST_SLOPE = 5
TEST_Y_INT = 7
y = TEST_SLOPE * x + TEST_Y_INT + np.random.randn(100, 1)
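
The snippets that follow also assume a learning rate constant ALPHA and an iteration count ITERATIONS are defined globally. Here is one possible setup; these particular values are just illustrative picks, not prescriptions:

ALPHA = 0.1        # learning rate used inside costDeriv below
ITERATIONS = 2000  # number of gradient descent updates to run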

Now that we have a sample data set, let’s take a look at the math behind linear regression and gradient descent.

Cost Function

To fit our parameters as well as possible, we need a way to measure the error, or goodness of fit, of a model with given parameters theta0 and theta1. This error function is sometimes called the cost function and is shown below:

J(theta0, theta1) = (1/n) · Σ (h(x(i)) − y(i))²

where n is the number of given data points and Σ sums over every data point (x(i), y(i))

This function gives us a value called the mean squared error (MSE). Below is a visual representation of what the MSE measures. Essentially, this cost function takes the length of each yellow bar (the difference between the hypothesis’s prediction and the actual y value), squares it, and sums these differences across all data points (n terms in total) before averaging.

Figure 2. Calculating MSE. The blue dots represent the actual data, the blue line is the hypothesis prediction, the red dots are the corresponding predictions, and the yellow bars are the difference in values between the blue dots and red dots.

When implementing gradient descent with the cost function, it simplifies the later calculus to tweak the cost function a bit by adding a factor of 1/2 (shown below):

J(theta0, theta1) = (1/2n) · Σ (h(x(i)) − y(i))²

where n is still the number of data points

Now for the Python code implementation of the cost function.

# x is the x data set (list)
# y is the y data set (list)
# params is the list of parameters [theta0, theta1]
def cost(x, y, params):
    theta0 = params[0]
    theta1 = params[1]
    total = 0
    for i in range(len(x)):
        # squared difference between the hypothesis and the actual y value
        total += (theta1 * x[i] + theta0 - y[i]) ** 2
    return total / (2 * len(x))
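
As a quick sanity check, the same quantity can be computed in a single vectorized line with NumPy. This is a sketch of my own, assuming the x, y, and params arrays defined earlier:

# vectorized cost: half the mean squared error
vectorized_cost = np.mean((params[1] * x + params[0] - y) ** 2) / 2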

Gradient Descent with One Parameter

Our goal is to minimize the cost function. This is where we will implement a technique called gradient descent. First, let’s see if we can get the intuition behind gradient descent.

Let’s first say that we had only one parameter to find, so that our cost function simplifies to just J(theta) instead of J(theta0, theta1).

Now, for simplicity, let’s take J(theta) = theta². Our goal is to find a value of theta that minimizes this function. In this case, the solution is fairly obvious: theta = 0 gives us the minimum value.

Figure 3. Sample Graph of Cost Function J(theta) Versus Parameter Value theta. Credit to Desmos.

In gradient descent, we first pick a random value of theta. For example, if the algorithm started at theta = 2, it would somehow have to decrease theta until it reached the minimizing value of 0. The way gradient descent achieves this is by taking the derivative of the cost function at theta, and then adjusting theta by its derivative multiplied by a scalar called the learning rate, represented by the character 𝛼:

theta := theta − 𝛼 · dJ/dtheta

This may look daunting at first, but if we take a look at the graph, the intuition is relatively straightforward. When the initial theta is greater than the minimizing value, the slope given by the derivative is positive; subtracting that slope times a positive constant makes the new theta smaller, and thus closer to the minimum at theta = 0. Likewise, if the initial theta is less than the minimizing value, the derivative is negative, and subtracting it makes theta increase toward the minimum.
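
To make this concrete, here is a minimal sketch of gradient descent on J(theta) = theta², whose derivative is 2 · theta (the starting point and learning rate are arbitrary picks of mine):

theta = 2.0  # arbitrary starting point
alpha = 0.1  # learning rate
for _ in range(50):
    theta -= alpha * 2 * theta  # subtract alpha times dJ/dtheta
print(theta)  # ends up very close to the minimum at theta = 0

Starting from theta = −2 instead, the same update increases theta toward 0, exactly as the sign argument above predicts.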

Figure 4. Example of calculating parameters by using gradient descent.

Gradient Descent with Two Parameters

When there is more than one parameter, the calculus needed to minimize the cost function becomes a bit more involved. Here, we’ll need to use partial differentiation. Partial differentiation is similar to ordinary differentiation in that it describes a slope and is calculated similarly. In three-space, partial derivatives give us the slope of a surface in the x-direction and the y-direction separately.

Figure 5. Partial Differentiation²

Extrapolating from the one-parameter case, the gradient descent update rule for two parameters is similar to before:

theta0 := theta0 − 𝛼 · ∂J/∂theta0

theta1 := theta1 − 𝛼 · ∂J/∂theta1

Notice the := sign. We adjust each parameter according to its respective partial derivative, and the two updates must happen simultaneously. Because of this, I used a temporary array to store the old parameters. Below is the Python code to update the parameters.

for i in range(ITERATIONS):
    # snapshot current parameters --> simultaneous update
    temp0 = params[0]
    temp1 = params[1]
    temp = [temp0, temp1]
    # adjust parameters with gradient descent
    # (costDeriv already multiplies by the learning rate ALPHA)
    params[0] = temp0 - costDeriv(x, y, temp, 0)
    params[1] = temp1 - costDeriv(x, y, temp, 1)

The function costDeriv evaluates the expression 𝛼 · ∂J/∂theta_j, that is, the partial derivative of the cost function with respect to one parameter, scaled by the learning rate.

Mathematically, we use partial differentiation of the cost function. Applying the chain rule to J(theta0, theta1) = (1/2n) · Σ (theta1 · x(i) + theta0 − y(i))² gives:

∂J/∂theta0 = (1/n) · Σ (theta1 · x(i) + theta0 − y(i))

∂J/∂theta1 = (1/n) · Σ (theta1 · x(i) + theta0 − y(i)) · x(i)

Notice that the partial derivative with respect to theta1 is multiplied by x(i). This is because, in the hypothesis, theta1 multiplies x, so differentiating with respect to theta1 produces a factor of x(i).

So the final update equations are:

theta0 := theta0 − (𝛼/n) · Σ (theta1 · x(i) + theta0 − y(i))

theta1 := theta1 − (𝛼/n) · Σ (theta1 · x(i) + theta0 − y(i)) · x(i)

Figure 6. Gradient Descent Algorithm in Stanford’s Coursera course³

I implemented costDeriv in Python as follows:

# condition selects the partial derivative: 0 for theta0, anything else for theta1
def costDeriv(x, y, params, condition):
    theta0 = params[0]
    theta1 = params[1]
    total = 0
    for i in range(len(x)):
        if condition == 0:
            total += theta1 * x[i] + theta0 - y[i]
        else:
            total += (theta1 * x[i] + theta0 - y[i]) * x[i]
    # scale by the learning rate so callers can subtract the result directly
    adjustment = total * ALPHA / len(x)
    return adjustment
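
Putting the pieces together, a full fit might look like the sketch below. This is my own assembly of the snippets above; the (100, 1) NumPy arrays are flattened into plain Python lists so the loop-based helpers receive scalars:

xs = x.ravel().tolist()
ys = y.ravel().tolist()

params = [0.0, 0.0]
for _ in range(ITERATIONS):
    temp = [params[0], params[1]]
    params[0] = temp[0] - costDeriv(xs, ys, temp, 0)
    params[1] = temp[1] - costDeriv(xs, ys, temp, 1)

print(params)  # should land near [TEST_Y_INT, TEST_SLOPE], i.e. roughly [7, 5]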

Determining a learning rate (𝛼): if the learning rate is too large, gradient descent can overshoot the minimum and diverge; if it is too small, gradient descent can take a long time to converge. It was fun to play around with learning rates to see how well each one works. I will link an article HERE⁴ if you are interested in how to pick an optimal learning rate.
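
Both failure modes are easy to see on the toy cost J(theta) = theta² from earlier (the alphas below are arbitrary picks of mine for illustration):

def descend(alpha, steps=20):
    theta = 2.0
    for _ in range(steps):
        theta -= alpha * 2 * theta  # dJ/dtheta = 2*theta
    return theta

print(descend(0.1))    # converges toward 0
print(descend(1.1))    # |theta| grows every step: overshoot and divergence
print(descend(0.001))  # barely moves in 20 steps: convergence is very slow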

Determining the number of iterations: ideally, you want the two parameters to converge to stable values. Again, I played around with the number of iterations, but I will link another article HERE⁵ for a more in-depth explanation of iterations and the convergence of linear regression.
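
One practical alternative to hand-picking an iteration count is to stop once the cost stops improving. Below is a sketch of that idea, reusing cost and costDeriv from above; the tolerance 1e-10 is an arbitrary choice:

params = [0.0, 0.0]
previous_cost = float("inf")
for _ in range(ITERATIONS):
    temp = [params[0], params[1]]
    params[0] = temp[0] - costDeriv(xs, ys, temp, 0)
    params[1] = temp[1] - costDeriv(xs, ys, temp, 1)
    current_cost = cost(xs, ys, params)
    if abs(previous_cost - current_cost) < 1e-10:  # effectively converged
        break                                      # stop early
    previous_cost = current_cost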

Visualization of Linear Regression Using Gradient Descent

Figure 7. Error as a function of two parameters (for linear regression).

The figure below shows the path that gradient descent takes from an initial guess of [0, 0] for the parameters to convergence over the iterations.

Figure 8. Error as a function of parameters (blue) and the path of gradient descent (red)
Figure 9. Left: final linear approximation through the data; Right: MSE vs. number of iterations

Works Cited

[1]: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

[2]: Anton, Bivens, and Davis, Calculus: Early Transcendentals (Section 13.3)

[3]: From Andrew Ng’s Machine Learning course offered by Coursera

[4]: https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/

[5]: https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf
