Python Linear Regression with SciPy, SciKitLearn, TensorFlow1, and TensorFlow2

This is more of a reference for myself on the multitude of ways that something as computing simple linear regression can be done with Python, SciKitLearn, TensorFlow1, and TensorFlow2. Also, this is a bit of a stepping stone type of project, where the next steps involve non-linear regression, multi-non linear regression in the form of image classification.

Note: the current focus is just linear regression, other items such as extrapolation and accuracy are not really considered (however they still might be mentioned).

Brief Background…

We are trying to find a line that best matches our data (that is the line of best fit) which is in the form:

y = mx + b \quad \mathsf{or} \quad y = \beta x + \alpha \quad \mathsf{or} \quad y = w\cdot x + \beta

Where the slope is represented by m (or \beta or w for weight), and the y-intercept is represented by b (or \alpha or \beta for bias). Note different texts and domains use slightly different notation, but the idea is still the same.

Analytical Solutions…

The line of best fit problem can be solved a two different ways: analytically and numerically. For the analytical solution, there is a classic solution which can be found by using least squares:

\begin{aligned} \mathsf{Slope} &= \frac{(n \sum xy – \sum x \sum y)}{ n \sum x^2 – \left(\sum x\right)^2} \\ \mathsf{Intercept} &= \frac{\sum y \left(\sum x\right)^2 – \sum x \sum xy}{n \sum x^2 – \left(\sum x\right)^2} = \frac{\sum y}{n}\ -\ \mathsf{Slope} \times \frac{\sum x}{n} \end{aligned}

The coefficient of determination (R^2) or goodness of fit, can be determined to be:

\begin{aligned} R^2 &= \frac{\left(\overline{xy} – \bar{x}\bar{y}\right)^2}{\left(\overline{x^2} – \bar{x}^2\right) \left(\overline{y^2} – \bar{y}^2 \right)} \\ &= 1 – \frac{\sum \left(y_i – \left(mx_i + b\right)\right)^2}{\sum (y_i – \bar{y})} \end{aligned}

The Normal Equation

The normal equation is another analytical technique to determine the coefficients of the line of best fit. It is more generic than the previous analytical technique and can work with multiple features or variables.

First we establish a hypothesis function:

h(\theta) = \theta_0x_0 + \theta_1x_1 + \cdots + \theta_nx_n

Where the \theta values are the weights/bias (or slope and intercept) and x_i corresponds to a value for each specific feature in the dataset.

To keep things a little more simple, the hypothesis function can be written in matrix form as:

h(\theta) = \theta^{\mathsf{T}}X

The goal is to minimize the cost function which is the mean squared error:

\mathsf{MSE}(X, h) = \frac{1}{2m} \sum_{i=1}^m\left[ \theta^{\mathsf{T}}x^{(i)} – y^{(i)} \right]^2

The cost function basically is the sum of the residuals. Using matrices, it can be rewritten as…

\mathsf{MSE}(X, h) = \frac{1}{2m} \left[ X\theta – Y \vphantom{\frac{1}{1}}\right]^{\mathsf{T}}\left[ X\theta – Y \vphantom{\frac{1}{1}}\right]

Finding the optimal \theta value can be found by taking the derivative and setting it to zero:

\begin{aligned} \frac{\partial \mathsf{MSE}(X, h)}{\partial \theta} &= \frac{1}{2m} \frac{\partial }{\partial \theta} \left( \left[ X\theta – Y \vphantom{\frac{1}{1}}\right]^{\mathsf{T}}\left[ X\theta – Y \vphantom{\frac{1}{1}}\right] \right) \\ &= \frac{1}{2m} \left( 2X^{\mathsf{T}}X\theta – 2X^{\mathsf{T}}Y\right) \end{aligned}

And setting it equal to zero…

\begin{aligned} \frac{1}{2m} \left( 2X^{\mathsf{T}}X\theta – 2X^{\mathsf{T}}Y\right) &= 0 \\ X^{\mathsf{T}}X\theta – X^{\mathsf{T}}Y &= 0 \\ X^{\mathsf{T}}X\theta &= X^{\mathsf{T}}Y \\ \theta &= \left( X^{\mathsf{T}}X \right)^{-1} X^{\mathsf{T}}Y \end{aligned}

Solving with Python…

The above formulas for the slope and intercept can be written in whatever language very easily; a Python implementation is available here. A lot of other Python packages also provide an analytical solution.

Python – Numpy

In Numpy, the polyfit function is used to perform regression; where the x and y values are specified along with the degree of the curve. To interact with the line of best fit the poly1d function is used. Note that no R-Squared function and thus a separate function needs to be written/found.

import numpy as np
coefficients = np.polyfit (xValues, yValues, 1)
print ('Slope: ', coefficients[0], ' Intercept: ', coefficients[1])

bestFit = np.poly1d (coefficients)
yTest = bestFit (xTest)

An example implementation is available here.

Python – SciPy

SciPy which can be considered a superset package that contains Numpy and other libraries. It also contains a very comprehensive statistics module which contains a linear regression function that (also computes the R-squared value).

from scipy import stats
slope, intercept, rValue, pValue, stdErr = stats.linregress (xValues, yValues)

print ('Slope: ', slope, ' Intercept: ', intercept, ' RSquared: ', rValue**2)

An example implementation is available here.

Python – SciKit-Learn

SciKit-Learn contains many regression tools. Instead of just calling the solution a line of best fit, it is referred to as a model since it is more generic. Also this package is more in line for machine learning and has a variety of other methods including those to test the accuracy of the prediction.

from sklearn import linear_model
import numpy as np

# convert to Numpy arrays
xValues = np.array (xValues).reshape(-1, 1)
yValues = np.array (yValues).reshape(-1, 1)

# Create linear regression object and 'Train' the model using the training sets
linReg = linear_model.LinearRegression ()
linReg.fit (xValues, yValues)

# The coefficients
print ('Slope: ', linReg.coef_, ' Intercept: ', linReg.intercept_, 'RSquared: ', linReg.score (xValues, yValues))

Note that the R-Squared value can be computed two different ways: using the score function and the r2_score function from sklearn.metrics. Also, once the model has been fit, it can be evaluated using the predict function.

An example implementation is available here.

Python and the Normal Equation…

In order to have an intercept component, a column of ones needs to be added onto the list of x values. Note the numpy c_ function ‘Translates slice objects to concatenation along the second axis’; that is since it is such a common occurance to concatenate numpy matrices/vectors, a utility function exists to make the process that easier.

X = np.c_ [ np.ones ((len(xValues), 1)), xValues]

Then the theta values can be computed using the various linear algebra functions in numpy…

thetaBest = np.linalg.inv (X.T.dot (X)).dot (X.T).dot (yValues)

Finally, the results in the theta array correspond to the same indices for each feature, that is index 0 corresponds to x_0 which is the intercept, and index 1 corresponds to x_1 which is the slope.

Solving Numerically

Another technique when solving regression problems is using solving them numerically. The most often used technique involves the minimization of a cost (or objective or loss) function. One of the main benefits of having a numerical solution is that it can solve problems that analytical cannot, however there are a few issues including converging to a local solution or even diverging to no solution.

Gradient descent is perhaps the most or one of the more fundamental ways to solve regression problems. It can be adapted to a wide variety of other problems so it is frequently used and has implementation in a wide variety of mathematical frameworks. The key idea is to use the derivative of the cost function to go from a high start value to (hopefully) the minimal value.

There are a few important elements to use gradient descent. Instead of blindly just adding in the derivative each step, a scaling factor called the learning rate is used to control how quick the algorithm converges to the solution. If its too low, it will take a long time to converge, but if its too high it can step over the minimum (and possibly diverge and not find a solution). The other important element is to guarantee that the minimum found is the global minimum, the cost function needs to be convex function (like a parabola).

One very common cost functions used is the mean square error (or MSE). Mathematically, it is written as:

\mathsf{Loss\ or\ Cost}\ \Rightarrow\ \mathsf{MSE} = \frac{1}{N} \sum (y_i\ -\ \hat{y}_i)^2

Note that y_i and \hat{y}_i are the given and predicted values of y. Also, N is the number of values in the dataset.

Since \hat{y}_i involves two variables: the slope and intercept (recall y = w\cdot x + \beta), then two partial derivatives required. The math is easy enough and listed below:

\begin{aligned} \mathsf{MSE} &= \frac{1}{N} \sum (y_i\ -\ \underbrace{(w\cdot x_i + \beta)}_{\hat{y}_i} )^2 \\ \frac{\partial \mathsf{MSE}}{\partial w} &= \frac{1}{N} \sum -2x_i (y_i\ -\ (w\cdot x_i + \beta) ) \\ & \\ \frac{\partial \mathsf{MSE}}{\partial \beta} &= \frac{1}{N} \sum -2 (y_i\ -\ (w\cdot x_i + \beta) ) \\ \end{aligned}

These values are then computed and used for each ‘learning’ step along with \alpha, the learning rate, as follows:

\begin{aligned} w_\mathsf{current} &= w_\mathsf{previous} – \alpha \cdot \frac{\partial \mathsf{MSE}}{\partial w} \left( w_\mathsf{previous}, \beta_\mathsf{previous}\right) \\ & \\ \beta_\mathsf{current} &= \beta_\mathsf{previous} – \alpha \cdot \frac{\partial \mathsf{MSE}}{\partial \beta} \left( w_\mathsf{previous}, \beta_\mathsf{previous}\right) \\ \end{aligned}

The overall process is repeated either for a certain number of times or until there is a very small change in the loss function (this threshold is usually labeled as epsilon or \epsilon).

A more generic technique that can be used with multi-regression problems is to deal with all of the \theta at once in the form of a vector. Then the derivative (or gradient vector) becomes:

\nabla_\theta \mathsf{MSE}(\theta) = \left[ \begin{array}{c} \frac{\partial}{\partial \theta_0} \mathsf{MSE}(\theta) \\ \frac{\partial}{\partial \theta_1} \mathsf{MSE}(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \mathsf{MSE}(\theta) \\ \end{array} \right] = \frac{2}{m}X^{\mathsf{T}}(X\theta – y)

Note that there are a few variations with the gradient descent algorithm. Batch gradient descent involves performing the algorithm on the whole data set each iteration (and is used in our implementation). Stochastic gradient descent uses a random sample of the data for each iteration. This is preferred method when dealing with large scale datasets. Finally, mini-batch gradient descent computes the gradient over small sub-sets (called mini-batches).

An example implementation is available here.

TensorFlow

TensorFlow is another animal altogether when solving regression problems. While it has all the optimization infrastructure to numerically solve such problems, it is better suited for larger scale problems (that involve neural networks and parallelization). With linear regression, on a slower machine with a very simple data set, just loading the needed packages and building the model can take longer than using the analytical solution.

TensorFlow 1

The TensorFlow 1 (TF1) solution is only being added for completeness since most of the time TensorFlow 2 (TF2) is used since it is much simpler and easier to use (especially with Keras). Also, there are a lot of existing solutions out there that use TF1 so it is good to have some awareness of how the older system works.

With TF1 there are a few extra steps that need to be done. First, data and variables need to be declared. Note that it is perhaps not best practice to use tf.constant variables instead of tf.placeholders, but for solving linear regression in a system that isn’t used much anymore, its fine.

import numpy as np
import tensorflow as tf

# constants
constX = tf.constant (xValues)
constY = tf.constant (yValues)

# variables
varSlope = tf.Variable (tf.random.normal([]))
varInter = tf.Variable (tf.random.normal([]))

Next the model and loss function is defined. Note that the model is just the equation of a line while loss is the mean square error (MSE), that is:

# Compute model and loss
model    = tf.add (tf.multiply (constX, varSlope), varInter)
lossFunc = tf.reduce_mean (tf.pow (model - constY, 2))

Then the optimizer is declared. Note that a few different types of optimizers can be set (such as gradient, momentum, and adam). Also, the learning rate is also specified at the same time.

learn_rate  = 0.001
optimizer = tf.train.AdamOptimizer (learn_rate).minimize (lossFunc)

Finally a session is instantiated and run until the loss is less than the threshold.

with tf.Session () as sess:
sess (tf.global_variables_initializer ())

while (1):
_, loss = sess.run ( [optimizer, lossFunc] )

deltaLoss = abs (previousLoss - loss)
solvedM   = sess.run (varSlope)
solvedB   = sess.run (varInter)

if epoch % 1000 == 0:
print ('    -> epoch = ', epoch,  ' m = ', solvedM, ' b = ', solvedB, ' loss = ', loss, ' deltaLoss = ', deltaLoss)

if (deltaLoss < LOSS_THRESHOLD):
break

previousLoss = loss
epoch += 1

An example implementation is available here.

TensorFlow 2

The TF2 solution is both simpler and more complicated. One doesn’t necessarily need to worry about creating the correct type of TensorFlow variable or session. However, there are other items such as making sure that the gradient tape object is properly (one can not have persistent instances of the gradient tape).

A class is used to keep track of the slope and intercept (or weight and bias).

class LinearModel:
def __init__ (self):
self.Slope = tf.Variable (1.0)   # note slope is the weight
self.Inter = tf.Variable (1.0)   # note inter is the bias

def __call__ (self, x):
return self.Slope * x + self.Inter

Next the same loss function used again…

def lossFunc (yActual, yPred):
return tf.reduce_mean (tf.square (yActual - yPred))

A separate training function is used to update the slope and intercept. Note that the default behaviour of the gradient tape object is to be non-persistent. By keeping it in a separate function, its lifecycle is guaranteed to be destroyed properly.

def train (linModel, x, y, learnRate=0.0001):

currLoss = lossFunc (y, linModel(x))

dSlope, dInter = t.gradient (currLoss, [linModel.Slope, linModel.Inter])

linModel.Slope.assign_sub (learnRate * dSlope)
linModel.Inter.assign_sub (learnRate * dInter)

return currLoss

Finally, the main training function is pretty simple…

epoch    = 1
prevLoss = 9e9
linModel = LinearModel ()

while (1):

currLoss = train (linModel, xValues, yValues, learningRate)

if epoch % 1000 == 0:
print (f"Epoch: {epoch}  Loss value: {currLoss.numpy()}  Slope: {linModel.Slope.numpy()}  Intercept: {linModel.Inter.numpy()}")

deltaLoss = abs (currLoss - prevLoss)
if deltaLoss < MIN_DELTA or epoch > MAX_EPOCHS:
break

prevLoss = currLoss
epoch += 1

print (linModel.Slope.numpy(), linModel.Inter.numpy())

An example implementation is available here.

TensorFlow has other solutions to solve linear regression, specifically the estimator class. It should make solving problems much easier, but there are a fair bit of problems with it. Some of these include incompatibility with various methods and data structures between various versions of TensorFlow (TF1, TF2, and the current version). Some of the compatibility issues are addressed by prepending ‘tf.compat.v1’ onto various offending code. Other issues can be solved by migrating code so that is used ‘tf.data’. Since the code has changed so much, I believe for now the above existing solutions are good for now.

Notes on Regularization

With multivariate regression (that is when dealing with multiple variables), there is a chance that a model can overfit the data. Regularization makes it harder for the model to overfit the data. With a linear model, regularization constrains the model’s weights.

Ridge Regularization

An additional term is added to the cost function which forces the learning algorithm to keep the model weights as small as possible (along with fitting the model). Note that Ridge regularization is also known as Tikhonov regularization.

The cost function now becomes:

J(\theta) = \mathsf{MSE}(\theta) + \alpha\frac{1}{2}\sum_{i = 1}^n\theta^2_i

Note:

• The bias term (\theta_0) is not regularized.
• It is important to scale the data before using this (or any regularization)
Lasso Regularization

First we should note that Lasso Regression stands for Least Absolute Shrinkage and Selection Operator Regression. In this case, the added regularization term uses the \ell_1 norm instead of one half of the \ell_2 norm.

The Lasso cost function becomes…

J(\theta) = \mathsf{MSE}(\theta) + \alpha\sum_{i = 1}^n\left|\theta_i\right|
Elastic Net Regularization

Elastic net is a mix between Ridge and Lasso regularization. The mix is controlled by a r value (which is bound between 0 and 1).

The cost function is a mix of both…

J(\theta) = \mathsf{MSE}(\theta) + r\alpha\sum_{i = 1}^n\left|\theta_i\right| + \frac{1 – r}{2}\sum_{i = 1}^n\theta^2_i