Non-Linear Regression… using Python, Javascript, NumPy, and TensorFlow

This is a continuation to the previous linear regression document. However this time the focus is on a few different ways to perform non-linear regression using Javascript and various Python frameworks (such as NumPy, SciKitLearn, and TensorFlow).

Quadratic Regression

Perhaps this is the first logical step beyond linear regression. A parabola of best fit is found for the dataset as opposed to a line of best fit. We should note that quadratic regression is not necessarily a regression problem, instead it is usually referred to as curve fitting. Below is the equation of a parabola, and the goal is to determine the coefficients $a, b$ and $c$ .

y = ax^2 + bx + c

To be brief the fundamental math is omitted, instead just the final solution is shown. By constructing the following matrix and then solving the resulting equations, the coefficients ( $a, b, c$ ) can easily be found:

\left[ \begin{array}{ccc} \sum x_i^4 & \sum x_i^3 & \sum x_i^2 \\ \sum x_i^3 & \sum x_i^2 & \sum x_i \\ \sum x_i^2 & \sum x_i & n \\ \end{array} \right] \left[ \begin{array}{c} a \\ b \\ c \end{array} \right] = \left[ \begin{array}{c} \sum x_i^2y_i \\ \sum x_iy_i \\ \sum y_i \end{array} \right]

Also the R² value can be found by:

R^2 = 1\ -\ \frac{\sum \left(y_i\ -\ (ax_i^2 + bx_i + c)\right)^2}{\sum \left(y_i -\bar{y}\right)^2}

Python Solution…

There is no premade quadratic regression function in NumPy or SciKit learn since there is a better and more generic solution that works for any polynomial (which will be examined a bit further in the next section).

However, NumPy makes it very easy to solve the above equations since it contains some nice of matrix functions.

sum_x  = 0
sum_x2 = 0
sum_x3 = 0
sum_x4 = 0

sum_y   = 0
sum_xy  = 0
sum_x2y = 0

for point in points:
    sum_x  += point[0]
    sum_x2 += point[0] ** 2
    sum_x3 += point[0] ** 3
    sum_x4 += point[0] ** 4
        
    sum_y   += point[1]
    sum_xy  += point[1] * point[0]
    sum_x2y += point[1] * point[0] ** 2

A = np.matrix ( [ [ sum_x4, sum_x3, sum_x2 ], \
                  [ sum_x3, sum_x2, sum_x  ], \
                  [ sum_x2, sum_x,  len(points) ] ] )

B = np.matrix ( [ sum_x2y, sum_xy, sum_y ] )

Ainv = np.linalg.inv (A)

coefs = B * Ainv 

print ('a (x^2) --> ', coefs.item(0))
print ('b (x^1) --> ', coefs.item(1))
print ('c (x^0) --> ', coefs.item(2))

Javascript Solution…

A small app was made using Javascript so one can interactively see the fitting results (click here to view it in a separate window). To calculate the matrices, the math.js library was used.

Polynomial Regression/Fitting…

Polynomial fitting is a bit more generic since the goal is to try to fit a polynomial of whatever specified degree ( $n$ ) to determine the values of each $\alpha$ ) to best fit the data points.

y = \alpha_n x^n + \alpha_{n-1} x^{n-1} + \cdots + \alpha_{1} x +\alpha_{0}

Vandermonde Matrix Technique

There are a few different ways to solve such a problem, perhaps the most efficient is to use the Vandermonde matrix (VM) based technique. First the VM matrix is found by taking powers from 0 to the required degree of all $m$ given $x$ values:

V = \left[ \begin{array}{cccc} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^{n-1} \end{array} \right]

Then using the VM matrix, the $\alpha$ values can be determined by some matrix manipulation. Note that both $\alpha$ and $y$ are column vectors of lengths $n$ and $m$ respectively.

\begin{aligned} y &= V \alpha \\ \alpha &= \left(V^TV\right)^{−1}V^𝑇𝑦 \end{aligned}

A python solution can easily be made using Numpy’s matrix related functions. A Javascript based solution is given below:

Numpy Solution

Numpy polyfit function finds the polynomial of best fit using the Vandermonde matrix. Hence, things are a bit easier than the previous solution:

import numpy as np
coefs = np.polynomial.polynomial.polyfit (np.array(xPoints),
                                          np.array(yPoints),
                                          maxDegree)

for i in range(len(coefs)):
    print (f'x^{i}  -->  {coefs[i]}')

SciPy Solution

SciPy does contain the curve_fit function which can be used to find the a quadratic function of best fit (as seen in the code below). It does not use the above matrix technique, rather it uses non-linear least squares to fit a given function to the data. Note that the function can be anything other than just a polynomial.

import scipy.optimize 

def QuadraticFunc (x, a, b, c):
        return  a * x**2 +  b * x  +  c

results = scipy.optimize.curve_fit (QuadraticFunc, xPoints, yPoints)
coefs = results[0]

print ('x^0 --> ', coefs[2])
print ('x^1 --> ', coefs[1])
print ('x^2 --> ', coefs[0])

SciKit Learn

SciKit Learn also uses non-linear least squares, however the code is slightly more complicated since it uses two elements: PolynomialFeatures and LinearRegression….

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

polyFeatures = PolynomialFeatures (degree=maxDegree, include_bias=False)
polyX = polyFeatures.fit_transform (xPoints)
model = LinearRegression (fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

model.fit (polyX, yPoints) 

print (f'x^0  -->  {model.intercept_.item(0)}')
for i in range (len(model.coef_)+1):
    print (f'x^{i+1}  -->  {model.coef_.item(i)}')

Gradient Descent

Gradient descent is a numerical technique used to perform optimization on a particular function (the objective/cost/loss function). The main goal is to adjust the parameters so that the cost function is minimized. While there are a variety of numeric optimization techniques out there, gradient descent is fairly simple and so it’s easy to implement. However there are some problems with it, namely the cost function needs to be convex, and the learning rate cannot be so large that it steps over a possible solution.

Recall that we are trying to find the polynomial $P$ that best fits the dataset. Note that is is up to the user to know what maximum number of degrees ( $n$ ) to use so that the results do not under or overfit the data.

P(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + x_0

Next the loss (or cost) function ( $\ell$ ) that is used is simply the mean squared error function and is defined to as:

\ell = \frac{1}{2m}\sum_{i = 1} ^m\left( P(x_i)\ -\ y_i\right)^2

Two things to note about $\ell$ are: it is computed for all the given points in the dataset (from 1 to $m$ ); next there is an extra 2 in the denominator so that things are slightly neater when the derivative is computed.

During each step in the gradient descent processes, each $a$ coefficient value is recomputed using its current value along with the derivative of the loss function and the learning rate (or $\alpha$ ) in the following manner:

_{\mathsf{next}}a_j\ =\ _{\mathsf{current}}a_j\ -\ \alpha \frac{\partial \ell}{\partial a_j}\quad\quad \mathrm{Where}\ j = \{0, 1, \ldots, n\}

Finding the derivative (or gradient) of the loss function we get:

\begin{aligned} \frac{\partial \ell}{\partial a_j} &= \frac{\partial}{\partial a_j} \frac{1}{2m}\sum_{i = 1} ^m\left( P(x_i)\ -\ y_i\right)^2 \\ &= \frac{1}{m} \sum_{i = 1} ^m \left( P(x_i)\ -\ y_i\right) \cdot \frac{\partial P}{\partial a_j} \\ &= \frac{1}{m} \sum_{i = 1} ^m \left( P(x_i)\ -\ y_i\right) \cdot x^j \quad\quad j = \{0, 1, \ldots, n\} \end{aligned}

Therefore, at each gradient descent step, for all coefficient values we need to do the following calculations:

\begin{aligned} _{\mathsf{next}}a_0 &=\ _{\mathsf{current}}a_0\ -\ \frac{\alpha}{m} \sum_{i=1}^m \left( P(x_i)\ -\ y_i\right) \cdot 1\\ _{\mathsf{next}}a_1 &=\ _{\mathsf{current}}a_1\ -\ \frac{\alpha}{m} \sum_{i=1}^m \left( P(x_i)\ -\ y_i\right) \cdot x_i \\ \vdots\quad &=\quad \vdots \quad\quad\quad \vdots \quad\quad\quad \vdots \quad\quad\quad \vdots \quad\quad\quad \vdots \\ _{\mathsf{next}}a_{n-1} &=\ _{\mathsf{current}}a_{n-1}\ -\ \frac{\alpha}{m} \sum_{i=1}^m \left(P(x_i)\ -\ y_i\right) \cdot x_i^{n-1} \\ _{\mathsf{next}}a_n &=\ _{\mathsf{current}}a_n\ -\ \frac{\alpha}{m} \sum_{i=1}^m \left(P(x_i)\ -\ y_i\right) \cdot x_i^n \\ \end{aligned}

TensorFlow Solution

There are a few components in Tensorflow that can make solving such regression problems in a similar manner as the gradient descent technique. Instead of computing the derivative ourselves, TensorFlow 2 has the gradientTape functionality that computes the gradient numerically for all coefficients.

The main steps of with TensorFlow in solving a polynomial fitting problem are listed below:

import tensorflow as tf

coefs = tf.Variable ([0.0] * NUM_COEFFS, name='coefficients')

def PolyFunc (x):
   yValue = 0
   for i in range (numCoeffs): 
      yValue += coefs[i] * x ** i 
   return yValue

def CostFunc (yActual, yPred):
    return tf.reduce_mean (tf.square (yActual - yPred))

for epoch in range (MAX_EPOCHS+1):
   with tf.GradientTape () as tape:
      yPredicted = PolyFunc (xValues)
      cost = CostFunc (yPredicted, yValues)

   gradients = tape.gradient (cost, coefs)
   coefs.assign_sub (alpha * gradients)

for i in range(len(coefs.numpy().tolist())):
    print (f'x^{i} --> {coefs.numpy().tolist()[i]}')

Note that only a TensorFlow 2 solution was created. We are moving on from TensorFlow 1.

Logistic Regression

Logistic regression is perhaps one of the most important types of regression methods out there. It is used in the modelling the probability of a certain class or event happening such as pass/fail or classifying something (such if food is a hot dog and not hot dog).

Again logistic regression can be used to classify two categories, the sigmoid function ( $\sigma$ ) is used to produce two distinct classes from various input values. Note that for multi-category logistic regression, the softmax function is used instead:

\sigma = \frac{1}{1 + e^{-z}}

Where $z$ is a linear function comprised of weights ( $w_i$ ) of each feature ( $x_i$ ) and a bias ( $b$ ) term to adjust the fitting.

z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b

To keep things simple, all of our solutions only have one feature so we only need to solve for one weight value, that is:

z = w x + b

The goal with logistic regression is to determine the weights and bias terms which then assist in determining the classes/categories. Since the sigmoid function produces probabilities, then to determine what class a prediction is, the value of 0.5 is typically used; that is:

\begin{aligned} \sigma(x) <= 0.5\quad &\Rightarrow \quad 0\ \mathrm{or\ Class\ A}\ \mathrm{or\ False} \\ \sigma(x) > 0.5\quad &\Rightarrow \quad 1\ \mathrm{or\ Class\ B}\ \mathrm{or\ True} \\ \end{aligned}

In the following Javascript program, one can play around with the weight and bias and see how it impacts the resulting sigmoid curve. Another important thing to notice is that different solution values can produce the same prediction outcome after the 0.5 filtering value is applied. Finally, since logistic regression applies to cases with two classes, then any new points are ‘snapped’ to the $y = 0$ or $y = 1$ lines based on their proximity.

Gradient Descent Solution…

Manipulating the sigmoid function, one can quickly see that there is no analytical solution to logistic regression problems. Hence it needs to be solved numerically. Perhaps the most popular way is to use gradient descent.

The main difference when using gradient descent when solving logistic regression is the cost function. Instead of using the simple mean square error function, the cross entropy function is used in this case. It can simply be derived to give:

L_{CE}(y_{\mathsf{pred}},y_{\mathsf{act}}) = – \left[y_{\mathsf{act}} \log\underbrace{\sigma(wx + b)}_{y_{\mathsf{pred}}} + (1 -y_{\mathsf{act}}) \log \underbrace{(1 – \sigma(wx + b))}_{1\ -\ y_{\mathsf{pred}}} \right]

The derivative of $L_{CE}$ uses a few tricks related to logs and the sigmoid function to give:

\frac{\partial L_{CE}}{\partial w_j} = \left[ \sigma(wx + b)\ -\ y\right]x_j

SciKit Learn Solution…

SciKit Learn also uses gradient descent to solve the problem, however it makes things easy since no derivatives need to be determined by hand or any learning rate parameters need to be tweaked. Everything is nice and clean and contained in the LogisticRegression object.

import numpy as np
from sklearn.linear_model import LogisticRegression
    
logReg = LogisticRegression (random_state=0, solver='lbfgs')
logReg.fit (X, y)

print (f'Coefs: {logReg.coef_}')
print (f'Intercept: {logReg.intercept_}')

print (f'Actual  Results: {y}')
print (f'Predict Results: {logReg.predict (X)}')
print (f'Model Score: {logReg.score (X, y)}')

StatsModel Solution…

The StatsModels package also contains a logistic regression solution however it isn’t as nice as the SciKit Learn version. The main problem is that by default it assumes for the bias to be 0.5. To change this, the add_constant function needs to be called to add a column of ones to the dataset.

import numpy as np
import statsmodels.api as sm

X, y = loadData ( <DataFile> )

# uncomment to add a constant to allow for a different x-intercept
# X = sm.add_constant (X, prepend=True)

# solve and report the results...
logit = sm.Logit (y, X).fit()

print (logit.summary())
print (logit.predict (X))
print (logit.predict (X) > 0.5)
print (y)

TensorFlow2 Solution…

TensorFlow2 has a built-in sigmoid function that is used as an activation function. However, for practice a gradient descent technique was used for our solution, as seen below…

import tensorflow as tf
import numpy as np

# Key parameters
ALPHA      = 0.001
MAX_EPOCHS = 2000

# Weight and bias
tf.random.set_seed (2021)
W = tf.Variable (tf.random.normal ([1], mean=0.0))
b = tf.Variable (tf.random.normal ([1], mean=0.0))

# Hypothesis/Prediction Function
def predict (X):
    z = tf.math.multiply (X, W) + b
    pred = 1 / (1 + tf.exp(-z))
    return pred 

# Training... that is solve using gradient descent
for e in range (MAX_EPOCHS+1):
    with tf.GradientTape() as tape:
        hypothesis = predict (X)
        cost = tf.reduce_mean (-tf.reduce_sum (y * tf.math.log (hypothesis) + (1 - y) * tf.math.log (1 - hypothesis)))        
            W_grad, b_grad = tape.gradient (cost, [W, b])
            W.assign_sub (ALPHA * W_grad)
            b.assign_sub (ALPHA * b_grad)
 
# report results...
results = predict (X).numpy().tolist()
for i, value in enumerate (results):
    print (f'Probability: {value[0]:.5f} -->  {(value[0] > 0.5)}   Actual: {y[i] > 0.5}')

In the code, only one weight is used; if multiple weights are needed then the following changes need to be made (to make sure that matrices are used):

# N needs to be number of weights
W = tf.Variable (tf.random.normal ([N,1], mean=0.0))

# and when computing z...
z = tf.matmul (X, W) + b

Code…

All the code for this project (both in Python and Javascript) is available here.

DataScience Javascript Python