Curve Fitting: Extension of Linear Regression

In the previous section, the model function y was linear in x. However, we can have a model function y that is linear in the unknown coefficients but non-linear in x. In a general sense, the model function y can be composed of m terms with the following form:

    \[y(x)=a_1f_1(x)+a_2f_2(x)+a_3f_3(x)+\cdots+a_mf_m(x)=\sum_{j=1}^ma_jf_j(x)\]

Note that the linear regression model can be viewed as a special case of this general form with only two functions: f_1(x)=1 and f_2(x)=x. Another special case of this general form is polynomial regression, in which the model function is a linear combination of the monomials 1,x,x^2,\cdots,x^m and thus has m+1 coefficients:

    \[y(x)=a_0+a_1x+a_2x^2+a_3x^3+\cdots+a_mx^m\]

The regression procedure consists of finding the coefficients a_j that yield the least sum of squared differences between the data and the model prediction. Given a set of data points (x_i,y_i) with 1\leq i \leq n, and denoting by S the sum of the squared differences between a general linear regression model and the data, S has the form:

    \[S=\sum_{i=1}^n\left(y(x_i)-y_i\right)^2=\sum_{i=1}^n\left(\sum_{j=1}^ma_jf_j(x_i)-y_i\right)^2\]

To find the minimizers of S, the derivative of S with respect to each of the coefficients a_j can be equated to zero. Taking the derivative of S with respect to an arbitrary coefficient a_k and equating it to zero yields the general equation:

    \[\frac{\partial S}{\partial a_k}=\sum_{i=1}^n\left(2\left(\sum_{j=1}^ma_jf_j(x_i)-y_i\right)f_k(x_i)\right)=0\]
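
Dividing by 2 and exchanging the order of summation moves the data term to the right-hand side, giving, for each k=1,2,\cdots,m:

    \[\sum_{j=1}^m\left(\sum_{i=1}^nf_j(x_i)f_k(x_i)\right)a_j=\sum_{i=1}^ny_if_k(x_i)\]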


Collecting these equations for k=1,2,\cdots,m yields a system of m equations in the m unknowns a_1,a_2,\cdots,a_m, which in matrix form reads:

    \[\begin{split}\left(\begin{matrix}\sum_{i=1}^nf_1(x_i)f_1(x_i)&\sum_{i=1}^nf_1(x_i)f_2(x_i)&\cdots & \sum_{i=1}^nf_1(x_i)f_m(x_i)\\\sum_{i=1}^nf_2(x_i)f_1(x_i)&\sum_{i=1}^nf_2(x_i)f_2(x_i)&\cdots & \sum_{i=1}^nf_2(x_i)f_m(x_i)\\\vdots&\vdots&\ddots & \vdots\\\sum_{i=1}^nf_m(x_i)f_1(x_i)&\sum_{i=1}^nf_m(x_i)f_2(x_i)&\cdots & \sum_{i=1}^nf_m(x_i)f_m(x_i)\end{matrix}\right)&\left(\begin{array}{c}a_1\\a_2\\\vdots\\a_m\end{array}\right)\\&=\left(\begin{array}{c}\sum_{i=1}^ny_if_1(x_i)\\\sum_{i=1}^ny_if_2(x_i)\\\vdots\\\sum_{i=1}^ny_if_m(x_i)\end{array}\right)\end{split}\]

It can be shown that the above system has a unique solution provided that the functions f_j(x) are linearly independent on the data points (which requires, in particular, n\geq m). Solving these equations yields the best fit to the data, i.e., the coefficients a_j that minimize the sum of the squares of the differences between the model and the data. The coefficient of determination can be obtained as described above.
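
To make the procedure concrete, the following short Python sketch assembles and solves the above matrix system for an arbitrary list of basis functions; the helper name general_least_squares is an illustrative choice, not a library function:

import numpy as np

def general_least_squares(x, y, basis):
    # Design matrix F with entries F[i][j] = f_j(x_i)
    F = np.array([[f(xi) for f in basis] for xi in x])
    # Normal equations (F^T F) a = F^T y, i.e., the matrix system above
    A = F.T @ F
    b = F.T @ np.array(y)
    return np.linalg.solve(A, b)

# Linear regression as a special case: f_1(x)=1, f_2(x)=x
a = general_least_squares([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9],
                          [lambda x: 1.0, lambda x: x])
print(a)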

Example 1

Consider the data (1,0.5), (2,2.5), (3,2), (4,4), (5,3.5), (6,6), (7,5.5). Consider a model of the form

    \[y=a_1+a_2x+a_3 \cos{(\pi x)} \]


Find the coefficients a_1, a_2, and a_3 that would give the best fit.

Solution

This model is composed of a linear combination of three functions: f_1(x)=1, f_2(x)=x, and f_3(x)=\cos{(\pi x)}. To find the best coefficients, the following linear system of equations needs to be solved:

    \[\begin{split}\left(\begin{matrix}\sum_{i=1}^nf_1(x_i)f_1(x_i)&\sum_{i=1}^nf_1(x_i)f_2(x_i)& \sum_{i=1}^nf_1(x_i)f_3(x_i)\\\sum_{i=1}^nf_2(x_i)f_1(x_i)&\sum_{i=1}^nf_2(x_i)f_2(x_i)&\sum_{i=1}^nf_2(x_i)f_3(x_i)\\\sum_{i=1}^nf_3(x_i)f_1(x_i)&\sum_{i=1}^nf_3(x_i)f_2(x_i)&\sum_{i=1}^nf_3(x_i)f_3(x_i)\end{matrix}\right)&\left(\begin{array}{c}a_1\\a_2\\a_3\end{array}\right)\\&=\left(\begin{array}{c}\sum_{i=1}^ny_if_1(x_i)\\\sum_{i=1}^ny_if_2(x_i)\\\sum_{i=1}^ny_if_3(x_i)\end{array}\right)\end{split}\]

The following Microsoft Excel table is used to calculate the entries in the above matrix equation:

[Excel table: the sums of f_j(x_i)f_k(x_i) and y_if_k(x_i) over the seven data points]

Therefore, the linear system of equations can be written as:

    \[\left(\begin{matrix}7&28&-1\\28&140&-4\\-1&-4&7\end{matrix}\right) \left(\begin{array}{c}a_1\\a_2\\a_3\end{array}\right)=\left(\begin{array}{c}24\\119.5\\1\end{array}\right) \]

Solving the above system yields a_1=0.16369, a_2=0.83929, a_3=0.64583. Therefore, the best-fit model has the form:

    \[y(x)=0.16369+0.83929x+0.64583\cos{(\pi x)}\]
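
As a quick numerical check, the 3 by 3 system above can be solved directly; a minimal sketch using numpy:

import numpy as np

# Normal-equations system assembled in the Excel table above
A = np.array([[7, 28, -1],
              [28, 140, -4],
              [-1, -4, 7]])
b = np.array([24, 119.5, 1])
print(np.linalg.solve(A, b))  # [0.16369 0.83929 0.64583] (rounded)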

To find the coefficient of determination, the following Microsoft Excel table for the values of (y_i-y(x_i))^2 and \left(y_i-\frac{1}{n}\sum_{i=1}^ny_i\right)^2 is used:

[Excel table: values of (y_i-y(x_i))^2 and (y_i-\frac{1}{n}\sum_{i=1}^ny_i)^2 for the seven data points]
Therefore:

    \[R^2=1-\frac{0.13095}{22.71429}=0.994234 \]


The R^2 value is very close to 1, indicating a very good fit.

The Mathematica function LinearModelFit[Data,{functions},x] does the above computations and provides the required model. The equation of the model can be retrieved using the built-in function Normal and the R^2 can also be retrieved as shown in the code below. The final plot of the model vs. the data is shown below as well.

[Plot: the best-fit model vs. the data]

View Mathematica Code
Clear[a, b, x]
Data = {{1, 0.5}, {2, 2.5}, {3, 2}, {4, 4.0}, {5, 3.5}, {6, 6.0}, {7, 5.5}};
(* Fit a linear combination of the basis functions 1, x, and Cos[Pi x] *)
model = LinearModelFit[Data, {1, x, Cos[Pi*x]}, x]
(* Equation of the fitted model *)
y = Normal[model]
(* Coefficient of determination *)
R2 = model["RSquared"]
Plot[y, {x, 1, 7}, Epilog -> {PointSize[Large], Point[Data]}, PlotLegends -> {"Model"}, AxesLabel -> {"x", "y"}, AxesOrigin -> {0, 0}]
View Python Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

Data = [[1, 0.5], [2, 2.5], [3, 2], [4, 4.0], [5, 3.5], [6, 6.0], [7, 5.5]]
x = np.array([point[0] for point in Data])
y = np.array([point[1] for point in Data])

# Model function: linear in the coefficients a, b, c
def f(x, a, b, c): return a + b*x + c*np.cos(np.pi*x)

# Least-squares fit of the coefficients
coeff, covariance = curve_fit(f, x, y)
print("coeff: ", coeff)

# Plot the fitted model against the data
x_val = np.linspace(1, 7, 601)
plt.title("%.5f + %.5fx + %.5fcos(pi*x)" % tuple(coeff))
plt.plot(x_val, f(x_val, *coeff))
plt.scatter(x, y, c='k')
plt.xlabel("x"); plt.ylabel("y")
plt.grid(); plt.show()

# Coefficient of determination: R^2 = 1 - SS_res/SS_tot
y_fit = f(x, *coeff)
ss_res = np.sum((y - y_fit)**2)
ss_tot = np.sum((y - np.mean(y))**2)
r2 = 1 - (ss_res / ss_tot)
print("R Squared: ", r2)

The following links provide the MATLAB codes for implementing the generalized least squares. The function for the model in this example is provided in File 2.

Example 2

Fit a cubic polynomial to the data (1,1.93),(1.1,1.61),(1.2,2.27),(1.3,3.19),(1.4,3.19),(1.5,3.71),(1.6,4.29),(1.7,4.95),(1.8,6.07),(1.9,7.48),(2,8.72),(2.1,9.34),(2.2,11.62).

Solution

A cubic polynomial fit would have the form:

    \[y(x)=a_0+a_1x+a_2x^2+a_3x^3\]

This is a linear combination of the functions f_0(x)=1, f_1(x)=x, f_2(x)=x^2, and f_3(x)=x^3. The following system of equations needs to be formed to solve for the coefficients a_0, a_1, a_2, and a_3:

    \[\begin{split}\left(\begin{matrix}\sum_{i=1}^nf_0(x_i)f_0(x_i)&\sum_{i=1}^nf_0(x_i)f_1(x_i)&\sum_{i=1}^nf_0(x_i)f_2(x_i) & \sum_{i=1}^nf_0(x_i)f_3(x_i)\\\sum_{i=1}^nf_1(x_i)f_0(x_i)&\sum_{i=1}^nf_1(x_i)f_1(x_i)&\sum_{i=1}^nf_1(x_i)f_2(x_i) & \sum_{i=1}^nf_1(x_i)f_3(x_i)\\\sum_{i=1}^nf_2(x_i)f_0(x_i)&\sum_{i=1}^nf_2(x_i)f_1(x_i)&\sum_{i=1}^nf_2(x_i)f_2(x_i) & \sum_{i=1}^nf_2(x_i)f_3(x_i)\\\sum_{i=1}^nf_3(x_i)f_0(x_i)&\sum_{i=1}^nf_3(x_i)f_1(x_i)&\sum_{i=1}^nf_3(x_i)f_2(x_i) & \sum_{i=1}^nf_3(x_i)f_3(x_i)\\\end{matrix}\right)&\left(\begin{array}{c}a_0\\a_1\\a_2\\a_3\end{array}\right)\\&=\left(\begin{array}{c}\sum_{i=1}^ny_if_0(x_i)\\\sum_{i=1}^ny_if_1(x_i)\\\sum_{i=1}^ny_if_2(x_i)\\\sum_{i=1}^ny_if_3(x_i)\end{array}\right)\end{split}\]

The following Microsoft Excel tables are used to find the entries of the above linear system of equations:

[Excel tables: the sums of f_j(x_i)f_k(x_i) and y_if_k(x_i) over the thirteen data points]

Therefore, the linear system of equations can be written as:

    \[ \left(\begin{matrix}13&20.8&35.1&61.984\\20.8&35.1&61.984&113.607\\35.1&61.984&113.607&214.502\\61.984&113.607&214.502&414.623\end{matrix}\right)\left(\begin{array}{c}a_0\\a_1\\a_2\\a_3\end{array}\right)=\left(\begin{array}{c}68.37\\123.638\\231.4056\\444.8628\end{array}\right)\]

Solving the above system yields a_0=-1.6247, a_1=6.6301, a_2=-5.7273, and a_3=2.46212. Therefore, the best-fit model has the form:

    \[y=-1.6247+6.6301x-5.7273x^2+2.46212x^3\]
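
As with Example 1, the 4 by 4 system can be checked numerically; a minimal sketch using numpy (the table entries are rounded, so the recovered coefficients match the quoted values only up to that rounding):

import numpy as np

# Normal-equations system assembled in the Excel tables above
A = np.array([[13, 20.8, 35.1, 61.984],
              [20.8, 35.1, 61.984, 113.607],
              [35.1, 61.984, 113.607, 214.502],
              [61.984, 113.607, 214.502, 414.623]])
b = np.array([68.37, 123.638, 231.4056, 444.8628])
print(np.linalg.solve(A, b))  # approximately [-1.6247, 6.6301, -5.7273, 2.4621]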

The following graph shows the Microsoft Excel plot with the generated cubic trendline. The trendline equation generated by Excel is the same as the one above.

[Excel plot: the data with the generated cubic trendline and its equation]

To find the coefficient of determination, the following Microsoft Excel table for the values of (y_i-y(x_i))^2 and \left(y_i-\frac{1}{n}\sum_{i=1}^ny_i\right)^2 is used:
[Excel table: values of (y_i-y(x_i))^2 and (y_i-\frac{1}{n}\sum_{i=1}^ny_i)^2 for the thirteen data points]

Therefore:

    \[R^2=1-\frac{0.9506}{120.0129}=0.9921\]

This value agrees with the one produced by Microsoft Excel. Alternatively, the following Mathematica code performs the above computations to produce the model and its R^2.

View Mathematica Code
Data = {{1, 1.93}, {1.1, 1.61}, {1.2, 2.27}, {1.3, 3.19}, {1.4, 3.19}, {1.5, 3.71}, {1.6, 4.29}, {1.7, 4.95}, {1.8, 6.07}, {1.9, 7.48}, {2, 8.72}, {2.1, 9.34}, {2.2, 11.62}};
(* Fit a linear combination of the basis functions 1, x, x^2, and x^3 *)
model = LinearModelFit[Data, {1, x, x^2, x^3}, x]
(* Equation of the fitted model *)
y = Normal[model]
(* Coefficient of determination *)
R2 = model["RSquared"]
Plot[y, {x, 1, 2.2}, Epilog -> {PointSize[Large], Point[Data]}, PlotLegends -> {"Model"}, AxesLabel -> {"x", "y"}, AxesOrigin -> {0, 0}]
View Python Code
import numpy as np
import matplotlib.pyplot as plt

Data = [[1, 1.93], [1.1, 1.61], [1.2, 2.27], [1.3, 3.19], [1.4, 3.19], [1.5, 3.71], [1.6, 4.29], [1.7, 4.95], [1.8, 6.07], [1.9, 7.48], [2, 8.72], [2.1, 9.34], [2.2, 11.62]]
x = np.array([point[0] for point in Data])
y = np.array([point[1] for point in Data])

# Least-squares cubic fit; np.polyfit returns the coefficients
# ordered from the highest power to the constant term
coeff = np.polyfit(x, y, 3)
print("coeff: ", coeff)
p = np.poly1d(coeff)

# R squared from the residuals of the cubic model itself
# (stats.linregress would give the R^2 of a straight-line fit instead)
ss_res = np.sum((y - p(x))**2)
ss_tot = np.sum((y - np.mean(y))**2)
print("R Squared: ", 1 - ss_res/ss_tot)

# Plot the fitted model against the data
x_val = np.linspace(1, 2.2)
plt.plot(x_val, p(x_val), label="Model")
c = [str(round(ci, 4)) for ci in coeff]
plt.title(c[0]+"x**3 + "+c[1]+"x**2 + "+c[2]+"x + "+c[3])
plt.scatter(x, y, c='k')
plt.xlabel("x"); plt.ylabel("y")
plt.legend(); plt.grid(); plt.show()

The following links provide the MATLAB codes for implementing the cubic polynomial fit. The function for the model in this example is provided in File 2.

