Chapter 11 - Linear Regression¶
In this chapter we use visualizations to understand Linear Regression.
We assume a linear relationship between variables X and Y of the form:
y = 2 * x + 1
This means that each time x increases by one unit, y increases by two units. This is a positive linear relationship and can be visualized as a straight, upward-sloping line. We consider this equation to represent the underlying theory about X and Y.
However, in reality, when we measure X and Y (for example, a person’s height and weight), the instruments we use may not have 100% precision, and our measuring and reading may not be 100% accurate. This imprecision and inaccuracy lead to errors in the collected data.
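As a quick numeric check of the slope (the helper `f` below is just for illustration), evaluating the equation at consecutive values of x shows the constant two-unit increase in y:

```python
# A quick numeric check of the equation y = 2*x + 1:
# each one-unit increase in x raises y by exactly two units.
def f(x):
    return 2 * x + 1

print(f(3) - f(2))   # 2
print(f(10) - f(9))  # 2
```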
import numpy as np
import plotly.graph_objects as go
Generate Sample Data¶
First, let’s use NumPy to generate some random samples for X and calculate Y from X using the equation:
x= np.random.randint(low=1, high=10, size=100) # random integer between 1 and 10 (10 is not included)
y = x * 2 + 1
print(x[:10]) # print the first 10 x's
print(y[:10]) # print the first 10 y's
[8 7 1 4 7 9 2 7 5 7]
[17 15 3 9 15 19 5 15 11 15]
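Note that `np.random.randint` draws fresh values on every run, so your output will differ from the one shown. For reproducible samples you can seed NumPy's newer `Generator` API instead (a sketch; the seed 42 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed 42 is arbitrary
x = rng.integers(low=1, high=10, size=100)  # 1..9 inclusive, 10 excluded
y = x * 2 + 1

print(x.min() >= 1 and x.max() <= 9)  # True
print(np.all(y % 2 == 1))             # True: 2*x is even, so adding 1 makes y odd
```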
Visualize the sample data¶
This shows a straight line since we did not account for any measurement errors.
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers+lines",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Add errors to the equation¶
We use NumPy’s normal distribution to generate random errors. We assume the mean (location) of the errors is 0 and the standard deviation (scale) is 1.
error = np.random.normal(loc=0, scale=1, size=100)
y = x * 2 + 1 + error
print(x[:10]) # print the first 10 x's
print(y[:10]) # print the first 10 y's
[8 7 1 4 7 9 2 7 5 7]
[18.23333014 15.54704374 2.29521628 7.21014728 15.38158294 19.13404636
4.93348762 14.89073756 10.60686126 15.3214106 ]
Visualize the sample data¶
Now we no longer see a straight line; there is considerable variation in Y for any given X.
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Add the line to the chart¶
trace_1 = {
"x":[0,10],
"y":[1,21],
"type":"scatter",
"mode":"lines",
"name":"Theoretical"
}
fig.add_trace(trace_1)
fig.show()
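The endpoints `[1, 21]` in `trace_1` come from evaluating y = 2x + 1 at the edges of the plotted range, x = 0 and x = 10:

```python
# Endpoints of the theoretical line y = 2*x + 1
# at the edges of the plotted range, x = 0 and x = 10.
xs = [0, 10]
ys = [2 * x + 1 for x in xs]
print(ys)  # [1, 21]
```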
Plot the distribution of errors¶
fig_error = go.Figure()
trace_0 = {
"x":error,
"type":"histogram"
}
fig_error.add_trace(trace_0)
fig_error.show()
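The histogram should look roughly bell-shaped and centered on 0. We can also check the sample statistics directly; with only 100 draws they hover around, but rarely equal, the true parameters (a sketch; exact values vary from run to run):

```python
import numpy as np

error = np.random.normal(loc=0, scale=1, size=100)

# Sample statistics approximate the true mean (0) and
# standard deviation (1) of the generating distribution.
print(round(error.mean(), 2))  # close to 0
print(round(error.std(), 2))   # close to 1
```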
Let’s add another independent variable¶
x= np.random.randint(low=1, high=10, size=100) # random integer between 1 and 10 (10 is not included)
x2= np.random.randint(low=-5, high=6, size=100) # random integer between -5 and 5 (6 is not included)
y = 2 * x + 1 * x2 + 10
print(x[:10]) # print the first 10 x's
print(x2[:10]) # print the first 10 x2's
print(y[:10]) # print the first 10 y's
[2 5 4 8 9 8 6 8 2 3]
[ 4 -1 2 -3 1 -4 -1 5 2 5]
[18 19 20 23 29 22 21 31 16 21]
Visualize x and y¶
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Visualize x2 and y¶
We can see that the slope of the trend between x2 and y is not as steep as that between x and y. This is because the coefficient of x2 (1) is smaller than that of x (2).
fig = go.Figure()
trace_0 = {
"x":x2,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X2"},
yaxis={"title":"Y"}
)
fig.show()
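The visual difference in steepness can be quantified. For a simple regression of y on a single variable v, the fitted slope is cov(v, y) / var(v); since x and x2 are generated independently, each slope lands near its true coefficient (a sketch; the seed is used only for reproducibility, and `slope` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility
x = rng.integers(low=1, high=10, size=100)
x2 = rng.integers(low=-5, high=6, size=100)
y = 2 * x + 1 * x2 + 10

# Simple-regression slope of y on one variable v: cov(v, y) / var(v)
def slope(v, y):
    return np.cov(v, y)[0, 1] / np.var(v, ddof=1)

print(round(slope(x, y), 2))   # near 2
print(round(slope(x2, y), 2))  # near 1
```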
Perform OLS regression analysis¶
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(x))
results = model.fit()
print(results.summary())
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(x2))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.364
Model: OLS Adj. R-squared: 0.357
Method: Least Squares F-statistic: 56.01
Date: Tue, 31 Aug 2021 Prob (F-statistic): 3.15e-11
Time: 11:30:53 Log-Likelihood: -301.02
No. Observations: 100 AIC: 606.0
Df Residuals: 98 BIC: 611.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 19.2747 0.499 38.597 0.000 18.284 20.266
x1 1.1162 0.149 7.484 0.000 0.820 1.412
==============================================================================
Omnibus: 37.529 Durbin-Watson: 2.178
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.339
Skew: 0.080 Prob(JB): 0.0420
Kurtosis: 1.777 Cond. No. 3.38
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(list(zip(x,x2))))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.532e+30
Date: Tue, 31 Aug 2021 Prob (F-statistic): 0.00
Time: 11:26:53 Log-Likelihood: 2982.6
No. Observations: 100 AIC: -5959.
Df Residuals: 97 BIC: -5951.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.0000 5.86e-15 1.71e+15 0.000 10.000 10.000
x1 2.0000 1.11e-15 1.8e+15 0.000 2.000 2.000
x2 1.0000 8.25e-16 1.21e+15 0.000 1.000 1.000
==============================================================================
Omnibus: 24.013 Durbin-Watson: 1.215
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.431
Skew: 0.100 Prob(JB): 0.0662
Kurtosis: 1.876 Cond. No. 11.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
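Because no error term was added to y in this two-variable example, OLS recovers the true coefficients exactly, which is why R-squared is 1.000. The same exact fit can be confirmed with plain NumPy least squares (a sketch; `np.column_stack` builds the design matrix with an explicit intercept column of ones):

```python
import numpy as np

x = np.random.randint(low=1, high=10, size=100)
x2 = np.random.randint(low=-5, high=6, size=100)
y = 2 * x + 1 * x2 + 10

# Design matrix: a column of ones (intercept), then x and x2
X = np.column_stack((np.ones_like(x), x, x2))
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))  # recovers [10, 2, 1] exactly
```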