Chapter 11 - Linear Regression¶
In this chapter we use visualizations to understand Linear Regression.
We assume a linear relationship between variables X and Y of the form:
y = 2 * x + 1
This means that each time x increases by one unit, y increases by two units. This is a positive linear relationship and can be visualized as a straight, upward-sloping line. We consider this equation to represent the underlying theory about X and Y.
However, in reality, when we measure X and Y (for example, a person’s height and weight), the instruments we use may not have 100% precision, and our measuring and reading may not be 100% accurate. This imprecision and inaccuracy lead to errors in the collected data.
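As a quick numeric check of the slope (the helper `f` below is just for illustration), evaluating the equation at consecutive values of x shows the constant two-unit increase in y:

```python
# A quick numeric check of the equation y = 2*x + 1:
# each one-unit increase in x raises y by exactly two units.
def f(x):
    return 2 * x + 1

print(f(3) - f(2))   # 2
print(f(10) - f(9))  # 2
```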
import numpy as np
import plotly.graph_objects as go
Generate Sample Data¶
First, let’s use NumPy to generate some random samples for X and calculate Y from X using the equation:
x= np.random.randint(low=1, high=10, size=100) # random integer between 1 and 10 (10 is not included)
y = x * 2 + 1
print(x[:10]) # print the first 10 x's
print(y[:10]) # print the first 10 y's
[8 7 1 4 7 9 2 7 5 7]
[17 15 3 9 15 19 5 15 11 15]
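Note that `np.random.randint` draws fresh values on every run, so your output will differ from the one shown. For reproducible samples you can seed NumPy's newer `Generator` API instead (a sketch; the seed 42 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed 42 is arbitrary
x = rng.integers(low=1, high=10, size=100)  # 1..9 inclusive, 10 excluded
y = x * 2 + 1

print(x.min() >= 1 and x.max() <= 9)  # True
print(np.all(y % 2 == 1))             # True: 2*x is even, so adding 1 makes y odd
```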
Visualize the sample data¶
This shows a straight line since we did not account for any measurement errors.
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers+lines",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Add errors to the equation¶
We use NumPy’s normal distribution to generate random errors. We assume the mean (location) of the errors is 0 and the standard deviation (scale) is 1.
error = np.random.normal(loc=0, scale=1, size=100)
y = x * 2 + 1 + error
print(x[:10]) # print the first 10 x's
print(y[:10]) # print the first 10 y's
[8 7 1 4 7 9 2 7 5 7]
[18.23333014 15.54704374 2.29521628 7.21014728 15.38158294 19.13404636
4.93348762 14.89073756 10.60686126 15.3214106 ]
Visualize the sample data¶
Now we no longer see a straight line; there is considerable variation in Y for any given X.
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Add the line to the chart¶
trace_1 = {
"x":[0,10],
"y":[1,21],
"type":"scatter",
"mode":"lines",
"name":"Theoretical"
}
fig.add_trace(trace_1)
fig.show()
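The endpoints `[1, 21]` in `trace_1` come from evaluating y = 2x + 1 at the edges of the plotted range, x = 0 and x = 10:

```python
# Endpoints of the theoretical line y = 2*x + 1
# at the edges of the plotted range, x = 0 and x = 10.
xs = [0, 10]
ys = [2 * x + 1 for x in xs]
print(ys)  # [1, 21]
```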
Plot the distribution of errors¶
fig_error = go.Figure()
trace_0 = {
"x":error,
"type":"histogram"
}
fig_error.add_trace(trace_0)
fig_error.show()
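The histogram should look roughly bell-shaped and centered on 0. We can also check the sample statistics directly; with only 100 draws they hover around, but rarely equal, the true parameters (a sketch; exact values vary from run to run):

```python
import numpy as np

error = np.random.normal(loc=0, scale=1, size=100)

# Sample statistics approximate the true mean (0) and
# standard deviation (1) of the generating distribution.
print(round(error.mean(), 2))  # close to 0
print(round(error.std(), 2))   # close to 1
```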
Let’s add another independent variable¶
x= np.random.randint(low=1, high=10, size=100) # random integer between 1 and 10 (10 is not included)
x2= np.random.randint(low=-5, high=6, size=100) # random integer between -5 and 5 (6 is not included)
y = 2 * x + 1 * x2 + 10
print(x[:10]) # print the first 10 x's
print(x2[:10]) # print the first 10 x2's
print(y[:10]) # print the first 10 y's
[2 5 4 8 9 8 6 8 2 3]
[ 4 -1 2 -3 1 -4 -1 5 2 5]
[18 19 20 23 29 22 21 31 16 21]
Visualize x and y¶
fig = go.Figure()
trace_0 = {
"x":x,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X"},
yaxis={"title":"Y"}
)
fig.show()
Visualize x2 and y¶
We can see that the slope of the trend between x2 and y is not as steep as that between x and y. This is because the coefficient of x2 (1) is smaller than that of x (2).
fig = go.Figure()
trace_0 = {
"x":x2,
"y":y,
"type":"scatter",
"mode":"markers",
"name":"Observed"
}
fig.add_trace(trace_0)
fig.update_layout(
title="Linear Regression",
xaxis={"title":"X2"},
yaxis={"title":"Y"}
)
fig.show()
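The visual difference in steepness can be quantified. For a simple regression of y on a single variable v, the fitted slope is cov(v, y) / var(v); since x and x2 are generated independently, each slope lands near its true coefficient (a sketch; the seed is used only for reproducibility, and `slope` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility
x = rng.integers(low=1, high=10, size=100)
x2 = rng.integers(low=-5, high=6, size=100)
y = 2 * x + 1 * x2 + 10

# Simple-regression slope of y on one variable v: cov(v, y) / var(v)
def slope(v, y):
    return np.cov(v, y)[0, 1] / np.var(v, ddof=1)

print(round(slope(x, y), 2))   # near 2
print(round(slope(x2, y), 2))  # near 1
```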
Perform OLS regression analysis¶
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(x))
results = model.fit()
print(results.summary())
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(x2))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.364
Model: OLS Adj. R-squared: 0.357
Method: Least Squares F-statistic: 56.01
Date: Tue, 31 Aug 2021 Prob (F-statistic): 3.15e-11
Time: 11:30:53 Log-Likelihood: -301.02
No. Observations: 100 AIC: 606.0
Df Residuals: 98 BIC: 611.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 19.2747 0.499 38.597 0.000 18.284 20.266
x1 1.1162 0.149 7.484 0.000 0.820 1.412
==============================================================================
Omnibus: 37.529 Durbin-Watson: 2.178
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.339
Skew: 0.080 Prob(JB): 0.0420
Kurtosis: 1.777 Cond. No. 3.38
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
import statsmodels.api as sm
model = sm.OLS(y,sm.add_constant(list(zip(x,x2))))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.532e+30
Date: Tue, 31 Aug 2021 Prob (F-statistic): 0.00
Time: 11:26:53 Log-Likelihood: 2982.6
No. Observations: 100 AIC: -5959.
Df Residuals: 97 BIC: -5951.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.0000 5.86e-15 1.71e+15 0.000 10.000 10.000
x1 2.0000 1.11e-15 1.8e+15 0.000 2.000 2.000
x2 1.0000 8.25e-16 1.21e+15 0.000 1.000 1.000
==============================================================================
Omnibus: 24.013 Durbin-Watson: 1.215
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.431
Skew: 0.100 Prob(JB): 0.0662
Kurtosis: 1.876 Cond. No. 11.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
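Because no error term was added to y in this two-variable example, OLS recovers the true coefficients exactly, which is why R-squared is 1.000. The same exact fit can be confirmed with plain NumPy least squares (a sketch; `np.column_stack` builds the design matrix with an explicit intercept column of ones):

```python
import numpy as np

x = np.random.randint(low=1, high=10, size=100)
x2 = np.random.randint(low=-5, high=6, size=100)
y = 2 * x + 1 * x2 + 10

# Design matrix: a column of ones (intercept), then x and x2
X = np.column_stack((np.ones_like(x), x, x2))
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))  # recovers [10, 2, 1] exactly
```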