# Chapter 11 - Linear Regression: Development vs Culture

In this chapter, we use visualizations to understand linear regression.

We assume a linear relationship between the variables X and Y of the form:

 `y = 2 * x + 1`

This means that each time x increases by one unit, y increases by two units. This is a positive linear relationship and can be visualized as a straight, upward-sloping line. We treat this equation as the underlying theory about X and Y.
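
To make the slope concrete, here is a minimal sketch that evaluates the equation at a few consecutive values of x. The helper name `theoretical_y` is hypothetical and is used only for illustration.

```python
def theoretical_y(x):
    """Return y for the assumed theoretical relationship y = 2 * x + 1."""
    return 2 * x + 1

# Evaluate the equation at consecutive x values.
xs = [0, 1, 2, 3]
ys = [theoretical_y(x) for x in xs]

# Each one-unit step in x changes y by the slope.
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(ys)     # [1, 3, 5, 7]
print(steps)  # [2, 2, 2]
```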

However, in reality, when we measure X and Y (for example, a person's height and weight), the instruments we use may not be perfectly precise, and our measurements and readings may not be perfectly accurate. This imprecision and inaccuracy introduce errors into the collected data.








In [None]:
import numpy as np
import plotly.graph_objects as go

## Generate sample data

First, let's use NumPy to generate some random samples for X and calculate Y from X using the equation:


In [None]:
x = np.random.randint(low=1, high=10, size=100)   # random integers from 1 to 9 (10 is not included)

y = x * 2 + 1

print(x[:10])      # print the first 10 x's
print(y[:10])      # print the first 10 y's

[5 9 3 8 9 1 8 4 9 5]
[11 19  7 17 19  3 17  9 19 11]
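
Because no random seed is set, the values printed above change on every run. If reproducible output is wanted, one option is NumPy's `Generator` API with a fixed seed. This is a sketch rather than the notebook's own approach: the seed value 42 is an arbitrary assumption, and `rng.integers` stands in for `np.random.randint`.

```python
import numpy as np

# Assumption: 42 is an arbitrary seed chosen only for illustration.
rng = np.random.default_rng(42)

# Same range as above: integers from 1 to 9 (high is exclusive).
x_seeded = rng.integers(low=1, high=10, size=100)
y_seeded = 2 * x_seeded + 1

# Re-creating the generator with the same seed reproduces the same sample.
x_again = np.random.default_rng(42).integers(low=1, high=10, size=100)

print(np.array_equal(x_seeded, x_again))  # True
```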


## Visualize the sample data

This shows a straight line since we did not account for any measurement errors.
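
One way to confirm numerically that the points lie exactly on a line is to fit a straight line and inspect the residuals. The sketch below regenerates error-free data with a seeded generator (the seed and the `_demo` names are assumptions for illustration) and uses `np.polyfit`, which is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: arbitrary seed
x_demo = rng.integers(low=1, high=10, size=100)
y_demo = 2 * x_demo + 1          # no measurement error

# Fit a degree-1 polynomial; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

# Residuals are the gaps between the data and the fitted line.
residuals = y_demo - (slope * x_demo + intercept)

print(round(slope, 6), round(intercept, 6))
print(np.allclose(residuals, 0))
```

With error-free data the fitted slope and intercept should match 2 and 1 up to floating-point rounding, and every residual should be effectively zero.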

In [None]:
fig = go.Figure()

trace_0 = {
    "x":x,
    "y":y,
    "type":"scatter",
    "mode":"markers+lines",
    "name":"Observed"
}

fig.add_trace(trace_0)

fig.update_layout(
    title="Linear Regression",
    xaxis={"title":"X"},
    yaxis={"title":"Y"}
)

fig.show()

## Add errors to the equation

We use NumPy's normal distribution to generate random errors. We assume the mean (location) of the errors is 0 and the standard deviation (scale) is 1.
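
In equation form, the data-generating process described here can be written as follows. The symbol $\varepsilon$ for the error term is notation introduced for this sketch; the notebook itself calls it `error`.

$$
y = 2x + 1 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\ 1^2)
$$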

In [None]:
error = np.random.normal(loc=0, scale=1, size=100)

y = x * 2 + 1 + error

print(x[:10])      # print the first 10 x's
print(y[:10])      # print the first 10 y's

[5 9 3 8 9 1 8 4 9 5]
[11.54283446 20.45395529  6.64919017 15.88312417 19.86741109  2.80419728
 16.83831521 10.34320749 18.60596667  8.07504056]


## Visualize the sample data with errors

Now we no longer see a straight line. There is considerable variation in Y for any given X.
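
To put a number on that spread, the sketch below groups the noisy Y values by X and reports the mean and standard deviation within each group. It regenerates its own data with a seeded generator, so the seed and the `_demo` names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: arbitrary seed
x_demo = rng.integers(low=1, high=10, size=100)
error_demo = rng.normal(loc=0, scale=1, size=100)
y_demo = 2 * x_demo + 1 + error_demo

# For each distinct x, summarize the y values observed at that x.
for value in np.unique(x_demo):
    y_at_value = y_demo[x_demo == value]
    print(
        f"x={value}: n={y_at_value.size}, "
        f"mean y={y_at_value.mean():.2f}, std y={y_at_value.std():.2f}"
    )

# Overall spread of y around the theoretical line y = 2 * x + 1.
overall_std = np.std(y_demo - (2 * x_demo + 1))
print(f"overall std around the line: {overall_std:.2f}")
```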

In [None]:
fig = go.Figure()

trace_0 = {
    "x":x,
    "y":y,
    "type":"scatter",
    "mode":"markers",
    "name":"Observed"
}

fig.add_trace(trace_0)

fig.update_layout(
    title="Linear Regression",
    xaxis={"title":"X"},
    yaxis={"title":"Y"}
)

fig.show()

## Add the theoretical line to the chart

In [None]:
trace_1 = {
    "x":[0,10],
    "y":[1,21],
    "type":"scatter",
    "mode":"lines",
    "name":"Theoretical"
}

fig.add_trace(trace_1)

fig.show()
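
The theoretical line is drawn from coefficients we already know. With real data they would have to be estimated, so as a quick check, the sketch below fits a line to noisy data with `np.polyfit` and compares the estimates with 2 and 1. The seed and the `_demo` names are assumptions, and `np.polyfit` is a stand-in here, not the method the chapter uses later.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumption: arbitrary seed
x_demo = rng.integers(low=1, high=10, size=100)
y_demo = 2 * x_demo + 1 + rng.normal(loc=0, scale=1, size=100)

# Least-squares fit of a straight line; returns [slope, intercept].
slope_hat, intercept_hat = np.polyfit(x_demo, y_demo, deg=1)

print(f"estimated slope:     {slope_hat:.3f} (theoretical 2)")
print(f"estimated intercept: {intercept_hat:.3f} (theoretical 1)")
```

The estimates should be close to the theoretical values but will generally not equal them, because the errors shift the best-fitting line slightly.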

## Plot the distribution of errors

In [None]:
fig_error = go.Figure()

trace_0 = {
    "x":error,
    "type":"histogram"
}

fig_error.add_trace(trace_0)

fig_error.show()
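
The histogram should look roughly bell-shaped and centered near 0. A numerical companion to the chart is to compare the sample mean and standard deviation with the values passed to `np.random.normal`. The sketch draws its own errors from a seeded generator; the seed and the name `error_demo` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumption: arbitrary seed
error_demo = rng.normal(loc=0, scale=1, size=100)

sample_mean = error_demo.mean()
sample_std = error_demo.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"sample mean: {sample_mean:.3f} (target 0)")
print(f"sample std:  {sample_std:.3f} (target 1)")
```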

## Let's add another independent variable

In [None]:
x = np.random.randint(low=1, high=10, size=100)   # random integers from 1 to 9 (10 is not included)

x2 = np.random.randint(low=-5, high=6, size=100)   # random integers from -5 to 5 (6 is not included)

y = 2 * x + 1 * x2 + 10

print(x[:10])      # print the first 10 x's
print(x2[:10])     # print the first 10 x2's
print(y[:10])      # print the first 10 y's

[5 7 2 2 3 3 7 2 6 5]
[-3  3 -1 -4  4  4  5 -3  3  2]
[17 27 13 10 20 20 29 11 25 22]


## Visualize x and y

In [None]:
fig = go.Figure()

trace_0 = {
    "x":x,
    "y":y,
    "type":"scatter",
    "mode":"markers",
    "name":"Observed"
}

fig.add_trace(trace_0)

fig.update_layout(
    title="Linear Regression",
    xaxis={"title":"X"},
    yaxis={"title":"Y"}
)

fig.show()

## Visualize x2 and y

We can see that the trend between x2 and y is less steep than the trend between x and y. This is because the coefficient of x2 (1) is smaller than that of x (2).
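
The two slopes can also be estimated directly. The sketch below fits y against x and y against x2 separately with `np.polyfit`. Because x and x2 are drawn independently, the two estimates should land close to the coefficients 2 and 1, though not exactly on them. The seed and the `_demo` names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # assumption: arbitrary seed
x_demo = rng.integers(low=1, high=10, size=100)
x2_demo = rng.integers(low=-5, high=6, size=100)
y_demo = 2 * x_demo + 1 * x2_demo + 10

# Slope of y against each variable on its own (intercepts discarded).
slope_x, _ = np.polyfit(x_demo, y_demo, deg=1)
slope_x2, _ = np.polyfit(x2_demo, y_demo, deg=1)

print(f"slope of y vs x:  {slope_x:.2f}")
print(f"slope of y vs x2: {slope_x2:.2f}")
```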

In [None]:
fig = go.Figure()

trace_0 = {
    "x":x2,
    "y":y,
    "type":"scatter",
    "mode":"markers",
    "name":"Observed"
}

fig.add_trace(trace_0)

fig.update_layout(
    title="Linear Regression",
    xaxis={"title":"X2"},
    yaxis={"title":"Y"}
)

fig.show()

## Perform OLS regression analysis

In [None]:
import statsmodels.api as sm

# Regress y on x alone; add_constant adds the intercept column
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.710
Model:                            OLS   Adj. R-squared:                  0.707
Method:                 Least Squares   F-statistic:                     239.8
Date:                Tue, 31 Aug 2021   Prob (F-statistic):           4.37e-28
Time:                        11:30:25   Log-Likelihood:                -261.74
No. Observations:                 100   AIC:                             527.5
Df Residuals:                      98   BIC:                             532.7
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.8962      0.717     13.807      0.0

In [None]:
import statsmodels.api as sm

# Regress y on x2 alone; add_constant adds the intercept column
model = sm.OLS(y, sm.add_constant(x2))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.357
Method:                 Least Squares   F-statistic:                     56.01
Date:                Tue, 31 Aug 2021   Prob (F-statistic):           3.15e-11
Time:                        11:30:53   Log-Likelihood:                -301.02
No. Observations:                 100   AIC:                             606.0
Df Residuals:                      98   BIC:                             611.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         19.2747      0.499     38.597      0.0

In [None]:
import statsmodels.api as sm

# Regress y on both x and x2; each variable becomes one column of the design matrix
model = sm.OLS(y, sm.add_constant(np.column_stack((x, x2))))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.532e+30
Date:                Tue, 31 Aug 2021   Prob (F-statistic):               0.00
Time:                        11:26:53   Log-Likelihood:                 2982.6
No. Observations:                 100   AIC:                            -5959.
Df Residuals:                      97   BIC:                            -5951.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.0000   5.86e-15   1.71e+15      0.0
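
As a cross-check that does not depend on statsmodels, the same ordinary least squares problem can be solved with `np.linalg.lstsq`. This is a NumPy sketch on freshly generated data, not a reproduction of how statsmodels computes its results. The seed and the `_demo` names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)  # assumption: arbitrary seed
x_demo = rng.integers(low=1, high=10, size=100)
x2_demo = rng.integers(low=-5, high=6, size=100)
y_demo = 2 * x_demo + 1 * x2_demo + 10

# Design matrix with an explicit column of ones for the intercept.
design = np.column_stack((np.ones(x_demo.size), x_demo, x2_demo))

# Solve the least-squares problem: minimize ||design @ beta - y_demo||.
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

intercept, coef_x, coef_x2 = beta
print(np.round(beta, 6))
```

Since y contains no error term here, the fit should recover the generating values (an intercept of 10 and coefficients of 2 and 1) up to floating-point rounding.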