{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "zm7ji3Gcb7yz" }, "source": [ "# Chapter 11 - Linear Regression: Development vs Culture\n", "\n", "In this chapter, we use visualizations to understand linear regression.\n", "\n", "We assume a linear relationship between the variables X and Y of the form:\n", "\n", " `y = 2 * x + 1`\n", "\n", " This means that each time x increases by one unit, y increases by two units. This is a positive linear relationship and can be visualized as a straight, upward-sloping line. We treat this equation as the underlying theory about X and Y.\n", "\n", "In reality, however, when we measure X and Y (for example, a person's height and weight), the instruments we use may not be 100% precise, and our measuring and reading may not be 100% accurate. This imprecision and inaccuracy introduce errors into the collected data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gBFexiv2RCWo" }, "outputs": [], "source": [ "import numpy as np\n", "import plotly.graph_objects as go" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LHLp9l03gjqv" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "K2o6dI5QdWaz" }, "source": [ "# Generate Sample Data\n", "First, let's use NumPy to generate some random samples for X and\n", "calculate Y from X using the equation `y = 2 * x + 1`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tvboUxVMUQZU", "outputId": "dddc4ccb-0310-41a0-eead-a20fcde6429c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[5 9 3 8 9 1 8 4 9 5]\n", "[11 19 7 17 19 3 17 9 19 11]\n" ] } ], "source": [ "x = np.random.randint(low=1, high=10, size=100) # random integers between 1 and 10 (10 is not included)\n", "\n", 
"y = x * 2 + 1\n", "\n", "print(x[:10]) # print the first 10 x's\n", "print(y[:10]) # print the last 10 y's" ] }, { "cell_type": "markdown", "metadata": { "id": "RI4vXarshITW" }, "source": [ "## Visualize the sample data\n", "\n", "This shows a straight line since we did not account for any measurement errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7H10sqvcUbX_", "outputId": "42fdef27-9ad4-4d30-ab0b-e3ac0d502c38" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = go.Figure()\n", "\n", "trace_0 = {\n", " \"x\":x,\n", " \"y\":y,\n", " \"type\":\"scatter\",\n", " \"mode\":\"markers+lines\",\n", " \"name\":\"Observed\"\n", "}\n", "\n", "fig.add_trace(trace_0)\n", "\n", "fig.update_layout(\n", " title=\"Linear Regression\",\n", " xaxis={\"title\":\"X\"},\n", " yaxis={\"title\":\"Y\"}\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "N6rVc-HohmJx" }, "source": [ "## Add errors to the equation\n", "\n", "We use NumPy's normal distribution to generate random errors. We assume the mean (location) of the errors is 0 and the standard deviation (scale) is 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XBwSsue3hvYr", "outputId": "1f0bf95b-03a4-4349-d5d9-1a657fd8e691" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[5 9 3 8 9 1 8 4 9 5]\n", "[11.54283446 20.45395529 6.64919017 15.88312417 19.86741109 2.80419728\n", " 16.83831521 10.34320749 18.60596667 8.07504056]\n" ] } ], "source": [ "error = np.random.normal(loc=0, scale=1, size=100)\n", "\n", "y = x * 2 + 1 + error\n", "\n", "print(x[:10]) # print the first 10 x's\n", "print(y[:10]) # print the first 10 y's" ] }, { "cell_type": "markdown", "metadata": { "id": "7U5zLFNLiowK" }, "source": [ "## Visualize the sample data\n", "\n", "Now we no longer see a straight line. Y varies considerably for any given X." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3_gArKb8ifiq", "outputId": "72d5bc0a-900c-4f1b-84a6-3226f57aa5dc" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = go.Figure()\n", "\n", "trace_0 = {\n", " \"x\":x,\n", " \"y\":y,\n", " \"type\":\"scatter\",\n", " \"mode\":\"markers\",\n", " \"name\":\"Observed\"\n", "}\n", "\n", "fig.add_trace(trace_0)\n", "\n", "fig.update_layout(\n", " title=\"Linear Regression\",\n", " xaxis={\"title\":\"X\"},\n", " yaxis={\"title\":\"Y\"}\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "27zS7JRXjCTW" }, "source": [ "## Add the theoretical line to the chart" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_Ibob4gfXPr5", "outputId": "d3755575-be71-4cb8-f743-947111b37243" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Endpoints of the theoretical line y = 2 * x + 1 at x = 0 and x = 10\n", "trace_1 = {\n", " \"x\":[0,10],\n", " \"y\":[1,21],\n", " \"type\":\"scatter\",\n", " \"mode\":\"lines\",\n", " \"name\":\"Theoretical\"\n", "}\n", "\n", "fig.add_trace(trace_1)\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "n9YgXRqvjQOa" }, "source": [ "## Plot the distribution of errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "h48eO3NgaFP8", "outputId": "e1b1cc00-a711-449a-80e6-002958b356ff" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig_error = go.Figure()\n", "\n", "trace_0 = {\n", " \"x\":error,\n", " \"type\":\"histogram\"\n", "}\n", "\n", "fig_error.add_trace(trace_0)\n", "\n", "fig_error.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "REnrH_Me2OM8" }, "source": [ "## Let's add another independent variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BPQbPsJj8mun", "outputId": "f7c733e5-3c71-4ed9-c851-08f27fd1cb6d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[5 7 2 2 3 3 7 2 6 5]\n", "[-3 3 -1 -4 4 4 5 -3 3 2]\n", "[17 27 13 10 20 20 29 11 25 22]\n" ] } ], "source": [ "x = np.random.randint(low=1, high=10, size=100) # random integers between 1 and 10 (10 is not included)\n", "\n", "x2 = np.random.randint(low=-5, high=6, size=100) # random integers between -5 and 5 (6 is not included)\n", "\n", "y = 2 * x + 1 * x2 + 10\n", "\n", "print(x[:10]) # print the first 10 x's\n", "print(x2[:10]) # print the first 10 x2's\n", "print(y[:10]) # print the first 10 y's" ] }, { "cell_type": "markdown", "metadata": { "id": "1pF8rf9-83YG" }, "source": [ "## Visualize x and y" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-iGMH9K73KU3", "outputId": "2b7df5d5-b2df-44d4-9d71-3164bd8dcb07" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = go.Figure()\n", "\n", "trace_0 = {\n", " \"x\":x,\n", " \"y\":y,\n", " \"type\":\"scatter\",\n", " \"mode\":\"markers\",\n", " \"name\":\"Observed\"\n", "}\n", "\n", "fig.add_trace(trace_0)\n", "\n", "fig.update_layout(\n", " title=\"Linear Regression\",\n", " xaxis={\"title\":\"X\"},\n", " yaxis={\"title\":\"Y\"}\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "pUtkzIB_9EAu" }, "source": [ "## Visualize x2 and y\n", "We can see that the trend between x2 and y is not as steep as the trend between x and y. This is because the coefficient of x2 (1) is smaller than that of x (2)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "rWXfY_tH8_tl", "outputId": "7617b945-a3b6-48e0-93a0-0499b578b022" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = go.Figure()\n", "\n", "trace_0 = {\n", " \"x\":x2,\n", " \"y\":y,\n", " \"type\":\"scatter\",\n", " \"mode\":\"markers\",\n", " \"name\":\"Observed\"\n", "}\n", "\n", "fig.add_trace(trace_0)\n", "\n", "fig.update_layout(\n", " title=\"Linear Regression\",\n", " xaxis={\"title\":\"X2\"},\n", " yaxis={\"title\":\"Y\"}\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "5q50EIeO-LIG" }, "source": [ "## Perform OLS regression analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eFVvgalx-Rzw", "outputId": "1704b23d-d354-451a-a488-10afb560b75d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.710\n", "Model: OLS Adj. R-squared: 0.707\n", "Method: Least Squares F-statistic: 239.8\n", "Date: Tue, 31 Aug 2021 Prob (F-statistic): 4.37e-28\n", "Time: 11:30:25 Log-Likelihood: -261.74\n", "No. Observations: 100 AIC: 527.5\n", "Df Residuals: 98 BIC: 532.7\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 9.8962 0.717 13.807 0.000 8.474 11.319\n", "x1 2.1060 0.136 15.487 0.000 1.836 2.376\n", "==============================================================================\n", "Omnibus: 67.994 Durbin-Watson: 1.936\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.562\n", "Skew: -0.260 Prob(JB): 0.0138\n", "Kurtosis: 1.664 Cond. No. 
11.6\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "import statsmodels.api as sm\n", "\n", "model = sm.OLS(y,sm.add_constant(x))\n", "results = model.fit()\n", "print(results.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TA2EsJOl-bV8", "outputId": "6949858b-db7f-44d0-c1d7-b4af51e9e7a9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.364\n", "Model: OLS Adj. R-squared: 0.357\n", "Method: Least Squares F-statistic: 56.01\n", "Date: Tue, 31 Aug 2021 Prob (F-statistic): 3.15e-11\n", "Time: 11:30:53 Log-Likelihood: -301.02\n", "No. Observations: 100 AIC: 606.0\n", "Df Residuals: 98 BIC: 611.3\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 19.2747 0.499 38.597 0.000 18.284 20.266\n", "x1 1.1162 0.149 7.484 0.000 0.820 1.412\n", "==============================================================================\n", "Omnibus: 37.529 Durbin-Watson: 2.178\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.339\n", "Skew: 0.080 Prob(JB): 0.0420\n", "Kurtosis: 1.777 Cond. No. 
3.38\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "import statsmodels.api as sm\n", "\n", "model = sm.OLS(y,sm.add_constant(x2))\n", "results = model.fit()\n", "print(results.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "22pLlCXt3yPJ", "outputId": "97b7bfed-fcfe-4ee7-b718-c6e9ca089f03" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 1.000\n", "Model: OLS Adj. R-squared: 1.000\n", "Method: Least Squares F-statistic: 2.532e+30\n", "Date: Tue, 31 Aug 2021 Prob (F-statistic): 0.00\n", "Time: 11:26:53 Log-Likelihood: 2982.6\n", "No. Observations: 100 AIC: -5959.\n", "Df Residuals: 97 BIC: -5951.\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 10.0000 5.86e-15 1.71e+15 0.000 10.000 10.000\n", "x1 2.0000 1.11e-15 1.8e+15 0.000 2.000 2.000\n", "x2 1.0000 8.25e-16 1.21e+15 0.000 1.000 1.000\n", "==============================================================================\n", "Omnibus: 24.013 Durbin-Watson: 1.215\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 5.431\n", "Skew: 0.100 Prob(JB): 0.0662\n", "Kurtosis: 1.876 Cond. No. 
11.7\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" ] } ], "source": [ "import statsmodels.api as sm\n", "\n", "model = sm.OLS(y,sm.add_constant(list(zip(x,x2))))\n", "results = model.fit()\n", "print(results.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RUtUZLOg6db2" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "ARrAkE6z2hdJ" }, "source": [] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyM5PxWhfsKqg9rFzqNWKAVg", "include_colab_link": true, "name": "chapter_02.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }