{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "_9JRextUSFeK" }, "source": [ "# Chapter 5 - Express to Many Destinations\n", "\n", "Plotly Express provides dozens of pre-built chart types, including histograms, bar charts, boxplots, line charts, scatter plots, geographic maps, choropleths, and many more. \n", "\n", "All Plotly Express chart functions follow the same easy-to-understand pattern:\n", "\n", "- Take a `pandas.DataFrame` object as the data source. \n", "- Use the `x` or `y` parameter with a column name from the data frame to indicate the data values for the X axis or Y axis.\n", "- Use additional parameters to control other visualization elements, including the title, color theme, width, height, axis labels, and many more. \n", "\n", "In this chapter, we will try various built-in charts in Plotly Express and appreciate its simplicity and versatility.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "AFhVdPILUXiy" }, "source": [ "## 5.1 Prepare Data\n", "\n", "Plotly comes with several built-in sample datasets in the form of Pandas data frames. We will use the gapminder dataset, which contains the population, GDP per capita, and life expectancy of countries from 1952 to 2007 at five-year intervals. 
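
As a quick sanity check on the description above, the following sketch uses only the Python standard library to enumerate the survey years implied by a five-year interval from 1952 to 2007. The variable names are illustrative, and the figure of 142 countries is an assumption taken from the summary statistics displayed later in this section.

```python
# Enumerate the survey years: 1952 through 2007 in five-year steps.
years = list(range(1952, 2007 + 1, 5))
print(years)
print(len(years))  # 12 survey years

# Assumption: 142 countries, each observed in every survey year.
n_countries = 142
expected_rows = n_countries * len(years)
print(expected_rows)  # 1704
```

If every country is observed in every survey year, this product should match the row count that `df.info()` reports for the dataset.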
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Fohl-aR1Szlp", "outputId": "9f325e06-cbd6-45ea-93c4-32d411e44a3a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (4.4.1)\n", "Collecting plotly\n", " Downloading plotly-5.3.1-py2.py3-none-any.whl (23.9 MB)\n", "\u001b[K |████████████████████████████████| 23.9 MB 1.4 MB/s \n", "\u001b[?25hRequirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from plotly) (1.15.0)\n", "Collecting tenacity>=6.2.0\n", " Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)\n", "Installing collected packages: tenacity, plotly\n", " Attempting uninstall: plotly\n", " Found existing installation: plotly 4.4.1\n", " Uninstalling plotly-4.4.1:\n", " Successfully uninstalled plotly-4.4.1\n", "Successfully installed plotly-5.3.1 tenacity-8.0.1\n" ] } ], "source": [ "# Upgrade Plotly library since Google Colab has an older version of Plotly installed.\n", "\n", "!pip install --upgrade plotly" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0N2PTQ7BY2FI" }, "outputs": [], "source": [ "# To use Plotly Express, import plotly.express module \n", "# Use px as an alias for easy reference later\n", "\n", "import plotly.express as px" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VgkwrVsoAERs", "outputId": "90b5eaf6-aea9-4677-90f2-7f301cd6dbeb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5.3.1\n" ] } ], "source": [ "# Display the Plotly version number\n", "\n", "import plotly\n", "\n", "print(plotly.__version__)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 519 }, "id": "a9fMXX-PAQuf", "outputId": "9470e70f-e123-4090-8b87-e0fb99873a34" }, 
"outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercapiso_alphaiso_num
0AfghanistanAsia195228.8018425333779.445314AFG4
1AfghanistanAsia195730.3329240934820.853030AFG4
2AfghanistanAsia196231.99710267083853.100710AFG4
3AfghanistanAsia196734.02011537966836.197138AFG4
4AfghanistanAsia197236.08813079460739.981106AFG4
5AfghanistanAsia197738.43814880372786.113360AFG4
6AfghanistanAsia198239.85412881816978.011439AFG4
7AfghanistanAsia198740.82213867957852.395945AFG4
8AfghanistanAsia199241.67416317921649.341395AFG4
9AfghanistanAsia199741.76322227415635.341351AFG4
10AfghanistanAsia200242.12925268405726.734055AFG4
11AfghanistanAsia200743.82831889923974.580338AFG4
12AlbaniaEurope195255.23012826971601.056136ALB8
13AlbaniaEurope195759.28014765051942.284244ALB8
14AlbaniaEurope196264.82017281372312.888958ALB8
\n", "
" ], "text/plain": [ " country continent year ... gdpPercap iso_alpha iso_num\n", "0 Afghanistan Asia 1952 ... 779.445314 AFG 4\n", "1 Afghanistan Asia 1957 ... 820.853030 AFG 4\n", "2 Afghanistan Asia 1962 ... 853.100710 AFG 4\n", "3 Afghanistan Asia 1967 ... 836.197138 AFG 4\n", "4 Afghanistan Asia 1972 ... 739.981106 AFG 4\n", "5 Afghanistan Asia 1977 ... 786.113360 AFG 4\n", "6 Afghanistan Asia 1982 ... 978.011439 AFG 4\n", "7 Afghanistan Asia 1987 ... 852.395945 AFG 4\n", "8 Afghanistan Asia 1992 ... 649.341395 AFG 4\n", "9 Afghanistan Asia 1997 ... 635.341351 AFG 4\n", "10 Afghanistan Asia 2002 ... 726.734055 AFG 4\n", "11 Afghanistan Asia 2007 ... 974.580338 AFG 4\n", "12 Albania Europe 1952 ... 1601.056136 ALB 8\n", "13 Albania Europe 1957 ... 1942.284244 ALB 8\n", "14 Albania Europe 1962 ... 2312.888958 ALB 8\n", "\n", "[15 rows x 8 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = px.data.gapminder()\n", "\n", "df.head(15) # display the first 15 rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yc_21wenBFgO", "outputId": "823746ea-e5e8-4a62-c1d4-c985d369bb5c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1704 entries, 0 to 1703\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 country 1704 non-null object \n", " 1 continent 1704 non-null object \n", " 2 year 1704 non-null category\n", " 3 lifeExp 1704 non-null float64 \n", " 4 pop 1704 non-null int64 \n", " 5 gdpPercap 1704 non-null float64 \n", " 6 iso_alpha 1704 non-null object \n", " 7 iso_num 1704 non-null int64 \n", "dtypes: category(1), float64(2), int64(2), object(3)\n", "memory usage: 95.4+ KB\n" ] } ], "source": [ "# Display the metadata information about the dataset:\n", "# Number of rows, number of columns, columns names and types, etc.\n", 
"\n", "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 394 }, "id": "2HAn1BK3BImS", "outputId": "d4e04664-9d3e-49b9-d4fa-9acc98c1ce53" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercapiso_alphaiso_num
count170417041704.01704.0000001.704000e+031704.00000017041704.000000
unique142512.0NaNNaNNaN141NaN
topCosta RicaAfrica2007.0NaNNaNNaNKORNaN
freq12624142.0NaNNaNNaN24NaN
meanNaNNaNNaN59.4744392.960121e+077215.327081NaN425.880282
stdNaNNaNNaN12.9171071.061579e+089857.454543NaN248.305709
minNaNNaNNaN23.5990006.001100e+04241.165877NaN4.000000
25%NaNNaNNaN48.1980002.793664e+061202.060309NaN208.000000
50%NaNNaNNaN60.7125007.023596e+063531.846989NaN410.000000
75%NaNNaNNaN70.8455001.958522e+079325.462346NaN638.000000
maxNaNNaNNaN82.6030001.318683e+09113523.132900NaN894.000000
\n", "
" ], "text/plain": [ " country continent year ... gdpPercap iso_alpha iso_num\n", "count 1704 1704 1704.0 ... 1704.000000 1704 1704.000000\n", "unique 142 5 12.0 ... NaN 141 NaN\n", "top Costa Rica Africa 2007.0 ... NaN KOR NaN\n", "freq 12 624 142.0 ... NaN 24 NaN\n", "mean NaN NaN NaN ... 7215.327081 NaN 425.880282\n", "std NaN NaN NaN ... 9857.454543 NaN 248.305709\n", "min NaN NaN NaN ... 241.165877 NaN 4.000000\n", "25% NaN NaN NaN ... 1202.060309 NaN 208.000000\n", "50% NaN NaN NaN ... 3531.846989 NaN 410.000000\n", "75% NaN NaN NaN ... 9325.462346 NaN 638.000000\n", "max NaN NaN NaN ... 113523.132900 NaN 894.000000\n", "\n", "[11 rows x 8 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display summary statistics, also known as descriptive statistics.\n", "\n", "df.describe(include=\"all\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dZgZAJpKBsuC", "outputId": "03208674-817e-4afa-a9d5-fe9181da106a" }, "outputs": [ { "data": { "text/plain": [ "[1952, 1957, 1962, 1967, 1972, ..., 1987, 1992, 1997, 2002, 2007]\n", "Length: 12\n", "Categories (12, int64): [1952, 1957, 1962, 1967, ..., 1992, 1997, 2002, 2007]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find out the unique years\n", "\n", "df[\"year\"].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ToyZa3GSWcb6", "outputId": "d154a1b0-1039-42db-ed3d-a62cc4cf87c0" }, "outputs": [ { "data": { "text/plain": [ "array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',\n", " 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',\n", " 'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',\n", " 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',\n", " 'Canada', 'Central African Republic', 'Chad', 'Chile', 
'China',\n", " 'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',\n", " 'Costa Rica', \"Cote d'Ivoire\", 'Croatia', 'Cuba', 'Czech Republic',\n", " 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',\n", " 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',\n", " 'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',\n", " 'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',\n", " 'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',\n", " 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',\n", " 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',\n", " 'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',\n", " 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',\n", " 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',\n", " 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',\n", " 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',\n", " 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',\n", " 'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',\n", " 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',\n", " 'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',\n", " 'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',\n", " 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',\n", " 'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',\n", " 'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',\n", " 'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.',\n", " 'Zambia', 'Zimbabwe'], dtype=object)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find out the countries\n", "\n", "df[\"country\"].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DDFpN8n2Qk2n", "outputId": "410bfea4-be1f-41c4-ad3a-cbfc9bbf8214" }, "outputs": [ { "data": { "text/plain": 
[ "array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find out unique continents\n", "\n", "df[\"continent\"].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 425 }, "id": "sZAlOykCCIba", "outputId": "b5d980a1-588a-4065-99a4-9dd2c43f4815" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrycontinentyearlifeExppopgdpPercapiso_alphaiso_num
1608United StatesAmericas195268.44015755300013990.48208USA840
1609United StatesAmericas195769.49017198400014847.12712USA840
1610United StatesAmericas196270.21018653800016173.14586USA840
1611United StatesAmericas196770.76019871200019530.36557USA840
1612United StatesAmericas197271.34020989600021806.03594USA840
1613United StatesAmericas197773.38022023900024072.63213USA840
1614United StatesAmericas198274.65023218783525009.55914USA840
1615United StatesAmericas198775.02024280353329884.35041USA840
1616United StatesAmericas199276.09025689418932003.93224USA840
1617United StatesAmericas199776.81027291176035767.43303USA840
1618United StatesAmericas200277.31028767552639097.09955USA840
1619United StatesAmericas200778.24230113994742951.65309USA840
\n", "
" ], "text/plain": [ " country continent year ... gdpPercap iso_alpha iso_num\n", "1608 United States Americas 1952 ... 13990.48208 USA 840\n", "1609 United States Americas 1957 ... 14847.12712 USA 840\n", "1610 United States Americas 1962 ... 16173.14586 USA 840\n", "1611 United States Americas 1967 ... 19530.36557 USA 840\n", "1612 United States Americas 1972 ... 21806.03594 USA 840\n", "1613 United States Americas 1977 ... 24072.63213 USA 840\n", "1614 United States Americas 1982 ... 25009.55914 USA 840\n", "1615 United States Americas 1987 ... 29884.35041 USA 840\n", "1616 United States Americas 1992 ... 32003.93224 USA 840\n", "1617 United States Americas 1997 ... 35767.43303 USA 840\n", "1618 United States Americas 2002 ... 39097.09955 USA 840\n", "1619 United States Americas 2007 ... 42951.65309 USA 840\n", "\n", "[12 rows x 8 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only use data for the United States\n", "\n", "df_usa = df.query(\"iso_alpha == 'USA'\")\n", "df_usa" ] }, { "cell_type": "markdown", "metadata": { "id": "q8kQLXzkdYm0" }, "source": [ "## 5.2 Create a Histogram\n", "\n", "A histogram helps visualize the distribution of a numerical variable, including its central tendency, dispersion, and skewness. \n", "\n", "We will create a histogram of life expectancy to see how it is distributed globally. To do so, we pick a single year, 2007. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "INdjqUTdcyFN", "outputId": "d3b6cb43-5e82-48e9-db71-30d475269469" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = px.histogram(df.query(\"year == 2007\"), x=\"lifeExp\")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "NHBHlRAkepav" }, "source": [ "The visualization is interactive, so we can mouse over it to see more details. If we mouse over the blue bar at the far left, below 40, it shows that just one country has a life expectancy below 40 years. \n", "\n", "The histogram above shows that the largest group of countries has a life expectancy between 70 and 80 years, and that the distribution is skewed to the left, with a long tail toward lower values.\n", "\n", "The code is very simple, just one line:\n", "```\n", "fig = px.histogram(df.query(\"year == 2007\"), x=\"lifeExp\")\n", "```\n", "\n", "Here, we provide a Pandas data frame, `df.query(\"year == 2007\")`, and the name of a column in that data frame, `lifeExp`, for the input parameter `x`. To make the code easier to read, we typically break the line into several lines with indentation:\n", "\n", "```\n", "fig = px.histogram(\n", "    df.query(\"year == 2007\"),\n", "    x=\"lifeExp\"\n", ")\n", "```\n", "\n", "There are many more input parameters available for further customization. For example, this visualization can be improved by adding a title via the parameter `title`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "BnIWAU6jjOu-", "outputId": "f8c6d263-5f7a-48a6-f106-66dc01cc9dab" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Break a single line into multiple lines for readability\n", "\n", "fig = px.histogram(\n", "    df.query(\"year == 2007\"),\n", "    x=\"lifeExp\",\n", "    title=\"The Global Distribution of Life Expectancy in 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "-2SzJQzcnPqt" }, "source": [ "## 5.3 Create a Boxplot\n", "\n", "According to the Plotly Express documentation (https://plotly.com/python/plotly-express/):\n", "\n", "> Plotly Express provides more than 30 functions for creating different types of figures. The API for these functions was carefully designed to be as consistent and easy to learn as possible, making it easy to switch from a scatter plot to a bar chart to a histogram to a sunburst chart throughout a data exploration session.\n", "\n", "So, if we want to create a boxplot, we can simply replace the function name `histogram` with `box` and change the input parameter from `x` to `y`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "JBbSZHsQpREz", "outputId": "29c64480-ff41-45d0-da2b-9b4bea2d900a" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# A vertical boxplot\n", "\n", "fig = px.box(\n", "    df.query(\"year == 2007\"),\n", "    y=\"lifeExp\",\n", "    title=\"The Summary Statistics of Life Expectancy in 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "nkYTY7m5qj2C" }, "source": [ "We change the parameter to `y` because the default orientation for a boxplot is vertical. \n", "\n", "If we want a horizontal boxplot, we use the input parameter `x` instead of `y` and set the input parameter `orientation` to `h`. Here `h` stands for horizontal and `v` stands for vertical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "_s2kcBwOrHIf", "outputId": "42cc0ea5-372f-42e3-aa34-b8bfc6eef956" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# A horizontal boxplot\n", "\n", "fig = px.box(\n", "    df.query(\"year == 2007\"),\n", "    x=\"lifeExp\",\n", "    orientation='h',\n", "    title=\"The Summary Statistics of Life Expectancy in 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "jFvHXe2faRgA" }, "source": [ "## 5.4 Comparing Two Years\n", "\n", "We can compare changes in life expectancy between 1952 and 2007 by creating a visualization with two histograms or two boxplots.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "r-O76vkiuZBv", "outputId": "6447bc40-83cf-4b6e-969f-5a370dfd4019" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Compare two years using a histogram\n", "\n", "fig = px.histogram(\n", "    df.query(\"year == 1952 or year == 2007\"),\n", "    x=\"lifeExp\",\n", "    color=\"year\",\n", "    title=\"Comparing Life Expectancy between 1952 and 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "0MlYwaG2u30C" }, "source": [ "Here, we include data for both 1952 and 2007. We assign the column name `year` to the input parameter `color` so that the visualization uses different colors to differentiate the two years. \n", "\n", "We can see that the histogram has shifted from left to right between 1952 (blue) and 2007 (red), indicating that life expectancy around the world has increased. \n", "Specifically, in 1952, no country had a life expectancy above 75 years and more than a dozen countries had a life expectancy below 35. In 2007, many countries had a life expectancy above 75 and no country had one below 35.\n", "\n", "We can do the same for the boxplot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "IahCLBjVsbwZ", "outputId": "3ddc0345-ea2f-4ad7-cd46-8338e1282c33" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Compare two years using a boxplot\n", "\n", "fig = px.box(\n", "    df.query(\"year == 1952 or year == 2007\"),\n", "    y=\"lifeExp\",\n", "    color=\"year\",\n", "    title=\"Comparison of Life Expectancy between 1952 and 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "mVcITDlgxgfA" }, "source": [ "If we mouse over the boxes, we will see that the median life expectancy was about 45 years in 1952 and about 72 years in 2007. People were living considerably longer half a century later." ] }, { "cell_type": "markdown", "metadata": { "id": "kQB6FiDnLdkK" }, "source": [ "## 5.5 Scatter Plot\n", "\n", "A scatter plot visualizes the relationship between two variables. Here we plot GDP per capita against life expectancy to see if they are correlated. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "jTNFWn4OM05k", "outputId": "503d41e1-6266-41d0-c478-79b93987f5d0" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Scatter plot of life expectancy vs GDP per capita\n", "\n", "fig = px.scatter(\n", "    df.query(\"year == 2007\"),\n", "    x=\"gdpPercap\",\n", "    y=\"lifeExp\",\n", "    title=\"Life Expectancy vs GDP per Capita 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "P5u4aUjuNEfO" }, "source": [ "The scatter plot indicates that countries with higher GDP per capita tend to have higher life expectancy, even though the relationship does not appear to be linear. \n", "\n", "When we mouse over a point, we see the values of both the X and Y variables, namely `gdpPercap` and `lifeExp`. This does not tell us which country the point represents.\n", "\n", "To see the country name, we can set the parameter `hover_name` to the column `country`. This adds the country name to the hover data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "QOhB_X0YOXG-", "outputId": "df547432-8147-43dc-d924-7e48146fd8bd" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Add the country name to the hover data\n", "\n", "fig = px.scatter(\n", "    df.query(\"year == 2007\"),\n", "    x=\"gdpPercap\",\n", "    y=\"lifeExp\",\n", "    hover_name=\"country\",\n", "    title=\"Life Expectancy vs GDP per Capita 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "3yqMToofOupd" }, "source": [ "We can have the scatter plot display the country code next to each point by setting the parameter `text` to the column `iso_alpha`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "BLIxRVr5O_MH", "outputId": "00142db4-d4f5-483a-efd7-7c17089c4d6c" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Scatter plot to show relationship between health and wealth\n", "\n", "fig = px.scatter(\n", "    df.query(\"year == 2007\"),\n", "    x=\"gdpPercap\",\n", "    y=\"lifeExp\",\n", "    text=\"iso_alpha\",\n", "    hover_name=\"country\",\n", "    title=\"Life Expectancy vs GDP per Capita 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "18d0CcSRPnbf" }, "source": [ "The plot becomes too crowded with labels. If we limit the number of countries, it will be more readable. For example, we can display only the countries from a certain continent." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "7ESyRLmMQyvN", "outputId": "4da2028c-c19c-4699-ad21-7f78994773c5" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Scatter plot to show relationship between health and wealth\n", "\n", "fig = px.scatter(\n", "    df.query(\"continent == 'Americas' and year == 2007\"),\n", "    x=\"gdpPercap\",\n", "    y=\"lifeExp\",\n", "    text=\"iso_alpha\",\n", "    hover_name=\"country\",\n", "    title=\"Life Expectancy vs GDP per Capita 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "Qe1MiJ-nSf7h" }, "source": [ "This plot looks better. However, the country codes overlap with the markers. \n", "\n", "Plotly Express's scatter function does not provide a parameter to change the position of the text. We need to resort to the core functions of Plotly to address this, by invoking the `update_traces()` method of the figure object and setting the `textposition` parameter. \n", "\n", "In general, we start with Plotly Express's built-in plotting functions and then use Plotly's core functions for additional customizations that Plotly Express does not provide. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "pb2qrRLVTW7t", "outputId": "c5d770b3-8f1e-4944-fc68-0b4936503015" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Scatter plot to show relationship between health and wealth\n", "\n", "fig = px.scatter(\n", "    df.query(\"continent == 'Americas' and year == 2007\"),\n", "    x=\"gdpPercap\",\n", "    y=\"lifeExp\",\n", "    text=\"iso_alpha\",\n", "    hover_name=\"country\",\n", "    title=\"Life Expectancy vs GDP per Capita 2007\"\n", ")\n", "\n", "fig.update_traces(textposition='top center')\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "sQgMw1dizPSB" }, "source": [ "## 5.6 Line Chart\n", "\n", "Here we want to look at just one country and see how its life expectancy has changed over time. We will choose the United States as an example. We use the column `iso_alpha` to filter the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "Ir-7VznlWahQ", "outputId": "1b184f65-a965-4784-9c64-e96d12be5cdd" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Line chart\n", "\n", "fig = px.line(\n", "    data_frame=df.query(\"iso_alpha == 'USA'\"),\n", "    x=\"year\",\n", "    y=\"lifeExp\",\n", "    title=\"Life Expectancy of the United States between 1952 and 2007\"\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "cMRzzwrEWhhA" }, "source": [ "There are several issues with this chart:\n", "- The Y axis does not start at zero. This makes the increasing trend of life expectancy over time look more dramatic.\n", "- The X axis ticks do not correspond to the years in the data.\n", "- There are no markers to show the years in the data. \n", "\n", "To address these issues, we can resort to the core functions of Plotly, which allow for flexible customization of visualizations. \n", "\n", "The following lines of code address these issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "t5wKwykeB0UA", "outputId": "aefeb80c-fe7e-4fc2-c0af-310ff017ddcf" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig.update_xaxes(type='category')\n", "fig.update_yaxes(rangemode=\"tozero\")\n", "fig.update_traces(mode='markers+lines')\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "AW3HrPJ6V7ps" }, "source": [ "Plotly Express is simple and powerful but the simplicity comes with limitations. We will illustrate this using its line chart function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 542 }, "id": "dW6X3qa6X5Pg", "outputId": "7c958e41-8ba7-49ed-8e5e-2f8af19f7844" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_usa_chn_ind = df[df[\"iso_alpha\"].isin([\"USA\",\"CHN\",\"IND\"])]\n", "\n", "fig = px.line(df_usa_chn_ind, x=\"year\", y=\"lifeExp\", color=\"iso_alpha\")\n", "\n", "fig.update_traces(mode='markers+lines')\n", "fig.update_xaxes(type='category', title=\"Year\")\n", "fig.update_yaxes(rangemode=\"tozero\", title=\"Life Expectancy\")\n", "fig.update_layout(title=\"Comparison of Life Expectancy between US, China, and India\")\n", "fig.show()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 235 }, "id": "XNiwGn08tptQ", "outputId": "d55a0c48-bcf9-4f86-e9ab-de2f8a390a22" }, "outputs": [], "source": [ "# The core Plotly API lives in plotly.graph_objects, conventionally imported as go\n", "import plotly.graph_objects as go\n", "\n", "df_usa_chn_ind = df[df[\"iso_alpha\"].isin([\"USA\",\"CHN\",\"IND\"])]\n", "\n", "fig = go.Figure()\n", "\n", "# Add one line trace per country\n", "for iso_alpha in df_usa_chn_ind[\"iso_alpha\"].unique():\n", "    df_temp = df_usa_chn_ind.query(f\"iso_alpha == '{iso_alpha}'\")\n", "    trace = go.Scatter(x=df_temp[\"year\"], y=df_temp[\"lifeExp\"], mode='markers+lines', name=iso_alpha)\n", "    fig.add_trace(trace)\n", "\n", "# fig.update_traces(mode='markers+lines')\n", "fig.update_xaxes(type='category', title=\"Year\")\n", "fig.update_yaxes(rangemode=\"tozero\", title=\"Life Expectancy\")\n", "fig.update_layout(title=\"Comparison of Life Expectancy between US, China, and India\")\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rwxWQtqu7Fx3" }, "outputs": [], "source": [ "df_usa_chn_ind = df[df[\"iso_alpha\"].isin([\"USA\",\"CHN\",\"IND\"])]\n", "\n", "fig = px.bar(\n", "    df_usa_chn_ind,\n", "    x=\"year\",\n", "    y=\"lifeExp\",\n", "    text=\"lifeExp\",\n", "    color=\"iso_alpha\"\n", ")\n", "\n", "fig.update_xaxes(type='category')\n", "fig.update_yaxes(rangemode=\"tozero\")\n", "#fig.update_traces(mode='markers+lines')\n", "# Change the bar mode\n", "# fig.update_layout(barmode='group')\n", "fig.update_layout(barmode='stack', xaxis_tickangle=-45)\n", "fig.show()" ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyOHMEn2rIHvXTVntU5X+GgP", "include_colab_link": true, "name": "plotly_express.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, 
"nbformat": 4, "nbformat_minor": 0 }