Chapter 5 - Express to Many Destinations

Plotly Express provides dozens of pre-built chart types, including histograms, bar charts, boxplots, line charts, scatter plots, geographic maps, choropleths, and many more.

All Plotly Express chart functions follow a similar, easy-to-learn pattern:

  • Take a pandas DataFrame as the data source.

  • Use the x or y parameter with a column name from the data frame to specify the values for the X axis or Y axis.

  • Use additional parameters to control other visual elements, including the title, color theme, width, height, labels for the X and Y axes, and many more.

In this chapter, we will try various built-in charts in Plotly Express and appreciate its simplicity and versatility.

5.1 Prepare Data

Plotly comes with several built-in sample datasets in the form of pandas DataFrames. We will use the gapminder dataset, which contains the population, GDP per capita, and life expectancy of countries at five-year intervals from 1952 to 2007.

# Upgrade Plotly library since Google Colab has an older version of Plotly installed.

!pip install --upgrade plotly
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in /home/codespace/.local/lib/python3.8/site-packages (5.3.1)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from plotly) (1.14.0)
Requirement already satisfied: tenacity>=6.2.0 in /home/codespace/.local/lib/python3.8/site-packages (from plotly) (8.0.1)
# To use Plotly Express, import plotly.express module 
# Use px as an alias for easy reference later

import plotly.express as px
# Display the Plotly version number

import plotly

print(plotly.__version__)
5.3.1
df = px.data.gapminder()

df.head(15)       # display the first 15 rows
country continent year lifeExp pop gdpPercap iso_alpha iso_num
0 Afghanistan Asia 1952 28.801 8425333 779.445314 AFG 4
1 Afghanistan Asia 1957 30.332 9240934 820.853030 AFG 4
2 Afghanistan Asia 1962 31.997 10267083 853.100710 AFG 4
3 Afghanistan Asia 1967 34.020 11537966 836.197138 AFG 4
4 Afghanistan Asia 1972 36.088 13079460 739.981106 AFG 4
5 Afghanistan Asia 1977 38.438 14880372 786.113360 AFG 4
6 Afghanistan Asia 1982 39.854 12881816 978.011439 AFG 4
7 Afghanistan Asia 1987 40.822 13867957 852.395945 AFG 4
8 Afghanistan Asia 1992 41.674 16317921 649.341395 AFG 4
9 Afghanistan Asia 1997 41.763 22227415 635.341351 AFG 4
10 Afghanistan Asia 2002 42.129 25268405 726.734055 AFG 4
11 Afghanistan Asia 2007 43.828 31889923 974.580338 AFG 4
12 Albania Europe 1952 55.230 1282697 1601.056136 ALB 8
13 Albania Europe 1957 59.280 1476505 1942.284244 ALB 8
14 Albania Europe 1962 64.820 1728137 2312.888958 ALB 8
# Display the metadata information about the dataset:
# Number of rows, number of columns, columns names and types, etc.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
 6   iso_alpha  1704 non-null   object 
 7   iso_num    1704 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 106.6+ KB
# Display summary statistics, also known as descriptive statistics.

df.describe(include="all")
country continent year lifeExp pop gdpPercap iso_alpha iso_num
count 1704 1704 1704.00000 1704.000000 1.704000e+03 1704.000000 1704 1704.000000
unique 142 5 NaN NaN NaN NaN 141 NaN
top Afghanistan Africa NaN NaN NaN NaN KOR NaN
freq 12 624 NaN NaN NaN NaN 24 NaN
mean NaN NaN 1979.50000 59.474439 2.960121e+07 7215.327081 NaN 425.880282
std NaN NaN 17.26533 12.917107 1.061579e+08 9857.454543 NaN 248.305709
min NaN NaN 1952.00000 23.599000 6.001100e+04 241.165876 NaN 4.000000
25% NaN NaN 1965.75000 48.198000 2.793664e+06 1202.060309 NaN 208.000000
50% NaN NaN 1979.50000 60.712500 7.023596e+06 3531.846989 NaN 410.000000
75% NaN NaN 1993.25000 70.845500 1.958522e+07 9325.462346 NaN 638.000000
max NaN NaN 2007.00000 82.603000 1.318683e+09 113523.132900 NaN 894.000000
# Find out the unique years

df["year"].unique()
array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])
# Find out the countries

df["country"].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
       'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
       'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
       'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
       'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
       'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
       'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
       'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.',
       'Zambia', 'Zimbabwe'], dtype=object)
# Find out unique continents

df["continent"].unique()
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
# Only use data for United States

df_usa = df.query("iso_alpha == 'USA'")
df_usa
country continent year lifeExp pop gdpPercap iso_alpha iso_num
1608 United States Americas 1952 68.440 157553000 13990.48208 USA 840
1609 United States Americas 1957 69.490 171984000 14847.12712 USA 840
1610 United States Americas 1962 70.210 186538000 16173.14586 USA 840
1611 United States Americas 1967 70.760 198712000 19530.36557 USA 840
1612 United States Americas 1972 71.340 209896000 21806.03594 USA 840
1613 United States Americas 1977 73.380 220239000 24072.63213 USA 840
1614 United States Americas 1982 74.650 232187835 25009.55914 USA 840
1615 United States Americas 1987 75.020 242803533 29884.35041 USA 840
1616 United States Americas 1992 76.090 256894189 32003.93224 USA 840
1617 United States Americas 1997 76.810 272911760 35767.43303 USA 840
1618 United States Americas 2002 77.310 287675526 39097.09955 USA 840
1619 United States Americas 2007 78.242 301139947 42951.65309 USA 840

5.2 Create a Histogram

A histogram helps visualize the distribution of a numerical variable, including its centrality, dispersion, and skewness.

We will create a histogram of life expectancy to see how it is distributed globally. To do so, we pick a single year, 2007.

fig = px.histogram(df.query("year == 2007"), x="lifeExp")

fig.show()

The visualization is interactive, so we can mouse over it to see more details. If we mouse over the blue bar below 40, it shows that just one country has a life expectancy below 40 years.

The histogram shows that most countries have a life expectancy between 70 and 80 years and that the distribution is skewed to the left, with a long tail toward lower values.

The code is very simple with just one line:

fig = px.histogram(df.query("year == 2007"), x="lifeExp")

Here, we provide a pandas DataFrame, df.query("year == 2007"), as the data source and the column lifeExp as the input parameter x. To make the code easier to read, we typically break the call across several lines with indentation:

fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp"
)

There are many more input parameters available for further customization. For example, we can improve this visualization by adding a title via the parameter title.

# Break a single line into multiple lines for readability

fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp",
    title="The Global Distribution of Life Expectancy in 2007"
)

fig.show()

5.3 Create a Boxplot

According to the Plotly Express documentation (https://plotly.com/python/plotly-express/):

Plotly Express provides more than 30 functions for creating different types of figures. The API for these functions was carefully designed to be as consistent and easy to learn as possible, making it easy to switch from a scatter plot to a bar chart to a histogram to a sunburst chart throughout a data exploration session.

So, if we want to create a boxplot, we can simply replace the function name histogram with box and change the input parameter to y.

# A vertical boxplot

fig = px.box(
    df.query("year == 2007"),
    y="lifeExp",
    title="The Summary Statistics of Life Expectancy in 2007"
)

fig.show()

We change the parameter to y because the default orientation for a boxplot is vertical.

If we want a horizontal boxplot, we use the input parameter x instead of y and set the input parameter orientation to h. Here h stands for horizontal and v stands for vertical.

# A horizontal boxplot

fig = px.box(
    df.query("year == 2007"),
    x="lifeExp",
    orientation='h',
    title="The Summary Statistics of Life Expectancy in 2007"
)

fig.show()

5.4 Comparing Two Years

We can compare changes in life expectancy between 1952 and 2007 by creating a visualization with two histograms or two boxplots.

# Compare two years using Histogram

fig = px.histogram(
    df.query("year == 1952 or year == 2007"),
    x="lifeExp",
    color="year",
    title="Comparing Life Expectancy between 1952 and 2007"
)

fig.show()

Here, we include data for both 1952 and 2007. We assign the column name year to the input parameter color so that the visualization uses different colors to differentiate the two years.

We can see that the histogram has shifted from left to right between 1952 (blue) and 2007 (red), indicating that life expectancy around the world has increased. Specifically, in 1952, no country had a life expectancy above 75 years and more than a dozen countries had a life expectancy below 35. In 2007, many countries had a life expectancy above 75 and no country had a life expectancy below 35.

We can do the same for the boxplot.

# Compare two years using Boxplot

fig = px.box(
    df.query("year == 1952 or year == 2007"),
    y="lifeExp",
    color="year",
    title="Comparison of Life Expectancy between 1952 and 2007"
)

fig.show()

If we mouse over the boxes, we see that the median life expectancy was about 45 years in 1952 and about 72 years in 2007. Over that half century, people came to live much longer.

5.5 Scatter Plot

A scatter plot visualizes the relationship between two variables. Here we plot GDP per capita against life expectancy to see if they are correlated.

# Scatter plot of life expectancy vs GDP per capita

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

The scatter plot indicates that countries with higher GDP per capita tend to have higher life expectancy, even though the relationship does not appear to be linear.

When we mouse over a point, we see the values of both the X and Y variables, namely gdpPercap and lifeExp. This does not tell us which country the point represents.

To see the country name, we can set the parameter hover_name to the column country. This adds the country name to the hover data.

# Add the country name to the hover data

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

We can have the scatter plot display the country code by setting the parameter text to the column iso_alpha.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

The plot becomes too crowded with the labels. Limiting the number of countries makes it more readable. For example, we can display only the countries from one continent.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

This plot looks better. However, the country codes overlap with the markers.

Plotly Express’s scatter function does not provide a parameter to change the position of the text, so we need to resort to Plotly’s core functions. We do this by calling the update_traces() method of the figure object and setting the textposition parameter.

In general, we start with the built-in plotting functions of Plotly Express and then use Plotly’s core functions for customizations that Plotly Express does not provide.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.update_traces(textposition='top center')
fig.show()

5.6 Line Chart

Here we want to look at just one country and see how its life expectancy has changed over time. We will choose the United States as an example. We use the column iso_alpha to filter the data.

# Line chart

fig = px.line(
    data_frame=df.query("iso_alpha == 'USA'"),   
    x="year", 
    y="lifeExp",
    title="Life Expectancy of the United States between 1952 and 2007"
)

fig.show()

There are several issues with this chart:

  • The Y axis does not start at zero, which makes the increase in life expectancy over time look more dramatic.

  • The X axis ticks do not correspond to the years in the data.

  • There are no markers to show the years in the data.

To address these issues, we can resort to the core functions of Plotly, which allow flexible customization of visualizations.

The following lines of code address these issues.

fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")
fig.update_traces(mode='markers+lines')

fig.show()

Plotly Express is simple and powerful but the simplicity comes with limitations. We will illustrate this using its line chart function.

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = px.line(df_usa_chn_ind, x="year", y="lifeExp", color="iso_alpha")

fig.update_traces(mode='markers+lines')
fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()

# Build a similar chart with Plotly's core graph_objects module, adding one trace per country

import plotly.graph_objects as go

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = go.Figure()

for iso_alpha in df_usa_chn_ind["iso_alpha"].unique():
    df_temp = df_usa_chn_ind.query(f"iso_alpha == '{iso_alpha}'")
    trace = go.Scatter(x=df_temp["year"], y=df_temp["lifeExp"], mode='markers+lines', name=iso_alpha)
    fig.add_trace(trace)

fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()

# Bar chart comparing life expectancy across the three countries

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = px.bar(
    df_usa_chn_ind,
    x="year",
    y="lifeExp",
    text="lifeExp",
    color="iso_alpha"
)

fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")

# Stack the bars; use barmode='group' to place them side by side instead
fig.update_layout(barmode='stack', xaxis_tickangle=-45)
fig.show()