Chapter 5 - Express to Many Destinations

Plotly Express provides dozens of pre-built chart types, including histogram, bar chart, boxplot, line chart, scatter plot, geographic map, choropleth, and many more.

All Plotly Express chart functions follow a similar, easy-to-learn pattern:

Take a pandas DataFrame object as the data source.
Use the x or y parameter, set to a column name in the data frame, to indicate the data values for the X axis or Y axis.
Use additional parameters to control other visualization elements, including title, color theme, width, height, labels for X and Y, and many more.

In this chapter, we will try various built-in charts in Plotly Express and appreciate its simplicity and versatility.

5.1 Prepare Data

Plotly comes with several built-in sample datasets in the form of pandas DataFrames. We will use the gapminder dataset, which contains the population, GDP per capita, and life expectancy of countries from 1952 to 2007 at five-year intervals.

# Upgrade Plotly library since Google Colab has an older version of Plotly installed.

!pip install --upgrade plotly

Defaulting to user installation because normal site-packages is not writeable

Requirement already satisfied: plotly in /home/codespace/.local/lib/python3.8/site-packages (5.3.1)

Requirement already satisfied: six in /usr/lib/python3/dist-packages (from plotly) (1.14.0)
Requirement already satisfied: tenacity>=6.2.0 in /home/codespace/.local/lib/python3.8/site-packages (from plotly) (8.0.1)

# To use Plotly Express, import plotly.express module 
# Use px as an alias for easy reference later

import plotly.express as px

# Display the Plotly version number

import plotly

print(plotly.__version__)

5.3.1

df = px.data.gapminder()

df.head(15)       # display the first 15 rows

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
0	Afghanistan	Asia	1952	28.801	8425333	779.445314	AFG	4
1	Afghanistan	Asia	1957	30.332	9240934	820.853030	AFG	4
2	Afghanistan	Asia	1962	31.997	10267083	853.100710	AFG	4
3	Afghanistan	Asia	1967	34.020	11537966	836.197138	AFG	4
4	Afghanistan	Asia	1972	36.088	13079460	739.981106	AFG	4
5	Afghanistan	Asia	1977	38.438	14880372	786.113360	AFG	4
6	Afghanistan	Asia	1982	39.854	12881816	978.011439	AFG	4
7	Afghanistan	Asia	1987	40.822	13867957	852.395945	AFG	4
8	Afghanistan	Asia	1992	41.674	16317921	649.341395	AFG	4
9	Afghanistan	Asia	1997	41.763	22227415	635.341351	AFG	4
10	Afghanistan	Asia	2002	42.129	25268405	726.734055	AFG	4
11	Afghanistan	Asia	2007	43.828	31889923	974.580338	AFG	4
12	Albania	Europe	1952	55.230	1282697	1601.056136	ALB	8
13	Albania	Europe	1957	59.280	1476505	1942.284244	ALB	8
14	Albania	Europe	1962	64.820	1728137	2312.888958	ALB	8

# Display the metadata information about the dataset:
# Number of rows, number of columns, columns names and types, etc.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
 6   iso_alpha  1704 non-null   object 
 7   iso_num    1704 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 106.6+ KB

# Display summary statistics, also known as descriptive statistics.

df.describe(include="all")

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
count	1704	1704	1704.00000	1704.000000	1.704000e+03	1704.000000	1704	1704.000000
unique	142	5	NaN	NaN	NaN	NaN	141	NaN
top	Afghanistan	Africa	NaN	NaN	NaN	NaN	KOR	NaN
freq	12	624	NaN	NaN	NaN	NaN	24	NaN
mean	NaN	NaN	1979.50000	59.474439	2.960121e+07	7215.327081	NaN	425.880282
std	NaN	NaN	17.26533	12.917107	1.061579e+08	9857.454543	NaN	248.305709
min	NaN	NaN	1952.00000	23.599000	6.001100e+04	241.165876	NaN	4.000000
25%	NaN	NaN	1965.75000	48.198000	2.793664e+06	1202.060309	NaN	208.000000
50%	NaN	NaN	1979.50000	60.712500	7.023596e+06	3531.846989	NaN	410.000000
75%	NaN	NaN	1993.25000	70.845500	1.958522e+07	9325.462346	NaN	638.000000
max	NaN	NaN	2007.00000	82.603000	1.318683e+09	113523.132900	NaN	894.000000

# Find out the unique years

df["year"].unique()

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])

# Find out the countries

df["country"].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
       'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
       'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
       'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
       'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
       'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
       'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
       'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
       'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
       'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
       'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.',
       'Zambia', 'Zimbabwe'], dtype=object)

# Find out unique continents

df["continent"].unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)

# Only use data for United States

df_usa = df.query("iso_alpha == 'USA'")
df_usa

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
1608	United States	Americas	1952	68.440	157553000	13990.48208	USA	840
1609	United States	Americas	1957	69.490	171984000	14847.12712	USA	840
1610	United States	Americas	1962	70.210	186538000	16173.14586	USA	840
1611	United States	Americas	1967	70.760	198712000	19530.36557	USA	840
1612	United States	Americas	1972	71.340	209896000	21806.03594	USA	840
1613	United States	Americas	1977	73.380	220239000	24072.63213	USA	840
1614	United States	Americas	1982	74.650	232187835	25009.55914	USA	840
1615	United States	Americas	1987	75.020	242803533	29884.35041	USA	840
1616	United States	Americas	1992	76.090	256894189	32003.93224	USA	840
1617	United States	Americas	1997	76.810	272911760	35767.43303	USA	840
1618	United States	Americas	2002	77.310	287675526	39097.09955	USA	840
1619	United States	Americas	2007	78.242	301139947	42951.65309	USA	840

5.2 Create a Histogram

A histogram helps visualize the distribution of a numerical variable, including its centrality, dispersion, and skewness.

We will create a histogram of life expectancy to see how it is distributed globally. To do so, we pick a single year, 2007.

fig = px.histogram(df.query("year == 2007"), x="lifeExp")

fig.show()

The visualization is interactive, so we can hover over it to see more details. If we hover over the bar at the far left, the tooltip shows that just one country has a life expectancy below 40 years.

The histogram shows that most countries have a life expectancy between 70 and 80 years and that the distribution is skewed to the left, with a long tail toward lower values.

The code is very simple with just one line:

fig = px.histogram(df.query("year == 2007"), x="lifeExp")

Here, we provide a pandas DataFrame, df.query("year == 2007"), and pass the column name lifeExp to the input parameter x. To make the code easier to read, we typically break the call into several indented lines:

fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp"
)

There are many more input parameters available for further customization. For example, this visualization can be improved by adding a title via the parameter title.

# Break a single line into multiple lines for readability

fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp",
    title="The Global Distribution of Life Expectancy in 2007"
)

fig.show()

5.3 Create a Boxplot

According to the Plotly Express documentation (https://plotly.com/python/plotly-express/):

Plotly Express provides more than 30 functions for creating different types of figures. The API for these functions was carefully designed to be as consistent and easy to learn as possible, making it easy to switch from a scatter plot to a bar chart to a histogram to a sunburst chart throughout a data exploration session.

So, if we want to create a boxplot, we can simply replace the function name histogram with box and change the input parameter to y.

# A vertical boxplot

fig = px.box(
    df.query("year == 2007"),
    y="lifeExp",
    title="The Summary Statistics of Life Expectancy in 2007"
)

fig.show()

We change the parameter to y because the default orientation for a boxplot is vertical.

If we want a horizontal boxplot, we use the input parameter x instead of y and set the input parameter orientation to h. Here h stands for horizontal and v stands for vertical.

# A horizontal boxplot

fig = px.box(
    df.query("year == 2007"),
    x="lifeExp",
    orientation='h',
    title="The Summary Statistics of Life Expectancy in 2007"
)

fig.show()

5.4 Comparing Two Years

We can compare changes in life expectancy between 1952 and 2007 by creating a visualization with two histograms or two boxplots.

# Compare two years using Histogram

fig = px.histogram(
    df.query("year == 1952 or year == 2007"),
    x="lifeExp",
    color="year",
    title="Comparing Life Expectancy between 1952 and 2007"
)

fig.show()

Here, we include data for both 1952 and 2007. We assign the column name year to the input parameter color so that the visualization uses a different color for each year.

We can see that the histogram has shifted to the right between 1952 (blue) and 2007 (red), indicating that life expectancy around the world has increased. Specifically, in 1952, no country had a life expectancy above 75 years, and more than a dozen countries had a life expectancy below 35. By 2007, many countries had a life expectancy above 75, and none was below 35.

We can do the same for the boxplot.

# Compare two years using Boxplot

fig = px.box(
    df.query("year == 1952 or year == 2007"),
    y="lifeExp",
    color="year",
    title="Comparison of Life Expectancy between 1952 and 2007"
)

fig.show()

If we hover over the boxes, we see that the median life expectancy was about 45 years in 1952 and about 72 years in 2007. Over roughly half a century, median life expectancy rose by about 27 years.

5.5 Scatter Plot

A scatter plot visualizes the relationship between two variables. Here we plot GDP per capita against life expectancy to see whether they are correlated.

# Scatter plot of life expectancy vs GDP per capita

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

The scatter plot indicates that countries with a higher GDP per capita tend to have a higher life expectancy, even though the relationship does not appear to be linear.

When we hover over a point, we see the values of the X and Y variables, namely gdpPercap and lifeExp. This does not tell us which country the point represents.

To see the country name, we can set the parameter hover_name to the column country. This adds the country name to the hover data.

# Scatter plot with the country name in the hover data

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

We can have the scatter plot display the country code by setting the parameter text to the column iso_alpha.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

The plot becomes too crowded with the labels. Limiting the number of countries makes it more readable. For example, we can display only the countries from one continent.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.show()

This plot looks better. However, the country codes overlap with the markers.

Plotly Express’s scatter function does not provide a parameter to change the position of the text, so we resort to the core functions of Plotly. We call the update_traces() method of the figure object and set the textposition parameter.

In general, we start with Plotly Express’s built-in plotting functions and then use Plotly’s core functions for customizations that Plotly Express does not provide.

# Scatter plot to show relationship between health and wealth

fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)

fig.update_traces(textposition='top center')
fig.show()

5.6 Line Chart

Here we want to look at just one country and see how its life expectancy has changed over time. We will choose the United States as an example. We use the column iso_alpha to filter the data.

# Line chart

fig = px.line(
    data_frame=df.query("iso_alpha == 'USA'"),   
    x="year", 
    y="lifeExp",
    title="Life Expectancy of the United States between 1952 and 2007"
)

fig.show()

There are several issues with this chart:

The Y axis does not start at zero, which makes the upward trend in life expectancy look more dramatic than it is.
The X axis ticks do not correspond to the years in the data.
There are no markers to show the years in the data.

To address these issues, we can resort to the core functions of Plotly, which allow flexible customization of visualizations.

The following lines of code address these issues.

fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")
fig.update_traces(mode='markers+lines')

fig.show()

Plotly Express is simple and powerful but the simplicity comes with limitations. We will illustrate this using its line chart function.

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = px.line(df_usa_chn_ind, x="year", y="lifeExp", color="iso_alpha")

fig.update_traces(mode='markers+lines')
fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()

# Build the same chart with Plotly's core graph_objects module,
# adding one trace per country.

import plotly.graph_objects as go

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = go.Figure()

for iso_alpha in df_usa_chn_ind["iso_alpha"].unique():
    df_temp = df_usa_chn_ind.query(f"iso_alpha == '{iso_alpha}'")
    trace = go.Scatter(x=df_temp["year"], y=df_temp["lifeExp"],  mode='markers+lines', name=iso_alpha)
    fig.add_trace(trace)

fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]

fig = px.bar(
    df_usa_chn_ind, 
    x="year", 
    y="lifeExp", 
    text="lifeExp",
    color="iso_alpha"
)

fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")
#fig.update_traces(mode='markers+lines')
# Change the bar mode
# fig.update_layout(barmode='group')
fig.update_layout(barmode='stack', xaxis_tickangle=-45)
fig.show()

Data Visualization with Plotly Express