Chapter 5 - Express to Many Destinations
Plotly Express provides dozens of pre-built chart types, including histograms, bar charts, boxplots, line charts, scatter plots, geographic maps, choropleths, and many more.
All Plotly Express chart functions follow a similar, easy-to-learn pattern:
- Take a pandas.DataFrame object as the data source.
- Use the x or y parameter with a column name from the data frame to indicate the values for the X axis or Y axis.
- Use additional parameters to control other visual elements, such as the title, color theme, width, height, axis labels, and many more.
In this chapter, we will try various built-in charts in Plotly Express and appreciate its simplicity and versatility.
5.1 Prepare Data
Plotly comes with several built-in sample datasets in the form of pandas DataFrames. We will use the gapminder dataset, which contains the population, GDP per capita, and life expectancy of countries from 1952 to 2007 at five-year intervals.
# Upgrade Plotly library since Google Colab has an older version of Plotly installed.
!pip install --upgrade plotly
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in /home/codespace/.local/lib/python3.8/site-packages (5.3.1)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from plotly) (1.14.0)
Requirement already satisfied: tenacity>=6.2.0 in /home/codespace/.local/lib/python3.8/site-packages (from plotly) (8.0.1)
# To use Plotly Express, import plotly.express module
# Use px as an alias for easy reference later
import plotly.express as px
# Display the Plotly version number
import plotly
print(plotly.__version__)
5.3.1
df = px.data.gapminder()
df.head(15) # display the first 15 rows
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
5 | Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.113360 | AFG | 4 |
6 | Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.011439 | AFG | 4 |
7 | Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.395945 | AFG | 4 |
8 | Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.341395 | AFG | 4 |
9 | Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.341351 | AFG | 4 |
10 | Afghanistan | Asia | 2002 | 42.129 | 25268405 | 726.734055 | AFG | 4 |
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 | ALB | 8 |
13 | Albania | Europe | 1957 | 59.280 | 1476505 | 1942.284244 | ALB | 8 |
14 | Albania | Europe | 1962 | 64.820 | 1728137 | 2312.888958 | ALB | 8 |
# Display the metadata information about the dataset:
# Number of rows, number of columns, columns names and types, etc.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 1704 non-null object
1 continent 1704 non-null object
2 year 1704 non-null int64
3 lifeExp 1704 non-null float64
4 pop 1704 non-null int64
5 gdpPercap 1704 non-null float64
6 iso_alpha 1704 non-null object
7 iso_num 1704 non-null int64
dtypes: float64(2), int64(3), object(3)
memory usage: 106.6+ KB
# Display summary statistics, also known as descriptive statistics.
df.describe(include="all")
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
count | 1704 | 1704 | 1704.00000 | 1704.000000 | 1.704000e+03 | 1704.000000 | 1704 | 1704.000000 |
unique | 142 | 5 | NaN | NaN | NaN | NaN | 141 | NaN |
top | Afghanistan | Africa | NaN | NaN | NaN | NaN | KOR | NaN |
freq | 12 | 624 | NaN | NaN | NaN | NaN | 24 | NaN |
mean | NaN | NaN | 1979.50000 | 59.474439 | 2.960121e+07 | 7215.327081 | NaN | 425.880282 |
std | NaN | NaN | 17.26533 | 12.917107 | 1.061579e+08 | 9857.454543 | NaN | 248.305709 |
min | NaN | NaN | 1952.00000 | 23.599000 | 6.001100e+04 | 241.165876 | NaN | 4.000000 |
25% | NaN | NaN | 1965.75000 | 48.198000 | 2.793664e+06 | 1202.060309 | NaN | 208.000000 |
50% | NaN | NaN | 1979.50000 | 60.712500 | 7.023596e+06 | 3531.846989 | NaN | 410.000000 |
75% | NaN | NaN | 1993.25000 | 70.845500 | 1.958522e+07 | 9325.462346 | NaN | 638.000000 |
max | NaN | NaN | 2007.00000 | 82.603000 | 1.318683e+09 | 113523.132900 | NaN | 894.000000 |
# Find out the unique years
df["year"].unique()
array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
2007])
# Find out the countries
df["country"].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
'Korea, Rep.', 'Kuwait', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Mauritania',
'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco',
'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands',
'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Norway', 'Oman',
'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Philippines', 'Poland',
'Portugal', 'Puerto Rico', 'Reunion', 'Romania', 'Rwanda',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
'Sierra Leone', 'Singapore', 'Slovak Republic', 'Slovenia',
'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
'Tanzania', 'Thailand', 'Togo', 'Trinidad and Tobago', 'Tunisia',
'Turkey', 'Uganda', 'United Kingdom', 'United States', 'Uruguay',
'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Yemen, Rep.',
'Zambia', 'Zimbabwe'], dtype=object)
# Find out unique continents
df["continent"].unique()
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
# Only use data for United States
df_usa = df.query("iso_alpha == 'USA'")
df_usa
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
1608 | United States | Americas | 1952 | 68.440 | 157553000 | 13990.48208 | USA | 840 |
1609 | United States | Americas | 1957 | 69.490 | 171984000 | 14847.12712 | USA | 840 |
1610 | United States | Americas | 1962 | 70.210 | 186538000 | 16173.14586 | USA | 840 |
1611 | United States | Americas | 1967 | 70.760 | 198712000 | 19530.36557 | USA | 840 |
1612 | United States | Americas | 1972 | 71.340 | 209896000 | 21806.03594 | USA | 840 |
1613 | United States | Americas | 1977 | 73.380 | 220239000 | 24072.63213 | USA | 840 |
1614 | United States | Americas | 1982 | 74.650 | 232187835 | 25009.55914 | USA | 840 |
1615 | United States | Americas | 1987 | 75.020 | 242803533 | 29884.35041 | USA | 840 |
1616 | United States | Americas | 1992 | 76.090 | 256894189 | 32003.93224 | USA | 840 |
1617 | United States | Americas | 1997 | 76.810 | 272911760 | 35767.43303 | USA | 840 |
1618 | United States | Americas | 2002 | 77.310 | 287675526 | 39097.09955 | USA | 840 |
1619 | United States | Americas | 2007 | 78.242 | 301139947 | 42951.65309 | USA | 840 |
5.2 Create a Histogram
A histogram helps visualize the distribution of a numerical variable, including its centrality, dispersion, and skewness.
We will create a histogram of life expectancy to see how it is distributed globally. To do so, we pick a single year, 2007.
fig = px.histogram(df.query("year == 2007"), x="lifeExp")
fig.show()
The visualization is interactive, so we can hover over it to see more details. If we hover over the blue bar below 40, it shows that just one country has a life expectancy below 40 years.
The histogram shows that the largest group of countries has a life expectancy between 70 and 80 years, and that the distribution is skewed to the left, with a long tail toward lower values.
The code is very simple with just one line:
fig = px.histogram(df.query("year == 2007"), x="lifeExp")
Here, we provide a pandas data frame, df.query("year == 2007"), and a column in the data frame, lifeExp, for the input parameter x. To make the code easier to read, we typically break the line into several lines with indentation:
fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp"
)
There are many more input parameters available for further customization. For example, this visualization can be improved by adding a title via the parameter title.
# Break a single line into multiple lines for readability
fig = px.histogram(
    df.query("year == 2007"),
    x="lifeExp",
    title="The Global Distribution of Life Expectancy in 2007"
)
fig.show()
5.3 Create a Boxplot
According to the Plotly Express documentation (https://plotly.com/python/plotly-express/):
Plotly Express provides more than 30 functions for creating different types of figures. The API for these functions was carefully designed to be as consistent and easy to learn as possible, making it easy to switch from a scatter plot to a bar chart to a histogram to a sunburst chart throughout a data exploration session.
So, if we want to create a boxplot, we can simply replace the function name histogram with box and change the input parameter to y.
# A vertical boxplot
fig = px.box(
    df.query("year == 2007"),
    y="lifeExp",
    title="The Summary Statistics of Life Expectancy in 2007"
)
fig.show()
The parameter changes to y because the default orientation of a boxplot is vertical.
If we want a horizontal boxplot, we use the input parameter x instead of y and set the input parameter orientation to h. Here h stands for horizontal and v stands for vertical.
# A horizontal boxplot
fig = px.box(
    df.query("year == 2007"),
    x="lifeExp",
    orientation='h',
    title="The Summary Statistics of Life Expectancy in 2007"
)
fig.show()
5.4 Comparing Two Years
We can compare changes in life expectancy between 1952 and 2007 by creating a visualization with two histograms or two boxplots.
# Compare two years using Histogram
fig = px.histogram(
    df.query("year == 1952 or year == 2007"),
    x="lifeExp",
    color="year",
    title="Comparing Life Expectancy between 1952 and 2007"
)
fig.show()
Here, we include data for both 1952 and 2007. We assign the column name year to the input parameter color so that the visualization uses different colors to differentiate the two years.
We can see that the histogram shifted to the right between 1952 (blue) and 2007 (red), indicating that life expectancy around the world increased. Specifically, in 1952 no country had a life expectancy above 75 years, and more than a dozen countries had a life expectancy below 35. In 2007, many countries had a life expectancy above 75, and no country had one below 35.
We can do the same for the boxplot.
# Compare two years using Boxplot
fig = px.box(
    df.query("year == 1952 or year == 2007"),
    y="lifeExp",
    color="year",
    title="Comparison of Life Expectancy between 1952 and 2007"
)
fig.show()
If we hover over the boxes, we see that the median life expectancy was about 45 years in 1952 and about 72 years in 2007. Over that half century, people came to live much longer.
5.5 Scatter Plot
A scatter plot visualizes the relationship between two variables. Here we plot GDP per capita against life expectancy to see whether they are correlated.
# Scatter plot of life expectancy vs GDP per capita
fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    title="Life Expectancy vs GDP per Capita 2007"
)
fig.show()
The scatter plot indicates that countries with a higher GDP per capita tend to have a higher life expectancy, even though the relationship does not appear to be linear.
When we mouse over a certain point, we see the value of both the X and Y variable, namely, gdpPercap and lifeExp. This does not tell us which country the point represents.
To see the country name, we can set the parameter hover_name to the column country. This adds the country name to the hover data.
# Scatter plot with the country name shown on hover
fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)
fig.show()
We can have the scatter plot display the country code by setting the parameter text to the column iso_alpha.
# Scatter plot to show relationship between health and wealth
fig = px.scatter(
    df.query("year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)
fig.show()
The plot becomes too crowded with labels. Limiting the number of countries makes it more readable. For example, we can display only the countries from a certain continent.
# Scatter plot to show relationship between health and wealth
fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)
fig.show()
This plot looks better. However, the country code overlaps with the marker.
Plotly Express’s scatter function does not provide a parameter to change the position of the text. We need to resort to the core functions of Plotly to address this, by invoking the update_traces() method of the figure object and setting the textposition parameter.
In general, we start with Plotly Express’s built-in plotting functions and then use Plotly’s core functions to provide additional customizations that Plotly Express does not offer.
# Scatter plot to show relationship between health and wealth
fig = px.scatter(
    df.query("continent == 'Americas' and year == 2007"),
    x="gdpPercap",
    y="lifeExp",
    text="iso_alpha",
    hover_name="country",
    title="Life Expectancy vs GDP per Capita 2007"
)
fig.update_traces(textposition='top center')
fig.show()
5.6 Line Chart
Here we want to look at just one country and see how its life expectancy has changed over time. We will choose the United States as an example. We use the column iso_alpha to filter the data.
# Line chart
fig = px.line(
    data_frame=df.query("iso_alpha == 'USA'"),
    x="year",
    y="lifeExp",
    title="Life Expectancy of the United States between 1952 and 2007"
)
fig.show()
There are several issues with this chart:
- The Y axis does not start at zero, which makes the upward trend in life expectancy look more dramatic than it is.
- The X axis ticks do not correspond to the years in the data.
- There are no markers to show the years in the data.
To address these issues, we can resort to the core functions of Plotly, which allow for flexible customization of visualizations.
The following lines of code address these issues.
fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")
fig.update_traces(mode='markers+lines')
fig.show()
Plotly Express is simple and powerful but the simplicity comes with limitations. We will illustrate this using its line chart function.
df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]
fig = px.line(df_usa_chn_ind, x="year", y="lifeExp", color="iso_alpha")
fig.update_traces(mode='markers+lines')
fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()
# The same chart built with Plotly's core graph_objects module
import plotly.graph_objects as go

df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]
fig = go.Figure()
# Add one line trace per country
for iso_alpha in df_usa_chn_ind["iso_alpha"].unique():
    df_temp = df_usa_chn_ind.query(f"iso_alpha == '{iso_alpha}'")
    trace = go.Scatter(x=df_temp["year"], y=df_temp["lifeExp"], mode='markers+lines', name=iso_alpha)
    fig.add_trace(trace)
fig.update_xaxes(type='category', title="Year")
fig.update_yaxes(rangemode="tozero", title="Life Expectancy")
fig.update_layout(title="Comparison of Life Expectancy between US, China, and India")
fig.show()
# Stacked bar chart of life expectancy for the US, China, and India
df_usa_chn_ind = df[df["iso_alpha"].isin(["USA","CHN","IND"])]
fig = px.bar(
    df_usa_chn_ind,
    x="year",
    y="lifeExp",
    text="lifeExp",
    color="iso_alpha"
)
fig.update_xaxes(type='category')
fig.update_yaxes(rangemode="tozero")
# To place the bars side by side instead, use barmode='group'
fig.update_layout(barmode='stack', xaxis_tickangle=-45)
fig.show()