Chapter 2 - Data Visualization Concepts

This chapter briefly introduces basic concepts relevant to data visualization without going into much detail.

2.1 Data Type: Numerical vs Categorical

Data can be classified as structured, semi-structured, or unstructured. This book deals only with structured data.

Structured data can be broadly classified into two types:

  • Numerical

    • Interval

    • Ratio

  • Categorical

    • Ordinal

    • Nominal
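As a minimal sketch of how these four types might look in practice, the pandas snippet below builds a small table with one column per type. The column names and values are hypothetical, invented purely for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame({
    "temperature_c": [21.5, 18.0, 25.3],        # numerical, interval (zero is arbitrary)
    "income_usd": [52000, 61000, 48000],        # numerical, ratio (zero means "none")
    "satisfaction": ["low", "high", "medium"],  # categorical, ordinal
    "city": ["Austin", "Boston", "Chicago"],    # categorical, nominal
})

# Ordinal: the categories carry a meaningful order.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

# Nominal: the categories are plain labels with no order.
df["city"] = df["city"].astype("category")

print(df.dtypes)

# Sorting an ordered categorical follows the declared order, not the alphabet.
print(df.sort_values("satisfaction")["satisfaction"].tolist())
```

Declaring the order explicitly is what distinguishes the ordinal column from the nominal one; without it, pandas would treat both as unordered labels.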

There is a special type of structured data that measures points in time, for example year, month, week, day, or hour. This is called the temporal data type. It can be represented as either numerical or categorical data.
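The sketch below illustrates this dual nature with pandas, assuming a few made-up dates: the same timestamps yield a numerical variable (the year) and a categorical one (the month name).

```python
import pandas as pd

# Hypothetical dates, made up for illustration only.
dates = pd.to_datetime(pd.Series(["2023-01-15", "2023-06-30", "2024-03-01"]))

# Temporal data as a numerical variable: the year as an integer.
year_numeric = dates.dt.year

# Temporal data as a categorical variable: the month as a label.
month_label = dates.dt.month_name()

print(year_numeric.tolist())
print(month_label.tolist())
```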

2.2 Data Format: Long vs Wide

Structured data are typically organized in one of two formats, depending on how a numerical measure broken down by category is represented.

In a long format, a categorical variable stores the categories and a numerical variable stores the measure. This results in more rows.

A wide format uses multiple numerical variables to represent the measure, with each variable representing a single category. This results in more columns.

Data in long format are also called tidy data. The process of transforming data from wide format to long format is called tidying up the data. Data in tidy format tend to be more conducive to data analysis and visualization.
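One common way to tidy a wide table is pandas' `melt`. The sketch below assumes a hypothetical fruit-sales table; the column names `year`, `fruit`, and `sales` are illustrative choices, not part of any dataset used in this book.

```python
import pandas as pd

# Wide format: one numerical column per category (hypothetical data).
wide = pd.DataFrame({
    "year": [2022, 2023],
    "apples": [10, 12],
    "oranges": [7, 9],
})

# Long (tidy) format: one categorical column holds the category,
# one numerical column holds the measure.
long = wide.melt(id_vars="year", var_name="fruit", value_name="sales")

print(wide.shape)  # 2 rows, 3 columns
print(long.shape)  # 4 rows, 3 columns
print(long)
```

The long table has twice as many rows because each (year, fruit) pair now occupies its own row.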

However, each format has its own advantages and disadvantages. Plotly Express works well with either format.

2.3 Visual Encoding: Marks & Channels

To bring data to life, data visualization employs marks and channels to visually encode data points.

A mark is a geometric shape that helps visualize the persona of a data point. Typical marks are:

  • Dot

  • Bar

  • Line

  • Area

  • Square

  • Triangle

A channel is a visual property of a mark that enriches the persona of a data point. Typical channels are:

  • Size

  • Location

    • X Coordinate

    • Y Coordinate

  • Color

  • Opacity

  • Text Annotation