A visitor to New York City asked a passerby for directions to the city’s famous classical music venue:
Visitor: Excuse me, how do I get to Carnegie Hall?
Passerby: Practice, practice, practice!
Course Information
- Summary
    - Statistical analysis, data visualization, and Python programming all in one, with hands-on practice.
- Online Class WebEx
- About the instructor
- Textbook
- Development Environment
Tutorials
Run Python on the Cloud
- Use Interactive Python Shell (Interpreter)
    - python.org
        - No registration is required
- Run Python Scripts through Command Line
    - pythonanywhere.com
        - Requires account registration
        - Requires basic Linux knowledge
- Use Jupyter Notebooks
    - Google Colab
        - Plus: Seamless integration with GitHub
    - deepnote.com
        - Plus: Access to a Linux terminal to run Python scripts
    - Kaggle.com
        - Plus: Data science community with public datasets and notebooks for learning
Four Levels of Python Code
https://m.facebook.com/story.php?story_fbid=2596321403939440&id=1815765878661667
- Syntax (most basic programming requirements)
- Idiom (for example, using .join for string concatenation)
- Design Patterns (best practices and approaches to common problems)
- Architecture (overall project structure)
Most books and courses teach levels 1 and 2 and rarely touch on levels 3 and 4.
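As a small illustration of the gap between level 1 and level 2, the sketch below builds the same comma-separated string twice. The list of words is made up for the example.

```python
words = ["statistics", "visualization", "python"]

# Level 1 (syntax): valid code that builds the string piece by piece.
sentence_loop = ""
for i, word in enumerate(words):
    if i > 0:
        sentence_loop += ", "
    sentence_loop += word

# Level 2 (idiom): str.join states the intent in a single call.
sentence_join = ", ".join(words)

print(sentence_join)
```

Both versions give the same result; the idiomatic one is shorter and avoids the bookkeeping for the separator.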
Four Paradigms of Python Programming
https://blog.newrelic.com/engineering/python-programming-styles/
- Imperative
- Procedural
- Object-oriented
- Functional
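To make the four styles concrete, here is a minimal sketch that solves one small task (summing the squares of the even numbers in a list) in each paradigm. The task, the data, and the names sum_even_squares and NumberList are illustrative assumptions, not taken from the linked article.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# Imperative: spell out each step and update state as you go.
total_imperative = 0
for n in numbers:
    if n % 2 == 0:
        total_imperative += n * n


# Procedural: package the same steps into a reusable function.
def sum_even_squares(values):
    total = 0
    for value in values:
        if value % 2 == 0:
            total += value * value
    return total


total_procedural = sum_even_squares(numbers)


# Object-oriented: keep the data and the behavior together in a class.
class NumberList:
    def __init__(self, values):
        self.values = list(values)

    def sum_even_squares(self):
        return sum(v * v for v in self.values if v % 2 == 0)


total_oo = NumberList(numbers).sum_even_squares()

# Functional: compose filter, map, and reduce without mutating anything.
total_functional = reduce(
    lambda acc, square: acc + square,
    map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)),
    0,
)

print(total_imperative, total_procedural, total_oo, total_functional)
```

All four variables hold the same value; only the way the computation is organized differs.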
Your Responsibility
- Written Code - Solely your responsibility. Make sure it is clean, correct, and commented (3C rule).
- Source Data - Primary data is your responsibility. You have no control over secondary data, so be careful in selecting and cleansing it.
- Existing Libraries - You have no control over existing libraries and algorithms, so be careful in selecting and using them.
- Interpretation of Results - Distinguish what is objective from what is subjective, and what the data exhibit from what experts know.
The one thing you have absolute control over is the code you write. Make sure you don’t write bad code (complicated, incorrect, and undocumented), so-called spaghetti code.
Wikipedia’s Definition of Spaghetti Code:
“Spaghetti code is a pejorative phrase for unstructured and difficult-to-maintain source code. Spaghetti code can be caused by several factors, such as volatile project requirements, lack of programming style rules, and insufficient ability or experience.”
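One possible reading of the 3C rule in practice is sketched below. The function average is a hypothetical example: it stays small (clean), is checked against a value worked out by hand (correct), and explains its purpose and its one non-obvious decision (commented).

```python
def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        # Fail loudly rather than return a misleading number.
        raise ValueError("average() needs at least one value")
    return sum(values) / len(values)


# Correct: a quick check against a value worked out by hand.
assert average([2, 4, 6]) == 4
```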
Jupyter Notebooks
The name Jupyter refers to the three core languages the project was originally built to support:
- Julia
- Python
- R
Julia and R are popular for statistical analysis and data science. Python is a general-purpose language that is also popular in data science and is well suited to many other kinds of development.
Dataviz Six Steps
- Define Problem and Ask Questions
- Define Data Source and Elements
- Tidy up Data (normalize “messy” data so that it is “tidy”)
- Summarize Data (summarize/tabulate, descriptive statistics)
- Visualize Data (static and interactive)
- Interpret and Communicate Results
Check out this paper for data tidying.
All six steps must be guided by domain knowledge, principles, and purposes.
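Steps 3 and 4 can be sketched on a tiny, made-up table. The example uses only the standard library so that it runs in any of the cloud environments listed earlier. In practice a library such as pandas would usually do the reshaping (for example with its melt function). The student names and quiz scores are hypothetical.

```python
from statistics import mean

# Hypothetical "messy" (wide) data: one column per quiz.
wide_rows = [
    {"student": "Ann", "quiz1": 80, "quiz2": 90},
    {"student": "Bob", "quiz1": 70, "quiz2": 70},
]

# Step 3 (tidy): one row per observation, one column per variable.
tidy_rows = [
    {"student": row["student"], "quiz": key, "score": value}
    for row in wide_rows
    for key, value in row.items()
    if key != "student"
]

# Step 4 (summarize): mean score per quiz.
scores_by_quiz = {}
for row in tidy_rows:
    scores_by_quiz.setdefault(row["quiz"], []).append(row["score"])
mean_by_quiz = {quiz: mean(scores) for quiz, scores in scores_by_quiz.items()}

print(mean_by_quiz)
```

Once the data is tidy, the summary is a simple group-and-aggregate, and the same rows can feed a chart in step 5 without further reshaping.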
Dataviz - Plots/Charts
- Univariate
    - Categorical Variable
        - Frequency table
        - Bar chart (x=Categories, y=Count)
        - Pareto Chart (sorted + accumulated %)
        - Pie Chart (Avoid it when there are too many categories)
    - Numerical Variable (discrete or continuous)
        - Histogram - frequency distribution
        - Boxplot - five-number summary statistics (centrality and dispersion)
        - Line Chart - trend over time
        - Area Chart - trend over time
    - Textual Variable/Data
        - Wordcloud
- Multivariate
    - Two categorical variables
        - Contingency table, pivot table
        - Stacked Bar Chart (bars stacked on top of each other)
        - Grouped Bar Chart (bars placed next to each other)
    - Two numerical variables (correlation)
        - 2D Scatter Plot
        - 3D Scatter Plot
        - Bubble Chart (Scatter plot with varying size of dots based on the third numerical variable)
        - Motion Chart (Scatter plot with time frame for playback)
        - Scatter Plot with varying colors and shapes of marks reflecting additional categorical variables (dimensions)
        - Line Chart with multiple lines differentiated by color
    - One numerical and one categorical variable
        - Bar chart (x=categories of the categorical variable, y=statistics of the numerical variable)
            - Statistics include mean, min, max, median, …
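Several of the charts above are drawn from a small table computed first. The sketch below, using only the standard library and made-up survey records, builds three of those tables: the sorted frequency table with cumulative percentages behind a Pareto chart, the five-number summary behind a boxplot, and the contingency table behind a stacked or grouped bar chart. The quartile method ("inclusive") is one of several conventions and is an assumption here.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical survey records: (department, status, age).
records = [
    ("Sales", "Full-time", 25),
    ("Sales", "Part-time", 31),
    ("IT", "Full-time", 40),
    ("Sales", "Full-time", 28),
    ("HR", "Full-time", 35),
    ("IT", "Part-time", 52),
]

# Univariate, categorical: frequency table sorted by count, with the
# cumulative percentage that a Pareto chart plots as a line.
freq = Counter(dept for dept, _, _ in records)
cumulative = 0
pareto = []
for dept, count in freq.most_common():
    cumulative += count
    pareto.append((dept, count, round(100 * cumulative / len(records), 1)))

# Univariate, numerical: five-number summary (min, Q1, median, Q3, max).
ages = sorted(age for _, _, age in records)
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
five_number = (ages[0], q1, q2, q3, ages[-1])

# Multivariate, two categorical variables: contingency table of counts.
contingency = Counter((dept, status) for dept, status, _ in records)

print(pareto)
print(five_number)
print(contingency[("Sales", "Full-time")])
```

Each table maps directly onto a chart: pareto supplies the bar heights and the cumulative line, five_number supplies the box and whiskers, and contingency supplies one bar segment per (department, status) pair.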
 
 
References
- learnpython.org
- Scipy Lecture Notes
- Data Analysis and Visualization with Python for Social Scientists
- Python for Social Science
- Interactive Python Tutorial
- W3Schools Python Tutorial
- Practical Data Science for Journalists and Everyone Else
- Markdown Cheatsheet
- GitHub Flavored Markdown
- AP Statistics Tutorial
- Practice Python
- Python Exercises, Practice, Solution
- 4-hour Beginner’s Python for Data Science Training Video
- Free Book: Python Data Science Handbook by Jake VanderPlas
- A Visual Intro to NumPy and Data Representation
- 10 Minutes to Pandas
- A Gentle Visual Intro to Data Analysis in Python Using Pandas
- Summarising, Aggregating, and Grouping data in Python Pandas
- Visualizing Pandas’ Pivoting and Reshaping Functions
- Data Visualization with Python
- From Data to Viz
- Data Visualization Best Practices
- Introduction to Exploratory Data Analysis in Python.ipynb
- Object-Oriented Programming in Python vs Java
- Ask, Acquire, Analyze, Apply, Announce, Assess
- A First Course in Data Science