Week 2B: Data Visualization Fundamentals¶

Sep 10, 2020

Housekeeping¶

HW #1 due today
HW #2 has been posted — due in two weeks
Lots of good questions on Piazza so far!
- Email me if you need access: https://piazza.com/upenn/fall2020/musa550

Reminder: Links to course materials and main sites (Piazza, Canvas, Github) can be found on the home page of the main course website:

https://musa-550-fall-2020.github.io/

Help with file paths and working directories¶

Eugene put up a great post on Piazza walking through some tips for managing your folder structure on your laptop:

https://piazza.com/class/ke999wuhrls2t8?cid=29

Reminder: Office Hours¶

Nick: Tuesdays 7:30am-9am and 6pm-7:30pm
Eugene: Thursday, 10:30am-12:30pm
Sign up for time slots on the Canvas calendar

Reminder: Course Schedule¶

There was no class on Tuesday — a pre-recorded lecture for 2A can be found on Canvas under "Class Recordings"
Moving forward, a pre-recorded lecture will be uploaded Sunday evenings and will replace the synchronous Zoom lecture on Tuesday

Week #2¶

Week #2 repository: https://github.com/MUSA-550-Fall-2020/week-2
Recommended readings for the week listed here
Last Tuesday
- A brief overview of data visualization
- Practical tips on color in data visualization
- The Python landscape:
  - matplotlib
  - seaborn
Today
- Adding interaction to our plots!
- Intro to the Altair package
- Lab: Reproducing a famous Wall Street Journal data visualization with Altair

Reminder: following along with lectures¶

Easiest option: Binder¶


Harder option: downloading Github repository contents¶


Now onto some Python...¶

Let's load the Palmer Penguins data set¶

We'll use the Palmer penguins data set: data collected for three species of penguins at Palmer Station in Antarctica

Artwork by @allison_horst

import pandas as pd

# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)

The altair import statement¶

import altair as alt

A visualization grammar¶

Specify what should be done
Details determined automatically
Charts are really just visualization specifications and the data to make the plot
Relies on vega and vega-lite

Important: focuses on tidy data — you'll often find yourself running pd.melt() to get to tidy format
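
As a quick illustration of what "tidy" means here, the sketch below melts a small wide table into long format. The two-row table and the column names "measurement" and "value" are made up for the example; they are not part of the lecture data.

```python
import pandas as pd

# Hypothetical wide table: one row per penguin, one column per measurement
wide = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "flipper_length_mm": [181.0, 217.0],
        "bill_length_mm": [39.1, 46.1],
    }
)

# Melt to tidy (long) format: one row per (species, measurement) pair
tidy = wide.melt(id_vars="species", var_name="measurement", value_name="value")
print(tidy)
```

Each measurement now lives in its own row, which is the shape Altair generally expects when one column is mapped to color or to a facet.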

Let's try out our flipper length vs bill length example from last lecture...

# initialize the chart with the data
chart = alt.Chart(penguins)

# define what kind of marks to use
chart = chart.mark_circle(size=60)

# encode the visual channels
chart = chart.encode(
    x="flipper_length_mm",
    y="bill_length_mm",
    color="species", 
    tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)

# make the chart interactive
chart.interactive()

Altair shortcuts¶

There are built-in objects to represent "x", "y", "color", "tooltip", etc.
Using the object syntax allows you to customize how different elements behave

Example: previous code is the same as

chart = chart.encode(
    x=alt.X("flipper_length_mm"),
    y=alt.Y("bill_length_mm"),
    color=alt.Color("species"),
    tooltip=alt.Tooltip(["species", "flipper_length_mm", "bill_length_mm", "island", "sex"]),
)

Changing Altair chart axis limits¶

By default, Altair assumes the axis will start at 0
To center on the data automatically, we need to use an alt.Scale() object to specify the scale

# initialize the chart with the data
chart = alt.Chart(penguins)

# define what kind of marks to use
chart = chart.mark_circle(size=60)

# encode the visual channels
chart = chart.encode(
    x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
    y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
    color="species",
    tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)

# make the chart interactive
chart.interactive()

Encodings¶

X: x-axis value
Y: y-axis value
Color: color of the mark
Opacity: transparency/opacity of the mark
Shape: shape of the mark
Size: size of the mark
Row: row within a grid of facet plots
Column: column within a grid of facet plots

For a complete list of these encodings, see the Encodings section of the documentation.

Altair charts can be fully specified as JSON $\rightarrow$ easy to embed in HTML on websites!

# Save the chart as a JSON string!
# (named chart_json to avoid shadowing Python's built-in json module)
chart_json = chart.to_json()

# Print out the first 1,000 characters
print(chart_json[:1000])

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 400
    }
  },
  "data": {
    "name": "data-d00e1631cca48c544438d30d2b470e8a"
  },
  "datasets": {
    "data-d00e1631cca48c544438d30d2b470e8a": [
      {
        "bill_depth_mm": 18.7,
        "bill_length_mm": 39.1,
        "body_mass_g": 3750.0,
        "flipper_length_mm": 181.0,
        "island": "Torgersen",
        "sex": "male",
        "species": "Adelie",
        "year": 2007
      },
      {
        "bill_depth_mm": 17.4,
        "bill_length_mm": 39.5,
        "body_mass_g": 3800.0,
        "flipper_length_mm": 186.0,
        "island": "Torgersen",
        "sex": "female",
        "species": "Adelie",
        "year": 2007
      },
      {
        "bill_depth_mm": 18.0,
        "bill_length_mm": 40.3,
        "body_mass_g": 3250.0,
        "flipper_length_mm": 195.0,
        "island": "Torgersen",
        "sex": "female",

Publishing the visualization online¶

chart.save("chart.html")

# Display IFrame in IPython
from IPython.display import IFrame
IFrame('chart.html', width=600, height=375)

Usually, the function calls are chained together¶

chart = (
    alt.Chart(penguins)
    .mark_circle(size=60)
    .encode(
        x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
        y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
        color="species:N",
    )
    .interactive()
)

chart

Note that the interactive() call allows users to pan and zoom.

Altair is able to automatically determine the type of the variable using built-in heuristics. Altair and Vega-Lite support four primitive data types:

Data Type	Code	Description
quantitative	Q	Numerical quantity (real-valued)
nominal	N	Name / Unordered categorical
ordinal	O	Ordered categorical
temporal	T	Date/time

You can set the data type of a column explicitly using a one-letter code attached to the column name with a colon, for example "species:N" or "flipper_length_mm:Q".

Faceting¶

Easily create multiple views of a dataset.

alt.Chart(penguins).mark_point().encode(
    x=alt.X("flipper_length_mm:Q", scale=alt.Scale(zero=False)), 
    y=alt.Y("bill_length_mm:Q", scale=alt.Scale(zero=False)),
    color="species:N"
).properties(
    width=200, height=200
).facet(column="species").interactive()

Note: I've added the variable type identifiers (Q, N) to the previous example

Lots of features to create compound charts: repeated charts, faceted charts, vertical and horizontal stacking of subplots.

See the documentation for examples

A grammar of interaction¶

A relatively new addition to altair, vega, and vega-lite. This allows you to define what happens when users interact with your visualization.

A faceted plot, now with interaction!¶

# create the selection box
brush = alt.selection_interval()


alt.Chart(penguins).mark_point().encode(
    x=alt.X(
        "flipper_length_mm", scale=alt.Scale(zero=False)
    ), # x
    y=alt.Y(
        "bill_length_mm", scale=alt.Scale(zero=False)
    ), # y
    color=alt.condition(
        brush, "species", alt.value("lightgray")
    ), # color
    tooltip=["species", "flipper_length_mm", "bill_length_mm"], 
).properties(
    width=200, height=200, selection=brush
).facet(column="species")

More on conditions¶

We used the alt.condition() function to specify a conditional color for the markers. It takes three arguments:

The brush object, which determines whether a marker falls inside the selection
If inside the brush, color the marker according to the "species" column
If outside the brush, use the literal color name "lightgray"

Selecting across multiple variables¶

Let's examine the relationship between flipper_length_mm, bill_length_mm, and body_mass_g

We'll use a repeated chart that repeats variables across rows and columns.

Use a conditional color again, based on a brush selection.

# Setup the selection brush
brush = alt.selection(type='interval', resolve='global')

# Setup the chart
alt.Chart(penguins).mark_circle().encode(
    x=alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
    y=alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
    color=alt.condition(brush, 'species:N', alt.value('lightgray')), # conditional color
).properties(
    width=200,
    height=200, 
    selection=brush
).repeat( # repeat variables across rows and columns 
    row=['flipper_length_mm', 'bill_length_mm', 'body_mass_g'],
    column=['body_mass_g', 'bill_length_mm', 'flipper_length_mm']
)

More exploratory visualization¶

Let's explore the relationship between flipper length, body mass, and sex.

Scatter flipper length vs body mass for each species, colored by sex

alt.Chart(penguins).mark_point().encode(
    x=alt.X('flipper_length_mm', scale=alt.Scale(zero=False)),
    y=alt.Y('body_mass_g', scale=alt.Scale(zero=False)),
    color=alt.Color("sex:N", scale=alt.Scale(scheme="Set2")),
).properties(
    width=400, height=150
).facet(row='species')

Note: Changing the color scheme¶

I've specified the scale keyword to the alt.Color() object and passed a scheme value:

scale=alt.Scale(scheme="Set2")

Set2 is a ColorBrewer color scheme. The available color schemes are very similar to those in matplotlib. A list is available in the Vega documentation: https://vega.github.io/vega/docs/schemes/.

Next, plot the total number of penguins per species by the island they are found on.

alt.Chart(penguins).mark_bar().encode(
    x=alt.X('*:Q', aggregate='count',  stack='normalize'),
    y='island:N',
    color='species:N',
    tooltip=['island','species', 'count(*):Q']
)

Plot a histogram of number of penguins by flipper length, grouped by species.

alt.Chart(penguins).mark_bar().encode(
    x=alt.X('flipper_length_mm', bin=alt.Bin(maxbins=20)),
    y='count():Q',
    color='species',
    tooltip=['species', alt.Tooltip('count()', title='Number of Penguins')]
).properties(
    height=250
)

Finally, let's bin the data by body mass and plot the average flipper length per bin, colored by the species.

alt.Chart(penguins.dropna()).mark_line().encode(
    x=alt.X("body_mass_g:Q", bin=alt.Bin(maxbins=10)),
    y=alt.Y('mean(flipper_length_mm):Q', scale=alt.Scale(zero=False)), # apply a mean to the flipper length in each bin
    color='species:N',
    tooltip=['mean(flipper_length_mm):Q', "count():Q"]
).properties(
    height=300,
    width=500
)

In addition to mean() and count(), you can apply a number of different transformations to the data before plotting, including binning, arbitrary functions, and filters.

See the Data Transformations section of the user guide for more details.

Dashboards become easy to make...¶

# Setup a brush selection
brush = alt.selection(type='interval')

# The top scatterplot: flipper length vs bill length
points = alt.Chart().mark_point().encode(
    x=alt.X('flipper_length_mm:Q', scale=alt.Scale(zero=False)),
    y=alt.Y('bill_length_mm:Q', scale=alt.Scale(zero=False)),
    color=alt.condition(brush, 'species:N', alt.value('lightgray'))
).properties(
    selection=brush,
    width=800
)

# the bottom bar plot
bars = alt.Chart().mark_bar().encode(
    x='count(species):Q',
    y='species:N',
    color='species:N',
).transform_filter(
    brush.ref() # the filter transform uses the selection
                # to filter the input data to this chart
).properties(
    width=800
)

chart = alt.vconcat(points, bars, data=penguins) # vertical stacking
chart

Now onto a more interesting example¶

Exercise: let's reproduce this famous Wall Street Journal visualization showing measles incidence over time.

http://graphics.wsj.com/infectious-diseases-and-vaccines/

Step 1: Load the data¶

First confirm the local path of the Jupyter notebook¶

Use two command line functions: pwd and ls
Don't forget you need a ! before the command to tell Jupyter it's a command line function

# Print out the current working directory
! pwd

/Users/nhand/Teaching/PennMUSA/Fall2020/week-2

# List the contents of the current working directory
! ls

README.md
altair-data-ca57ab90d3f95b1c80eba4532570d68b.json
altair-data-d676c5169a80f978f8cf008b5d336e79.json
altair-data-f120b4a93567957066f3d75b4b5cdc6f.json
chart.html
data
environment.yml
lecture-2A.html
lecture-2A.ipynb
lecture-2B-solutions.ipynb
lecture-2B.html
lecture-2B.ipynb
outline.md
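
The same checks can also be done from Python itself, which works identically on Windows, macOS, and Linux. This is only a sketch using the standard library's pathlib; the data/measles_incidence.csv path is the one used in the next cell and may not exist on your machine.

```python
from pathlib import Path

# Equivalent of `pwd`: the current working directory
cwd = Path.cwd()
print(cwd)

# Equivalent of `ls`: names of everything in the working directory
entries = sorted(p.name for p in cwd.iterdir())
for name in entries:
    print(name)

# Check whether the relative path resolves to a real file before loading it
data_path = cwd / "data" / "measles_incidence.csv"
print(data_path.exists())
```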

url = 'data/measles_incidence.csv' # this is a relative path 
data = pd.read_csv(url, skiprows=2, na_values='-')
data.head()

Note: data is weekly

Step 2: Calculate the total incidents in a given year per state¶

You'll want to take advantage of the groupby() then sum() workflow.

# drop week first
annual = data.drop('WEEK', axis=1)

grped = annual.groupby('YEAR')
print(grped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd471c06990>

annual = grped.sum()
annual
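
To see what the groupby() then sum() steps do on a smaller scale, here is a sketch on a made-up weekly table with two years and two states. The numbers are invented for the example.

```python
import pandas as pd

# Hypothetical weekly data: two weeks in each of two years
weekly = pd.DataFrame(
    {
        "YEAR": [1928, 1928, 1929, 1929],
        "WEEK": [1, 2, 1, 2],
        "ALABAMA": [1.0, 2.0, 0.5, 0.5],
        "ARIZONA": [4.0, 6.0, 1.0, 3.0],
    }
)

# Drop WEEK so it is not summed, then total each state's values per year
toy_annual = weekly.drop("WEEK", axis=1).groupby("YEAR").sum()
print(toy_annual)
```

The result has one row per year (YEAR becomes the index) and one column per state, which is why the next step calls reset_index() before melting.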

Step 3: Transform to tidy format¶

You can use melt() to get tidy data. You should have 3 columns: year, state, and total incidence.

measles = annual.reset_index()
measles.head()

measles = measles.melt(id_vars='YEAR', var_name='state', value_name='incidence')
measles.head(n=10)

Step 4: Make the plot¶

Take a look at this simple heatmap for an example of the syntax of Altair's heatmap functionality.
You can use the mark_rect() function to encode the values as rectangles and then color them according to the total annual measles incidence per state.

You'll want to take advantage of the custom color map defined below to best match the WSJ's graphic.

# Define a custom colormap using hex codes & HTML color names
colormap = alt.Scale(
    domain=[0, 100, 200, 300, 1000, 3000],
    range=[
        "#F0F8FF",
        "cornflowerblue",
        "mediumseagreen",
        "#FFEE00",
        "darkorange",
        "firebrick",
    ],
    type="sqrt",
)

Avoiding large data error¶

See the documentation for more information.

For data sources with more than 5,000 rows, you'll need to run the code below for Altair to work. It forces Altair to save a local copy of the data.

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

# Heatmap of YEAR vs state, colored by incidence
chart = (
   alt.Chart(measles)
   .mark_rect()
   .encode(
       x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
       y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
       color=alt.Color("incidence:Q", sort="ascending", scale=colormap),
       tooltip=["state", "YEAR", "incidence"],
   )
   .properties(width=700, height=500)
)

chart

Add the vaccination line!¶

threshold = pd.DataFrame([{"threshold": 1963}])
threshold

# Vertical line for vaccination year
threshold = pd.DataFrame([{"threshold": 1963}])
rule = alt.Chart(threshold).mark_rule(strokeWidth=4).encode(x="threshold:O")

chart + rule

Note: I've used the "+" shorthand operator for layering two charts on top of each other — see the documentation on Layered Charts for more info!

Challenges¶

Do you agree with the visualization choices made by the WSJ?
- Try experimenting with different color scales to see if you can improve the heatmap
- See the names of available color maps in Altair
Try adding a second chart above the heatmap that shows a line chart of the annual average across all 50 states.

Exploring other color maps¶

The categorical color scale choice is probably not the best. It's best to use a perceptually uniform color scale like viridis. See below:

# Heatmap of YEAR vs state, colored by incidence
chart = (
    alt.Chart(measles)
    .mark_rect()
    .encode(
        x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
        y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
        color=alt.Color(
            "incidence:Q",
            sort="ascending",
            scale=alt.Scale(scheme="viridis"),
            legend=None,
        ),
        tooltip=["state", "YEAR", "incidence"],
    )
    .properties(width=700, height=400)
)

# Vertical line for vaccination year
rule = (
    alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)

chart + rule

Add the annual average chart on top¶

# The heatmap
chart = (
    alt.Chart(measles)
    .mark_rect()
    .encode(
        x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
        y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
        color=alt.Color(
            "incidence:Q",
            sort="ascending",
            scale=alt.Scale(scheme="viridis"),
            legend=None,
        ),
        tooltip=["state", "YEAR", "incidence"],
    )
    .properties(width=700, height=400)
)

# The annual average
annual_avg = (
    alt.Chart(measles)
    .mark_line()
    .encode(
        x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
        y=alt.Y("mean(incidence):Q", axis=alt.Axis(title=None, ticks=False)),
    )
    .properties(width=700, height=200)
)

# Add the vertical line 
rule = (
    alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)

# Combine everything
alt.vconcat(annual_avg, chart + rule)

Homework #2 (required)¶

Exploratory visualization of a dataset of your choosing
Use matplotlib, seaborn, and altair
Using GitHub to submit
Due in two weeks

https://github.com/MUSA-550-Fall-2020/assignment-2

That's it¶

Geospatial analysis and visualization next week!
Pre-recorded lecture will be posted on Sunday
See you next Thursday!

Output of data.head() in Step 1 (weekly measles incidence):

	YEAR	WEEK	ALABAMA	ALASKA	ARIZONA	ARKANSAS	CALIFORNIA	COLORADO	CONNECTICUT	DELAWARE	...	SOUTH DAKOTA	TENNESSEE	TEXAS	UTAH	VERMONT	VIRGINIA	WASHINGTON	WEST VIRGINIA	WISCONSIN	WYOMING
0	1928	1	3.67	NaN	1.90	4.11	1.38	8.38	4.50	8.58	...	5.69	22.03	1.18	0.4	0.28	NaN	14.83	3.36	1.54	0.91
1	1928	2	6.25	NaN	6.40	9.91	1.80	6.02	9.00	7.30	...	6.57	16.96	0.63	NaN	0.56	NaN	17.34	4.19	0.96	NaN
2	1928	3	7.95	NaN	4.50	11.15	1.31	2.86	8.81	15.88	...	2.04	24.66	0.62	0.2	1.12	NaN	15.67	4.19	4.79	1.36
3	1928	4	12.58	NaN	1.90	13.75	1.87	13.71	10.40	4.29	...	2.19	18.86	0.37	0.2	6.70	NaN	12.77	4.66	1.64	3.64
4	1928	5	8.03	NaN	0.47	20.79	2.38	5.13	16.80	5.58	...	3.94	20.05	1.57	0.4	6.70	NaN	18.83	7.37	2.91	0.91

Output of annual in Step 2 (after groupby() and sum()):

	ALABAMA	ALASKA	ARIZONA	ARKANSAS	CALIFORNIA	COLORADO	CONNECTICUT	DELAWARE	DISTRICT OF COLUMBIA	FLORIDA	...	SOUTH DAKOTA	TENNESSEE	TEXAS	UTAH	VERMONT	VIRGINIA	WASHINGTON	WEST VIRGINIA	WISCONSIN	WYOMING
YEAR
1928	334.99	0.00	200.75	481.77	69.22	206.98	634.95	256.02	535.63	119.58	...	160.16	315.43	97.35	16.83	334.80	0.00	344.82	195.98	124.61	227.00
1929	111.93	0.00	54.88	67.22	72.80	74.24	614.82	239.82	94.20	78.01	...	167.77	33.04	71.28	68.90	105.31	0.00	248.60	380.14	1016.54	312.16
1930	157.00	0.00	466.31	53.44	760.24	1132.76	112.23	109.25	182.10	356.59	...	346.31	179.91	73.12	1044.79	236.69	0.00	631.64	157.70	748.58	341.55
1931	337.29	0.00	497.69	45.91	477.48	453.27	790.46	1003.28	832.99	260.79	...	212.36	134.79	39.56	29.72	318.40	0.00	197.43	291.38	506.57	60.69
1932	10.21	0.00	20.11	5.33	214.08	222.90	348.27	15.98	53.14	13.63	...	96.37	68.99	76.58	13.91	1146.08	53.40	631.93	599.65	935.31	242.10
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1999	0.00	0.00	0.02	0.16	0.06	0.00	0.00	0.00	0.00	0.01	...	0.00	0.00	0.01	0.10	0.00	0.25	0.00	0.00	0.00	0.00
2000	0.00	0.16	0.00	0.04	0.03	0.06	0.00	0.00	0.00	0.02	...	0.00	0.00	0.00	0.13	0.16	0.03	0.00	0.00	0.00	0.00
2001	0.00	0.00	0.02	0.00	0.11	0.00	0.00	0.00	0.00	0.00	...	0.00	0.00	0.00	0.00	0.16	0.01	0.00	0.00	0.00	0.00
2002	0.16	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.01	...	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
2003	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	...	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00

Output of measles.head() in Step 3 (after reset_index()):

	YEAR	ALABAMA	ARIZONA	ARKANSAS	CALIFORNIA	COLORADO	CONNECTICUT	DELAWARE	DISTRICT OF COLUMBIA	...	SOUTH DAKOTA	TENNESSEE	TEXAS	UTAH	VERMONT	VIRGINIA	WASHINGTON	WEST VIRGINIA	WISCONSIN	WYOMING
0	1928	334.99	200.75	481.77	69.22	206.98	634.95	256.02	535.63	...	160.16	315.43	97.35	16.83	334.80	0.0	344.82	195.98	124.61	227.00
1	1929	111.93	54.88	67.22	72.80	74.24	614.82	239.82	94.20	...	167.77	33.04	71.28	68.90	105.31	0.0	248.60	380.14	1016.54	312.16
2	1930	157.00	466.31	53.44	760.24	1132.76	112.23	109.25	182.10	...	346.31	179.91	73.12	1044.79	236.69	0.0	631.64	157.70	748.58	341.55
3	1931	337.29	497.69	45.91	477.48	453.27	790.46	1003.28	832.99	...	212.36	134.79	39.56	29.72	318.40	0.0	197.43	291.38	506.57	60.69
4	1932	10.21	20.11	5.33	214.08	222.90	348.27	15.98	53.14	...	96.37	68.99	76.58	13.91	1146.08	53.4	631.93	599.65	935.31	242.10

Output of measles.head(n=10) in Step 3 (after melt()):

	YEAR	state	incidence
0	1928	ALABAMA	334.99
1	1929	ALABAMA	111.93
2	1930	ALABAMA	157.00
3	1931	ALABAMA	337.29
4	1932	ALABAMA	10.21
5	1933	ALABAMA	65.22
6	1934	ALABAMA	590.27
7	1935	ALABAMA	265.34
8	1936	ALABAMA	20.78
9	1937	ALABAMA	22.46

Output of penguins.head(n=10):

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	male	2007
6	Adelie	Torgersen	38.9	17.8	181.0	3625.0	female	2007
7	Adelie	Torgersen	39.2	19.6	195.0	4675.0	male	2007
8	Adelie	Torgersen	34.1	18.1	193.0	3475.0	NaN	2007
9	Adelie	Torgersen	42.0	20.2	190.0	4250.0	NaN	2007