Sep 10, 2020
Reminder: Links to course materials and main sites (Piazza, Canvas, Github) can be found on the home page of the main course website:
Eugene put up a great post on Piazza walking through some tips for managing your folder structure on your laptop:
Recommended readings for the week listed here
Last Tuesday
We'll use the Palmer penguins data set: data collected for three species of penguins at Palmer Station in Antarctica.
Artwork by @allison_horst
import pandas as pd
# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)
import altair as alt
Important: Altair focuses on tidy data, so you'll often find yourself running pd.melt() to get to tidy format.
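As a minimal sketch of what "getting to tidy format" means, the example below melts a small wide table into one row per observation. The counts and column names are made up purely for illustration and are not taken from the penguins data.

```python
import pandas as pd

# Made-up "wide" table: one row per island, one column per year
wide = pd.DataFrame(
    {
        "island": ["Biscoe", "Dream"],
        "2007": [10, 12],
        "2008": [11, 9],
    }
)

# Melt to tidy format: one row per (island, year) observation
tidy = wide.melt(id_vars="island", var_name="year", value_name="count")
print(tidy)
```

Each variable (island, year, count) now has its own column, which is the shape Altair's encodings expect.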
Let's try out our flipper length vs bill length example from last lecture...
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x="flipper_length_mm",
y="bill_length_mm",
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart.interactive()
Example: the previous code is the same as
chart = chart.encode(
x=alt.X("flipper_length_mm"),
y=alt.Y("bill_length_mm"),
color=alt.Color("species"),
tooltip=alt.Tooltip(["species", "flipper_length_mm", "bill_length_mm", "island", "sex"]),
)
Use an alt.Scale() object to specify the scale
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart.interactive()
For a complete list of these encodings, see the Encodings section of the documentation.
Altair charts can be fully specified as JSON $\rightarrow$ easy to embed in HTML on websites!
# Save the chart as a JSON string!
json = chart.to_json()
# Print out the first 1,000 characters
print(json[:1000])
chart.save("chart.html")
# Display IFrame in IPython
from IPython.display import IFrame
IFrame('chart.html', width=600, height=375)
chart = (
alt.Chart(penguins)
.mark_circle(size=60)
.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species:N",
)
.interactive()
)
chart
Note that the interactive() call allows users to pan and zoom.
Altair is able to automatically determine the type of the variable using built-in heuristics. Altair and Vega-Lite support four primitive data types:
Data Type | Code | Description |
---|---|---|
quantitative | Q | Numerical quantity (real-valued) |
nominal | N | Name / Unordered categorical |
ordinal | O | Ordered categorical |
temporal | T | Date/time |
You can set the data type of a column explicitly using a one letter code attached to the column name with a colon:
Easily create multiple views of a dataset.
alt.Chart(penguins).mark_point().encode(
x=alt.X("flipper_length_mm:Q", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm:Q", scale=alt.Scale(zero=False)),
color="species:N"
).properties(
width=200, height=200
).facet(column="species").interactive()
Note: I've added the variable type identifiers (Q, N) to the previous example
Lots of features to create compound charts: repeated charts, faceted charts, vertical and horizontal stacking of subplots.
See the documentation for examples
A relatively new addition to Altair, Vega, and Vega-Lite. This allows you to define what happens when users interact with your visualization.
# create the selection box
brush = alt.selection_interval()
alt.Chart(penguins).mark_point().encode(
x=alt.X(
"flipper_length_mm", scale=alt.Scale(zero=False)
), # x
y=alt.Y(
"bill_length_mm", scale=alt.Scale(zero=False)
), # y
color=alt.condition(
brush, "species", alt.value("lightgray")
), # color
tooltip=["species", "flipper_length_mm", "bill_length_mm"],
).properties(
width=200, height=200, selection=brush
).facet(column="species")
We used the alt.condition() function to specify a conditional color for the markers. It takes three arguments:
- The brush object, which determines whether a data point falls inside the selection
- If a point is inside the brush, color the marker according to the "species" column
- If a point is outside the brush, use the literal color "lightgray"
Let's examine the relationship between flipper_length_mm, bill_length_mm, and body_mass_g.
We'll use a repeated chart that repeats variables across rows and columns.
Use a conditional color again, based on a brush selection.
# Setup the selection brush
brush = alt.selection(type='interval', resolve='global')
# Setup the chart
alt.Chart(penguins).mark_circle().encode(
x=alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
y=alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray')), # conditional color
).properties(
width=200,
height=200,
selection=brush
).repeat( # repeat variables across rows and columns
row=['flipper_length_mm', 'bill_length_mm', 'body_mass_g'],
column=['body_mass_g', 'bill_length_mm', 'flipper_length_mm']
)
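The conditional color used with the brush in the charts above can be read as a simple row-wise rule. The sketch below mimics that rule in plain pandas as a rough analogy only: Altair evaluates the condition in the browser as the user drags the brush, whereas here the brush extent is a hypothetical, hard-coded rectangle and the data points are made up.

```python
import pandas as pd

# Made-up points
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "flipper_length_mm": [181.0, 217.0, 195.0],
        "bill_length_mm": [39.1, 46.1, 49.0],
    }
)

# Hypothetical brush extent: the rectangle a user might drag out
x_range = (190.0, 220.0)
y_range = (45.0, 50.0)

# True for points that fall inside the rectangle on both axes
inside = df["flipper_length_mm"].between(*x_range) & df[
    "bill_length_mm"
].between(*y_range)

# Inside the brush: keep the species; outside: the literal value "lightgray"
df["color_by"] = df["species"].where(inside, "lightgray")
print(df)
```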
Let's explore the relationship between flipper length, body mass, and sex.
Scatter flipper length vs body mass for each species, colored by sex
alt.Chart(penguins).mark_point().encode(
x=alt.X('flipper_length_mm', scale=alt.Scale(zero=False)),
y=alt.Y('body_mass_g', scale=alt.Scale(zero=False)),
color=alt.Color("sex:N", scale=alt.Scale(scheme="Set2")),
).properties(
width=400, height=150
).facet(row='species')
I've specified the scale keyword to the alt.Color() object and passed a scheme value:
scale=alt.Scale(scheme="Set2")
Set2 is a ColorBrewer color scheme. The available color schemes are very similar to those in matplotlib. A list is available in the Vega documentation: https://vega.github.io/vega/docs/schemes/.
Next, plot the number of penguins per species by the island they are found on, normalized so that the bar for each island spans the full width.
alt.Chart(penguins).mark_bar().encode(
x=alt.X('*:Q', aggregate='count', stack='normalize'),
y='island:N',
color='species:N',
tooltip=['island','species', 'count(*):Q']
)
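The count aggregate combined with stack='normalize' corresponds roughly to a normalized cross-tabulation. Here is a pandas sketch of the same idea, using a handful of made-up rows rather than the real data:

```python
import pandas as pd

# A few made-up rows with the same two columns used in the chart
df = pd.DataFrame(
    {
        "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
        "species": ["Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
    }
)

# Raw counts of each species on each island
counts = pd.crosstab(df["island"], df["species"])

# Shares within each island: every row sums to 1, like a normalized stacked bar
shares = pd.crosstab(df["island"], df["species"], normalize="index")

print(counts)
print(shares)
```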
Plot a histogram of number of penguins by flipper length, grouped by species.
alt.Chart(penguins).mark_bar().encode(
x=alt.X('flipper_length_mm', bin=alt.Bin(maxbins=20)),
y='count():Q',
color='species',
tooltip=['species', alt.Tooltip('count()', title='Number of Penguins')]
).properties(
height=250
)
Finally, let's bin the data by body mass and plot the average flipper length per bin, colored by the species.
alt.Chart(penguins.dropna()).mark_line().encode(
x=alt.X("body_mass_g:Q", bin=alt.Bin(maxbins=10)),
y=alt.Y('mean(flipper_length_mm):Q', scale=alt.Scale(zero=False)), # apply a mean to the flipper length in each bin
color='species:N',
tooltip=['mean(flipper_length_mm):Q', "count():Q"]
).properties(
height=300,
width=500
)
In addition to mean() and count(), you can apply a number of different transformations to the data before plotting, including binning, arbitrary functions, and filters.
See the Data Transformations section of the user guide for more details.
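To make the bin-then-aggregate step concrete, here is a rough pandas equivalent of binning by body mass and taking the mean flipper length per bin. The rows are made up, and the bin edges are hand-picked for illustration; Altair chooses its own edges based on maxbins.

```python
import pandas as pd

# Made-up measurements
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
        "body_mass_g": [3200.0, 3800.0, 3900.0, 4600.0, 5200.0, 5400.0],
        "flipper_length_mm": [180.0, 190.0, 192.0, 210.0, 220.0, 224.0],
    }
)

# Hand-picked 1000 g bins (an assumption, not the edges Altair would pick)
edges = [3000, 4000, 5000, 6000]
df["mass_bin"] = pd.cut(df["body_mass_g"], bins=edges)

# Mean flipper length and number of penguins per (species, bin) group
summary = (
    df.groupby(["species", "mass_bin"], observed=True)["flipper_length_mm"]
    .agg(["mean", "count"])
    .reset_index()
)
print(summary)
```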
# Setup a brush selection
brush = alt.selection(type='interval')
# The top scatterplot: flipper length vs bill length
points = alt.Chart().mark_point().encode(
x=alt.X('flipper_length_mm:Q', scale=alt.Scale(zero=False)),
y=alt.Y('bill_length_mm:Q', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray'))
).properties(
selection=brush,
width=800
)
# the bottom bar plot
bars = alt.Chart().mark_bar().encode(
x='count(species):Q',
y='species:N',
color='species:N',
).transform_filter(
brush.ref() # the filter transform uses the selection
# to filter the input data to this chart
).properties(
width=800
)
chart = alt.vconcat(points, bars, data=penguins) # vertical stacking
chart
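Conceptually, the filter transform keeps only the rows inside the brushed region before the bar chart counts them. Here is a plain-pandas sketch of that step, assuming a hypothetical fixed brush extent and made-up points:

```python
import pandas as pd

# Made-up points
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "flipper_length_mm": [181.0, 195.0, 217.0, 210.0, 196.0],
        "bill_length_mm": [39.1, 40.5, 46.1, 45.0, 49.0],
    }
)

# Hypothetical brush extent on the scatterplot
x_range = (190.0, 220.0)
y_range = (40.0, 47.0)

# Keep only the rows that fall inside the brush
selected = df[
    df["flipper_length_mm"].between(*x_range)
    & df["bill_length_mm"].between(*y_range)
]

# The bar chart then counts the remaining rows per species
bar_counts = selected["species"].value_counts().sort_index()
print(bar_counts)
```

Species with no selected points drop out of the result entirely, which is why bars can disappear as the brush moves.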
Exercise: let's reproduce this famous Wall Street Journal visualization showing measles incidence over time.
pwd and ls
# Print out the current working directory
! pwd
# List all of the files in the current working directory
! ls
url = 'data/measles_incidence.csv' # this is a relative path
data = pd.read_csv(url, skiprows=2, na_values='-')
data.head()
Note: data is weekly
You'll want to take advantage of the groupby() then sum() workflow.
# drop week first
annual = data.drop('WEEK', axis=1)
grped = annual.groupby('YEAR')
print(grped)
annual = grped.sum()
annual
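The same drop, groupby, and sum steps are easier to follow on a few rows. The sketch below uses made-up weekly values with hypothetical state columns, and it also shows that entries read in as missing through na_values='-' are skipped by sum().

```python
import io

import pandas as pd

# Made-up weekly rows: YEAR, WEEK, then one (hypothetical) column per state
csv_text = """YEAR,WEEK,ALABAMA,ALASKA
1928,1,3.0,-
1928,2,2.5,1.0
1929,1,1.0,-
1929,2,-,0.5
"""

# "-" marks a missing value, as in the exercise data
weekly = pd.read_csv(io.StringIO(csv_text), na_values="-")

# Drop the week column, then sum the weekly values within each year
annual_totals = weekly.drop("WEEK", axis=1).groupby("YEAR").sum()
print(annual_totals)
```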
You can use melt() to get tidy data. You should have 3 columns: year, state, and total incidence.
measles = annual.reset_index()
measles.head()
measles = measles.melt(id_vars='YEAR', var_name='state', value_name='incidence')
measles.head(n=10)
Use the mark_rect() function to encode the values as rectangles and then color them according to the average annual measles incidence per state. You'll want to take advantage of the custom color map defined below to best match the WSJ's graphic.
# Define a custom colormap using hex codes & HTML color names
colormap = alt.Scale(
domain=[0, 100, 200, 300, 1000, 3000],
range=[
"#F0F8FF",
"cornflowerblue",
"mediumseagreen",
"#FFEE00",
"darkorange",
"firebrick",
],
type="sqrt",
)
See the documentation for more information.
For data sources with more than 5,000 rows, you'll need to run the code below for Altair to work. It forces Altair to save a local copy of the data.
alt.data_transformers.enable('json')
# Heatmap of YEAR vs state, colored by incidence
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color("incidence:Q", sort="ascending", scale=colormap),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=500)
)
chart
threshold = pd.DataFrame([{"threshold": 1963}])
threshold
# Vertical line for vaccination year
threshold = pd.DataFrame([{"threshold": 1963}])
rule = alt.Chart(threshold).mark_rule(strokeWidth=4).encode(x="threshold:O")
chart + rule
Note: I've used the "+" shorthand operator for layering two charts on top of each other — see the documentation on Layered Charts for more info!
The categorical color scale choice is probably not the best. It's best to use a perceptually uniform color scale like viridis. See below:
# Heatmap of YEAR vs state, colored by incidence
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color(
"incidence:Q",
sort="ascending",
scale=alt.Scale(scheme="viridis"),
legend=None,
),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=400)
)
# Vertical line for vaccination year
rule = (
alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)
chart + rule
# The heatmap
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color(
"incidence:Q",
sort="ascending",
scale=alt.Scale(scheme="viridis"),
legend=None,
),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=400)
)
# The annual average
annual_avg = (
alt.Chart(measles)
.mark_line()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("mean(incidence):Q", axis=alt.Axis(title=None, ticks=False)),
)
.properties(width=700, height=200)
)
# Add the vertical line
rule = (
alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)
# Combine everything
alt.vconcat(annual_avg, chart + rule)