Sep 8, 2020
Questions / concerns?
Guides to installing Python, using conda for managing packages, and working with Jupyter notebook on course website:
Starting with two of my favorite historical examples, and their modern renditions...
See http://projects.flowingdata.com/atlas, by Nathan Yau
ggplot2 provides an R implementation of The Grammar of Graphics
Who knows how many climate disasters could have been avoided with a tan background and ten minutes of color theory. - Elijah Meeks
See, e.g. Data Sketches
by Simon Scarr in 2011
The same data, but different design choices...
Some recent examples...
Lots of companies, cities, institutions, etc. have developed design guidelines to improve and standardize their data visualizations.
One I particularly like: City of London Data Design Guidelines
First few pages are listed in the "Recommended Reading" portion of this week's README.
London's style guide covers some basic data viz principles that everyone should know, including the following example:
Choose your colors carefully:
matplotlib
matplotlib
and available by default
For quantitative data, these color maps are very strong options
Almost too many tools available...
So many tools...so little time
You'll use different packages to achieve different goals, and they each have different things they are good at.
Today, we'll focus on:
And next week for geospatial data:
Goal: introduce you to the most common tools and enable you to know the best package for the job in the future
We'll use the object-oriented interface to matplotlib
Figure and Axes objects
Axes object
Figure or Axes objects
To make matplotlib figures show up in notebooks, you'll need to include the following line in your notebooks:
%matplotlib inline
This is a magic function that initializes matplotlib properly in the notebook
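A minimal sketch of the object-oriented pattern mentioned above: create a Figure and an Axes, draw on the Axes, then format through the same Axes. The sine-wave data and the labels are made up for illustration and are not part of the lecture material.

```python
import numpy as np
from matplotlib import pyplot as plt

# Illustrative data (an assumption, not from the lecture): one period of a sine wave
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)

# Create the Figure (the whole canvas) and a single Axes (one plot area)
fig, ax = plt.subplots(figsize=(6, 4))

# Draw on the Axes object instead of calling plt.plot()
(line,) = ax.plot(x, y, label="sin(x)")

# Format through the same Axes object
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend(loc="best")
```

Keeping explicit references to `fig` and `ax` makes it easy to pass the same Axes to other plotting calls later on.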
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# generate some random data using numpy (numbers between -1 and 1)
# shape is (100, 100)
data = 2 * np.random.random(size=(100,100)) - 1
print(data.min(), data.max(), data.mean())
plt.pcolormesh(data, cmap='viridis')
plt.pcolormesh(data, cmap='jet')
plt.pcolormesh(data, cmap='coolwarm')
Important bookmark: Choosing Color Maps in Matplotlib
# print out the number of available color map names
print(len(plt.colormaps()))
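Since the random data above is centered on zero, one reasonable option is a diverging color map with symmetric color limits, so that zero lands on the neutral midpoint. The sketch below assumes that goal; the seeded generator and the `vmin`/`vmax` values are choices made for this example.

```python
import numpy as np
from matplotlib import pyplot as plt

# Seeded generator so the example is reproducible (an assumption for this sketch)
rng = np.random.default_rng(seed=0)
data = 2 * rng.random(size=(100, 100)) - 1  # values in [-1, 1)

fig, ax = plt.subplots(figsize=(6, 5))

# Symmetric limits put zero at the neutral center of the diverging color map
mesh = ax.pcolormesh(data, cmap="coolwarm", vmin=-1, vmax=1)

# A color bar makes the mapping from color to value explicit
fig.colorbar(mesh, ax=ax, label="value")
```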
We'll use the Palmer penguins data set, data collected for three species of penguins at Palmer Station in Antarctica
Artwork by @allison_horst
import pandas as pd
# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)
Data is already in tidy format
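For contrast, here is a small sketch of data that is not tidy and how it could be reshaped with `melt()`. The table and its numbers are hypothetical and only illustrate the idea that tidy data has one row per observation and one column per variable.

```python
import pandas as pd

# Hypothetical wide table: species names are spread across the column headers
wide = pd.DataFrame(
    {
        "year": [2007, 2008],
        "Adelie": [50, 50],  # made-up counts
        "Gentoo": [34, 46],  # made-up counts
    }
)

# Reshape so that each row is one (year, species) observation
tidy = wide.melt(id_vars="year", var_name="species", value_name="count")
print(tidy)
```

In the tidy version, `species` is an ordinary column, which is what makes operations like `groupby("species")` or a `hue="species"` argument straightforward.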
I want to scatter flipper length vs. bill length, colored by the penguin species
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Color for each species
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
# Group the data frame by species and loop over each group
# NOTE: "group" will be the dataframe holding the data for "species"
for species, group in penguins.groupby("species"):
    print(f"Plotting {species}...")
    # Plot flipper length vs bill length for this group
    ax.scatter(
        group["flipper_length_mm"],
        group["bill_length_mm"],
        marker="o",
        label=species,
        color=color_map[species],
        alpha=0.75,
    )
# Format the axes
ax.legend(loc="best")
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
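To see what the `groupby()` loop above yields on each iteration, here is a small sketch on a made-up data frame (the values are illustrative, not read from the penguins file). Each iteration gives the group key and a DataFrame containing only that group's rows.

```python
import pandas as pd

# Tiny made-up data frame standing in for the penguins data
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
        "bill_length_mm": [39.1, 46.1, 38.6, 49.0],
    }
)

sizes = {}
for species, group in df.groupby("species"):
    # "species" is the group key; "group" is a DataFrame with only that species
    sizes[species] = len(group)
    print(species, type(group).__name__, group.shape)
```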
pandas?
# Tab complete on the plot attribute of a dataframe to see the available functions
#penguins.plot.scatter?
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Calculate a list of colors
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
colors = [color_map[species] for species in penguins["species"]]
# Scatter plot two columns, colored by third
penguins.plot.scatter(
    x="flipper_length_mm",
    y="bill_length_mm",
    c=colors,
    alpha=0.75,
    ax=ax,  # Plot on the axes object we created already!
)
# Format
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
Note: no easy way to get legend added to the plot in this case...
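One possible workaround, sketched here under the assumption that a hand-built legend is acceptable, is to create one "proxy" marker per species and pass those to `ax.legend()`. The small data frame is made up so the example runs on its own.

```python
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D

color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}

# Made-up rows standing in for the penguins data
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "flipper_length_mm": [181.0, 211.0, 192.0],
        "bill_length_mm": [39.1, 46.1, 46.5],
    }
)
colors = [color_map[species] for species in df["species"]]

fig, ax = plt.subplots(figsize=(6, 4))
df.plot.scatter(x="flipper_length_mm", y="bill_length_mm", c=colors, ax=ax)

# Proxy artists: empty markers that only exist to populate the legend
handles = [
    Line2D([], [], marker="o", linestyle="none", color=color, label=species)
    for species, color in color_map.items()
]
legend = ax.legend(handles=handles, loc="best")
```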
pandas plotting capabilities are good for quick and unpolished plots during the data exploration phase
import seaborn as sns
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# style keywords as dict
style = dict(palette=color_map, s=60, edgecolor="none", alpha=0.75)
# use the scatterplot() function
sns.scatterplot(
    x="flipper_length_mm",  # the x column
    y="bill_length_mm",  # the y column
    hue="species",  # the third dimension (color)
    data=penguins,  # pass in the data
    ax=ax,  # plot on the axes object we made
    **style  # add our style keywords
)
# Format with matplotlib commands
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
ax.legend(loc='best')
The ** syntax is the unpacking operator. It will unpack the dictionary and pass each keyword to the function.
So the previous code is the same as:
sns.scatterplot(
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species",
    data=penguins,
    ax=ax,
    palette=color_map,  # defined in the style dict
    s=60,  # defined in the style dict
    edgecolor="none",  # defined in the style dict
    alpha=0.75,  # defined in the style dict
)
But we can use **style as a shortcut!
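The same unpacking behavior can be checked without any plotting library. In this sketch, `describe()` is a hypothetical function written only for the example; it shows that passing `**style` is equivalent to writing out each keyword.

```python
def describe(x, y, color="black", alpha=1.0):
    """Hypothetical stand-in for a plotting function that accepts style keywords."""
    return f"x={x}, y={y}, color={color}, alpha={alpha}"


style = dict(color="red", alpha=0.75)

# Unpack the dictionary into keyword arguments
unpacked = describe(1, 2, **style)

# Write the same keywords out by hand
explicit = describe(1, 2, color="red", alpha=0.75)

print(unpacked == explicit)
```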
In general, seaborn is fantastic for visualizing relationships between variables in a more quantitative way
Don't memorize every function...
I always look at the beautiful Example Gallery for ideas.
How about adding linear regression lines?
Use lmplot()
sns.lmplot(
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species",
    data=penguins,
    height=6,
    aspect=1.5,
    palette=color_map,
    scatter_kws=dict(edgecolor="none", alpha=0.5),
);
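To get the fitted numbers rather than just the drawn lines, a straight-line fit per group can be computed by hand. The sketch below uses `numpy.polyfit` as a stand-in and is not a description of how seaborn computes its regression internally; the data is made up so the expected slopes are easy to check.

```python
import numpy as np
import pandas as pd

# Made-up data: group "A" follows y = 2x + 1, group "B" follows y = -x + 1
df = pd.DataFrame(
    {
        "species": ["A", "A", "A", "B", "B", "B"],
        "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "y": [3.0, 5.0, 7.0, 0.0, -1.0, -2.0],
    }
)

fits = {}
for species, group in df.groupby("species"):
    # Degree-1 polynomial fit returns (slope, intercept)
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    fits[species] = (slope, intercept)
    print(f"{species}: slope={slope:.2f}, intercept={intercept:.2f}")
```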
Use jointplot()
sns.jointplot(
    x="flipper_length_mm",
    y="bill_length_mm",
    data=penguins,
    height=8,
    kind="kde",
    cmap="viridis",
);
Use pairplot()
# The variables to plot
variables = [
    "species",
    "bill_length_mm",
    "flipper_length_mm",
    "body_mass_g",
    "bill_depth_mm",
]
# Set the seaborn plotting context (scales fonts and other elements)
sns.set_context("notebook", font_scale=1.5)
# make the pair plot
sns.pairplot(
    penguins[variables].dropna(),
    palette=color_map,
    hue="species",
    plot_kws=dict(alpha=0.5, edgecolor="none"),
)
sns.catplot(x="species", y="bill_length_mm", hue="sex", data=penguins);
Great tutorial available in the seaborn documentation
The color_palette function in seaborn is very useful. It is the easiest way to get a list of hex strings for a specific color map.
viridis = sns.color_palette("viridis", n_colors=7).as_hex()
print(viridis)
sns.palplot(viridis)
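If seaborn is not available, a similar list of hex strings can be built with matplotlib alone. This is a sketch of an alternative, not what `color_palette` does internally, and the sample points chosen here (evenly spaced from 0 to 1) may differ from the ones seaborn picks, so the colors need not match exactly.

```python
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import to_hex

# Look up the color map by name
cmap = plt.get_cmap("viridis")

# Sample 7 evenly spaced positions along the color map and convert each to hex
positions = np.linspace(0, 1, 7)
hex_colors = [to_hex(cmap(p)) for p in positions]

print(hex_colors)
```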
You can also create custom light, dark, or diverging color maps, based on the desired hues at either end of the color map.
sns.palplot(sns.diverging_palette(10, 220, sep=50, n=7))