Sep 22, 2020
Two parts:
- Proper data visualization is crucial throughout all of the steps of the data science pipeline: data wrangling, modeling, and storytelling
- GeoViews builds on HoloViews to add support for geographic data
Where does hvPlot fit in?
The hvPlot package
It's relatively new: officially released in February 2019.
%%html
<center>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">We are very pleased officially announce the release of hvPlot! It provides a high-level plotting API for the PyData ecosystem including <a href="https://twitter.com/pandas_dev?ref_src=twsrc%5Etfw">@pandas_dev</a>, <a href="https://twitter.com/xarray_dev?ref_src=twsrc%5Etfw">@xarray_dev</a>, <a href="https://twitter.com/dask_dev?ref_src=twsrc%5Etfw">@dask_dev</a>, <a href="https://twitter.com/geopandas?ref_src=twsrc%5Etfw">@geopandas</a> and more, generating interactive <a href="https://twitter.com/datashader?ref_src=twsrc%5Etfw">@datashader</a> and <a href="https://twitter.com/BokehPlots?ref_src=twsrc%5Etfw">@BokehPlots</a>. <a href="https://t.co/Loc5XElJUL">https://t.co/Loc5XElJUL</a></p>— HoloViews (@HoloViews) <a href="https://twitter.com/HoloViews/status/1092409050283819010?ref_src=twsrc%5Etfw">February 4, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
An interface just like the pandas .plot() function, but much more useful.
# Our usual imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
# let's load the measles data from week 2
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2020/week-2/master/data/measles_incidence.csv"
measles_data_raw = pd.read_csv(url, skiprows=2, na_values='-')
measles_data_raw.head()
measles_data = measles_data_raw.melt(id_vars=["YEAR", "WEEK"], value_name="incidence", var_name="state")
measles_data.head()
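The melt() call above reshapes the wide table (one column per state) into tidy format (one row per observation). A minimal sketch of the same reshape on a toy table (the data here is made up, not the real measles file):

```python
import pandas as pd

# A small wide-format table: one column per state (toy values)
wide = pd.DataFrame({
    "YEAR": [1928, 1929],
    "WEEK": [1, 1],
    "ALABAMA": [3.67, 2.01],
    "ALASKA": [None, 0.5],
})

# Melt the state columns into (state, incidence) pairs, keeping YEAR/WEEK as identifiers
tidy = wide.melt(id_vars=["YEAR", "WEEK"], var_name="state", value_name="incidence")
print(tidy)
```

Each non-identifier column name becomes a value in the new "state" column, so 2 rows × 2 state columns yields 4 tidy rows.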
Plotting with pandas
The default .plot() doesn't know which variables to plot.
fig, ax = plt.subplots(figsize=(10, 6))
measles_data.plot(ax=ax)
But we can group by year and plot the total national incidence for each year:
by_year = measles_data.groupby("YEAR")['incidence'].sum()
by_year.head()
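The groupby above collapses the weekly, per-state records into one national total per year. The same pattern on toy data:

```python
import pandas as pd

# Toy weekly incidence records for two states over two years
df = pd.DataFrame({
    "YEAR": [1928, 1928, 1928, 1929],
    "state": ["ALABAMA", "ALABAMA", "ALASKA", "ALABAMA"],
    "incidence": [3.0, 2.0, 1.0, 4.0],
})

# Sum incidence over all states and weeks within each year
by_year = df.groupby("YEAR")["incidence"].sum()
print(by_year)  # 1928 -> 6.0, 1929 -> 4.0
```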
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the total annual incidence by year
by_year.plot(ax=ax)
# Add the vaccine year and label
ax.axvline(x=1963, c='k', linewidth=2)
ax.text(1963, 27000, " Vaccine introduced", ha='left', fontsize=18);
Plotting with hvplot
Use the .hvplot() function to create interactive plots.
# This will add the .hvplot() function to your DataFrame!
import hvplot.pandas
# This registers "bokeh" as the desired backend for the interactive plots
import holoviews as hv
hv.extension("bokeh")
img = by_year.hvplot(kind='line')
img
In this case, .hvplot() creates a HoloViews Curve object. Not unlike altair Chart objects, it's an object that knows how to translate from your DataFrame data to a visualization.
print(img)
by_year.hvplot(kind='scatter')
by_year.hvplot(kind='bar', rot=90, width=1000)
Use the * operator to layer together chart elements. Note: the same thing can be accomplished in altair, but with the + operator.
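The `*` overlay works because HoloViews objects overload Python's arithmetic operators. The sketch below is not HoloViews internals, just the general operator-composition pattern (class names here are invented for illustration):

```python
class Element:
    """A toy chart element that composes via * (overlay)."""
    def __init__(self, name):
        self.name = name

    def __mul__(self, other):
        # Return a new object representing both elements drawn on the same axes
        return Overlay([self, other])

class Overlay(Element):
    def __init__(self, elements):
        # Flatten nested overlays so (a * b) * c holds three elements
        flat = []
        for el in elements:
            flat.extend(el.elements if isinstance(el, Overlay) else [el])
        self.elements = flat
        super().__init__(" * ".join(el.name for el in flat))

curve = Element("Curve")
vline = Element("VLine")
label = Element("Text")
chart = curve * vline * label
print(chart.name)  # Curve * VLine * Text
```

This is why `print(final_chart)` above shows a composite object: the operator builds a container that knows how to render all of its parts together.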
# The line chart of incidence vs year
incidence = by_year.hvplot(kind='line')
# Vertical line + label for vaccine year
vline = hv.VLine(1963).opts(color='black')
label = hv.Text(1963, 27000, " Vaccine introduced", halign='left')
final_chart = incidence * vline * label
final_chart
This is some powerful magic.
Let's calculate the annual measles incidence for each year and state:
by_state = measles_data.groupby(['YEAR', 'state'])['incidence'].sum()
by_state.head()
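Grouping by two keys gives a Series with a two-level (YEAR, state) MultiIndex, which is what makes the .loc[:, states] selection used further below work. A toy version:

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [1928, 1928, 1929, 1929],
    "state": ["ALABAMA", "ALASKA", "ALABAMA", "ALASKA"],
    "incidence": [3.0, 1.0, 4.0, 2.0],
})

# Sum for each (YEAR, state) pair -> Series with a two-level MultiIndex
by_state = df.groupby(["YEAR", "state"])["incidence"].sum()

# Select all years for a subset of states via the second index level
subset = by_state.loc[:, ["ALASKA"]]
print(subset)
```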
Now, tell hvplot to produce a chart for each state:
by_state_chart = by_state.hvplot(x="YEAR",
                                 y="incidence",
                                 groupby="state",
                                 width=400,
                                 kind="line")
by_state_chart
PA = by_state_chart['PENNSYLVANIA'].relabel('PA')
NJ = by_state_chart['NEW JERSEY'].relabel('NJ')
Combining charts with the + operator
combined = PA + NJ
combined
print(combined)
The charts are side-by-side by default. You can also specify the number of rows/columns explicitly.
# one column
combined.cols(1)
Using the by keyword:
states = ['NEW YORK', 'NEW JERSEY', 'CALIFORNIA', 'PENNSYLVANIA']
sub_states = by_state.loc[:, states]
sub_state_chart = sub_states.hvplot(x='YEAR',
                                    y='incidence',
                                    by='state',
                                    kind='line')
sub_state_chart * vline
Just like in altair, when we used the alt.Chart().facet(column='state') syntax. Below, we specify that the state column should be used to facet the chart into separate columns:
img = sub_states.hvplot(x="YEAR",
                        y='incidence',
                        col="state",
                        kind="line",
                        rot=90,
                        frame_width=200) * vline
img
# Try tab-completion to explore the available plot types: by_state.hvplot.<TAB>
by_state.loc[1960:1970, states].hvplot.bar(x='YEAR',
                                           y='incidence',
                                           by='state', rot=90)
Change bar() to line() and we get the same thing as before.
by_state.loc[1960:1970, states].hvplot.line(x='YEAR',
                                            y='incidence',
                                            by='state', rot=90)
See the help message for explicit hvplot functions:
by_state.hvplot?
by_state.hvplot.line?
Can we reproduce the WSJ measles heatmap that we made in altair in week 2?
Use the help function:
measles_data.hvplot.heatmap?
We want to plot 'YEAR' on the x axis, 'state' on the y axis, and specify 'incidence' as the values being plotted in each heatmap bin.
Two options:
- Pass the tidy data (measles_data), with columns for state, week, year, and incidence, and use the reduce_function keyword to sum over weeks
- Pass the by_state data frame, which has already summed over weeks for each state
# METHOD #1: just plot the incidence
heatmap = by_state.hvplot.heatmap(
    x="YEAR",
    y="state",
    C="incidence",
    cmap="viridis",
    height=500,
    width=1000,
    flip_yaxis=True,
    rot=90,
)
heatmap.redim(state="State", YEAR="Year")
## METHOD 2: hvplot does the aggregation
heatmap = measles_data.hvplot.heatmap(
    x="YEAR",
    y="state",
    C="incidence",
    cmap='viridis',
    reduce_function=np.sum,
    height=500,
    width=1000,
    flip_yaxis=True,
    rot=90,
)
heatmap.redim(state="State", YEAR="Year")
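Conceptually, reduce_function=np.sum is doing the same aggregation you could write yourself with a pivot table: one cell per (state, YEAR), summing over weeks. A quick check of the idea on toy data:

```python
import numpy as np
import pandas as pd

# Toy weekly incidence records
df = pd.DataFrame({
    "YEAR": [1928, 1928, 1928, 1929],
    "state": ["ALABAMA", "ALABAMA", "ALASKA", "ALASKA"],
    "incidence": [3.0, 2.0, 1.0, 4.0],
})

# One cell per (state, YEAR), summing over weeks -- the same grid the heatmap colors
grid = df.pivot_table(index="state", columns="YEAR", values="incidence", aggfunc=np.sum)
print(grid)
```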
import hvplot
hvplot.save(heatmap, 'measles.html')
# load the html file and display it
from IPython.display import HTML
HTML('measles.html')
Let's load the penguins data set from week 2
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2020/week-2/master/data/penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
Use the hvplot.scatter_matrix() function:
hvplot.scatter_matrix?
columns = ['flipper_length_mm',
           'bill_length_mm',
           'body_mass_g',
           'species']
hvplot.scatter_matrix(penguins[columns], c='species')
Note the "box select" and "lasso" features on the toolbar for interactions.
GeoPandas support in the .hvplot() function
Let's load some geographic data for countries:
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()
fig, ax = plt.subplots(figsize=(10,10))
world.plot(column='gdp_md_est', ax=ax)
ax.set_axis_off()
world.hvplot.polygons?
# Can also just do world.hvplot()
world.hvplot.polygons(c='gdp_md_est',
                      geo=True,
                      frame_height=400)
If you recall last week's exercise...
import geopandas as gpd
# Load the data
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2020/week-3/master/data/opa_residential.csv"
data = pd.read_csv(url)
# Create the Point() objects
data['Coordinates'] = gpd.points_from_xy(data['lng'], data['lat'])
# Create the GeoDataFrame
data = gpd.GeoDataFrame(data, geometry='Coordinates', crs="EPSG:4326")
# load the Zillow data from GitHub
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2020/week-3/master/data/zillow_neighborhoods.geojson"
zillow = gpd.read_file(url)
# Important: Make sure the CRS match
data = data.to_crs(zillow.crs)
# perform the spatial join
data = gpd.sjoin(data, zillow, op='within', how='left')
# Calculate the median market value per Zillow neighborhood
median_values = data.groupby('ZillowName', as_index=False)['market_value'].median()
# Merge median values with the Zillow geometries
median_values = zillow.merge(median_values, on='ZillowName')
print(type(median_values))
median_values.head()
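The spatial join above asks, for each property, which neighborhood polygon contains it. The core test behind op='within' is point-in-polygon; a minimal ray-casting sketch in plain Python (for intuition only, not the actual shapely/geopandas implementation):

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from
    (x, y) toward +infinity crosses. An odd count means the point is inside.
    `polygon` is a list of (x, y) vertex tuples; the last vertex wraps to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge span the ray's height?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A unit square "neighborhood"
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True  (inside)
print(point_in_polygon(2.0, 0.5, square))  # False (outside)
```

Spatial joins do this test for every (point, polygon) pair, accelerated by a spatial index so most pairs are never checked.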
median_values.crs
# pass arguments directly to hvplot()
# and it recognizes polygons automatically
median_values.hvplot(c='market_value',
                     frame_width=600,
                     frame_height=500,
                     geo=True,
                     cmap='viridis',
                     hover_cols=['ZillowName'])
geo=True assumes EPSG:4326
If you specify geo=True, the data needs to be in the typical lat/lng CRS (EPSG:4326). If not, you can use the crs keyword to specify the CRS your data is in.
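For intuition about what the CRS conversion does, the EPSG:4326 → EPSG:3857 (Web Mercator) transformation is just a formula. A hand-rolled sketch using the standard spherical Mercator equations (in practice you would use geopandas/pyproj, as above):

```python
import math

EARTH_RADIUS = 6378137.0  # Web Mercator sphere radius in meters

def lnglat_to_web_mercator(lng, lat):
    """Convert EPSG:4326 degrees to EPSG:3857 meters (spherical Mercator)."""
    x = math.radians(lng) * EARTH_RADIUS
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * EARTH_RADIUS
    return x, y

# Roughly the coordinates of Philadelphia City Hall
x, y = lnglat_to_web_mercator(-75.16, 39.95)
print(round(x), round(y))
```

This is why the hvplot axes switch from degrees to large meter values once the data is in EPSG:3857.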
median_values_3857 = median_values.to_crs(epsg=3857)
median_values_3857.crs
median_values_3857.hvplot(c='market_value',
                          frame_width=600,
                          frame_height=500,
                          geo=True,
                          crs=3857,  # NEW: specify the CRS
                          cmap='viridis',
                          hover_cols=['ZillowName'])
Let's add a tile source underneath the choropleth map
import geoviews as gv
import geoviews.tile_sources as gvts
%%opts WMTS [width=800, height=800, xaxis=None, yaxis=None]
choro = median_values.hvplot(c='market_value',
                             width=500,
                             height=400,
                             alpha=0.5,
                             geo=True,
                             cmap='viridis',
                             hover_cols=['ZillowName'])
gvts.ESRI * choro
print(type(gvts.ESRI))
%%opts WMTS [width=200, height=200, xaxis=None, yaxis=None]
(gvts.OSM + gvts.Wikipedia + gvts.StamenToner + gvts.EsriNatGeo +
gvts.EsriImagery + gvts.EsriUSATopo + gvts.EsriTerrain + gvts.CartoDark).cols(4)
Note: we've used the %%opts cell magic to apply styling options to any charts generated in the cell. See the documentation guide on customizations for more details.
We had a good question last class — is there an interactive version of matplotlib's hex bin?
You can do it with hvplot! Sort of.
data['x'] = data.geometry.x
data['y'] = data.geometry.y
data.head()
Extract the x and y coordinate columns and the associated market_value column, dropping the geometry column:
subdata = data[['x', 'y', 'market_value']]
type(subdata)
Key arguments:
- C: the column to aggregate for each bin (raw counts are shown if not provided)
- reduce_function: determines how to aggregate the C column
data.hvplot.hexbin?
subdata.head()
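What hexbin does with C and reduce_function, conceptually: bucket the points into bins and aggregate the C values within each bin. A simplified numpy sketch using square bins instead of hexagons (the data is randomly generated as a stand-in for market values):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake point data: coordinates plus a value to aggregate per bin
x = rng.uniform(0, 10, 1000)
y = rng.uniform(0, 10, 1000)
values = rng.uniform(10_000, 500_000, 1000)

gridsize = 5
# Assign each point to a square bin (hexbin uses hexagonal bins instead)
ix = np.clip((x / 10 * gridsize).astype(int), 0, gridsize - 1)
iy = np.clip((y / 10 * gridsize).astype(int), 0, gridsize - 1)

# Aggregate: median of `values` per bin, mirroring reduce_function=np.median
binned = np.full((gridsize, gridsize), np.nan)
for i in range(gridsize):
    for j in range(gridsize):
        mask = (ix == i) & (iy == j)
        if mask.any():
            binned[i, j] = np.median(values[mask])
print(binned.round(0))
```

The resulting grid of medians is exactly what the hexbin plot colors, bin by bin.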
subdata.hvplot.hexbin(x='x',
                      y='y',
                      C='market_value',
                      reduce_function=np.median,
                      logz=True,
                      geo=True,
                      gridsize=40,
                      cmap='viridis')
Not the prettiest, but it gets the job done for some quick exploratory analysis!
Some very cool examples are available in the hvPlot and HoloViews galleries.