Sep 17, 2020
# Let's set up the imports we'll need first
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import geopandas as gpd
%matplotlib inline
Load 311 requests in Philadelphia from the data/ directory.
Source: OpenDataPhilly
# Load the data from a CSV file into a pandas DataFrame
requests = pd.read_csv('./data/public_cases_fc_2020.csv')
print("number of requests = ", len(requests))
requests.head()
Remove the requests missing lat/lon coordinates
requests = requests.dropna(subset=['lat', 'lon'])
Create Point objects for each lat and lon combination.
We can use the helper utility function: geopandas.points_from_xy()
requests['Coordinates'] = gpd.points_from_xy(requests['lon'], requests['lat'])
requests['Coordinates'].head()
Now, convert to a GeoDataFrame.
Important: the CRS here is EPSG:4326.
Since we're only using a few EPSG codes in this course, you can usually tell what the CRS is by looking at the values in the Point() objects. Philadelphia has a latitude of about 40 deg and a longitude of about -75 deg, so our data must be in the usual lat/lng EPSG=4326.
requests = gpd.GeoDataFrame(requests,
geometry="Coordinates",
crs="EPSG:4326")
Group by the service name and calculate the size of each group:
service_types = requests.groupby('service_name').size()
Sort by the number (in descending order):
service_types = service_types.sort_values(ascending=False)
Slice the data to take the first 20 elements:
top20 = service_types.iloc[:20]
top20
trash_requests = requests.loc[
requests["service_name"] == "Rubbish/Recyclable Material Collection"
].copy()
print("The nuumber of trash-related requests = ", len(trash_requests))
See, for example, this article in the Philadelphia Inquirer.
# Convert the requested datetime to a column of Datetime objects
trash_requests['requested_datetime'] = pd.to_datetime(trash_requests['requested_datetime'])
# Use the .dt attribute to extract out the month name
trash_requests['month'] = trash_requests['requested_datetime'].dt.month_name()
TL;DR: This is usually fine!
If you select a subset of a dataframe (a "slice") and then make changes (like adding a new column), you will get this warning. There is a good discussion of the issue on StackOverflow.
You can usually make the warning go away by adding a .copy() after you perform your selection. For example, the warning disappears if we do:
trash_requests = requests.loc[requests["service_name"] == "Rubbish/Recyclable Material Collection"].copy()
totals_by_month = trash_requests.groupby("month", as_index=False).size()
totals_by_month.head()
Note the as_index=False syntax here: it forces the .size() function to return a DataFrame instead of a Series with the month column as the index of the grouped result. It saves us from having to call .reset_index() after running .size().
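For comparison, here is a minimal sketch of the two equivalent spellings (the variable names are just for illustration):
# With as_index=False, "month" stays a regular column of a DataFrame
by_month = trash_requests.groupby("month", as_index=False).size()
# Equivalent, but more verbose: reset the index after the fact
by_month_alt = trash_requests.groupby("month").size().reset_index(name="size")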
For making static bar charts with Python, seaborn's sns.barplot() is the best option.
import seaborn as sns
# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))
# Plot!
sns.barplot(
x="month",
y="size",
data=totals_by_month,
color="#2176d2",
ax=ax,
order=["January", "February", "March", "April", "May", "June", "July", "August", "September"],
);
The trend is clear in the previous chart, but can we do a better job with the aesthetics? Yes!
For reference, here is a common way to clean up charts in matplotlib:
# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))
# Plot!
sns.barplot(
x="month",
y="size",
data=totals_by_month,
color="#2176d2",
ax=ax,
order=["January", "February", "March", "April", "May", "June", "July", "August", "September"],
zorder=999 # Make sure the bar charts are on top of the grid
)
# Remove x/y axis labels
ax.set_xlabel("")
ax.set_ylabel("")
# Format the ytick labels to use a comma and no decimal places
ax.set_yticklabels([f"{yval:,.0f}" for yval in ax.get_yticks()] )
# Add a grid background
ax.grid(True, axis='y')
# Remove the top and right axes lines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Add a title
ax.set_title("Philadelphia's Trash-Related 311 Requests in 2020", weight='bold', fontsize=16);
The original data has EPSG=4326. We'll convert to EPSG=3857.
trash_requests = trash_requests.to_crs(epsg=3857)
trash_requests.head()
A GeoJSON file holding Zillow definitions for Philadelphia neighborhoods is available in the data/ directory.
zillow = gpd.read_file('data/zillow_neighborhoods.geojson')
zillow = zillow.to_crs(epsg=3857)
zillow.head()
fig, ax = plt.subplots(figsize=(8, 8))
ax = zillow.plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_axis_off()
ax.set_aspect("equal")
Use the sjoin() function to match point data (requests) to polygon data (neighborhoods).
joined = gpd.sjoin(trash_requests, zillow, op='within', how='left')
joined.head()
Note that this operation can be slow
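Because we used how='left', requests that fall outside every neighborhood polygon are kept but get NaN for the joined columns. A quick, optional check:
# Count requests that didn't match any neighborhood polygon
unmatched = joined['ZillowName'].isnull().sum()
print("requests outside all neighborhoods = ", unmatched)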
Group by neighborhood and calculate the size:
totals = joined.groupby('ZillowName', as_index=False).size()
type(totals)
Note: we're once again using as_index=False to ensure the result of the .size() function is a DataFrame rather than a Series with ZillowName as its index.
totals.head()
Lastly, merge Zillow geometries (GeoDataFrame) with the total # of requests per neighborhood (DataFrame).
Important: when merging a GeoDataFrame (spatial) and a DataFrame (non-spatial), you should always call the .merge() function of the spatial data set to ensure the merged result is a GeoDataFrame. For example...
totals = zillow.merge(totals, on='ZillowName')
totals.head()
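A quick way to verify that the result kept its spatial type (a sanity check, not part of the original flow):
# The merged result should still be a GeoDataFrame
print(type(totals))
# Merging in the other direction (DataFrame.merge(GeoDataFrame)) returns a
# plain pandas DataFrame, which loses the geometry-aware plotting used below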
Choropleth maps color polygon regions according to the values of a specific data attribute. They are built into GeoDataFrame objects.
First, plot the total number of requests per neighborhood.
# Create the figure/axes
fig, ax = plt.subplots(figsize=(8, 8))
# Plot
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis"
)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
# Needed to line up the colorbar properly
from mpl_toolkits.axes_grid1 import make_axes_locatable
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# NEW: Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
totals.plot(
ax=ax,
cax=cax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis",
)
# NEW: Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = totals.total_bounds
# NEW: Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
These improvements are optional, but they definitely make for nicer plots!
Yes, classification schemes are built into the plot() function! There are many different schemes, but here are some of the most common ones:
# Quantiles Scheme
fig, ax = plt.subplots(figsize=(10, 7), facecolor="#cfcfcf")
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc="lower right"),
cmap="Reds",
scheme="Quantiles",
k=5,
)
ax.set_title("Quantiles: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# Equal Interval Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='#cfcfcf')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="EqualInterval",
k=5
)
ax.set_title("Equal Interval: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# Fisher Jenks Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='#cfcfcf')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="FisherJenks",
)
ax.set_title("Fisher Jenks: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# User Defined Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='lightgray')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="UserDefined",
classification_kwds=dict(bins=[100, 300, 500, 800, 1400]) ## NEW: specify user defined bins
)
ax.set_title("User Defined Bins")
ax.set_axis_off()
ax.set_aspect("equal")
The documentation can be found here: https://pysal.org/mapclassify/api.html
It contains the full list of schemes and the function definitions for each.
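For example, here is a minimal sketch of using mapclassify directly (assuming it is installed; it is the library geopandas calls under the hood for the scheme= keyword) to inspect the bin edges for the quantile map above:
import mapclassify
# Compute the same five quantile bins used by scheme="Quantiles", k=5
classifier = mapclassify.Quantiles(totals["size"], k=5)
print(classifier.bins)  # the upper edge of each class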
Better to normalize by area: use the .area attribute of the geometry series
totals['N_per_area'] = totals['size'] / (totals.geometry.area)
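One caveat: in EPSG:3857 the geometry areas are in square meters, so these ratios are tiny. If you prefer friendlier numbers, here is a sketch of the same idea per square kilometer (the column name is just an example):
# Same normalization, expressed as requests per square kilometer
totals['N_per_km2'] = totals['size'] / (totals.geometry.area / 1e6)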
Now plot the normalized totals:
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# NEW: Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
totals.plot(
ax=ax,
cax=cax,
column="N_per_area",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis",
)
# NEW: Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = totals.total_bounds
# NEW: Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
Since households are driving the 311 requests, it would be even better to normalize by the number of properties in a given neighborhood rather than by neighborhood area.
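A minimal sketch of that idea, assuming a hypothetical point dataset of properties (the file name and columns below are assumptions, not part of this lecture's data):
# Hypothetical: one row per property, with Point geometries
properties = gpd.read_file("./data/properties.geojson").to_crs(epsg=3857)
# Count properties per neighborhood with a spatial join
matched = gpd.sjoin(properties, zillow, op="within", how="left")
counts = matched.groupby("ZillowName", as_index=False).size()
counts = counts.rename(columns={"size": "n_properties"})
# Requests per property, rather than per unit area
totals = totals.merge(counts, on="ZillowName")
totals["N_per_property"] = totals["size"] / totals["n_properties"]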
Hexagonal bins aggregate quantities over small spatial regions. We can use matplotlib's hexbin() function.
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Extract out the x/y coordinates of the Point objects
xcoords = trash_requests.geometry.x
ycoords = trash_requests.geometry.y
# Plot a hexbin chart
hex_vals = ax.hexbin(xcoords, ycoords, gridsize=50)
# Add the zillow geometry boundaries
zillow.plot(ax=ax, facecolor="none", edgecolor="white", linewidth=0.25)
# add a colorbar and format
fig.colorbar(hex_vals, ax=ax)
ax.set_axis_off()
ax.set_aspect("equal")
Let's plot a random sample of the requests, with a nice basemap underneath.
We'll use the contextily utility package.
import contextily as ctx
# load the city limits data
city_limits = gpd.read_file('./data/City_Limits')
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot a random sample of 1,000 requests as points
random_requests = trash_requests.sample(1000)
# Plot
random_requests.plot(ax=ax, marker='.', color='crimson')
# Add the city limits
city_limits.to_crs(trash_requests.crs).plot(ax=ax, edgecolor='black', linewidth=3, facecolor='none')
# NEW: plot the basemap underneath
ctx.add_basemap(ax=ax, crs=trash_requests.crs, source=ctx.providers.CartoDB.Positron)
# remove axis lines
ax.set_axis_off()
Lots of different tile providers are available. It's easiest to use tab completion on ctx.providers:
ctx.providers
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# plot a random sample of requests
trash_requests.sample(1000).plot(ax=ax, marker='.', color='crimson')
# add the city limits
city_limits.to_crs(trash_requests.crs).plot(ax=ax, edgecolor='white', linewidth=3, facecolor='none')
# plot the basemap underneath
ctx.add_basemap(ax=ax, crs=trash_requests.crs, source=ctx.providers.CartoDB.DarkMatter) # NEW: DarkMatter
# remove axis lines
ax.set_axis_off()
Yes! Let's add interactivity. We'll start with altair. Altair recently added full support for GeoDataFrames, making interactive choropleths very easy to make!
import altair as alt
# IMPORTANT: Altair needs the GeoDataFrame to be in EPSG:4326
totals_4326 = totals.to_crs(epsg=4326)
# Plot the map (under the hood, the columns are nested within the GeoJSON `properties`)
alt.Chart(totals_4326).mark_geoshape(stroke="white").encode(
tooltip=["N_per_area:Q", "ZillowName:N", "size:Q"],
color=alt.Color("N_per_area:Q", scale=alt.Scale(scheme="viridis")),
).properties(width=500, height=400)
Challenge for later: use altair's repeated charts to show several choropleths for different 311 request types at once.
A similar example (using a different dataset) is available in the altair gallery.
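One possible sketch of the challenge (using alt.hconcat rather than altair's repeat operator, since each chart needs its own aggregated data; the helper function and the two service names below are just examples, not the official solution):
def choropleth_for(service_name):
    # Aggregate one request type by neighborhood
    subset = requests.loc[requests["service_name"] == service_name].to_crs(epsg=3857)
    matched = gpd.sjoin(subset, zillow, op="within", how="left")
    counts = matched.groupby("ZillowName", as_index=False).size()
    df = zillow.merge(counts, on="ZillowName").to_crs(epsg=4326)
    return (
        alt.Chart(df)
        .mark_geoshape(stroke="white")
        .encode(
            tooltip=["ZillowName:N", "size:Q"],
            color=alt.Color("size:Q", scale=alt.Scale(scheme="viridis")),
        )
        .properties(width=300, height=250, title=service_name)
    )
# Show two request types side by side
alt.hconcat(choropleth_for("Illegal Dumping"), choropleth_for("Graffiti Removal"))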
Goals: Visualize the property assessment values by neighborhood in Philadelphia, using a choropleth map.
Challenge (if time remaining): Visualize the highest-valued residential and commercial properties as points on top of a contextily basemap.
2019 property assessment data:
data = pd.read_csv('./data/opa_residential.csv')
data.head()
We'll focus on the market_value column for this analysis.
Remember to set the EPSG of the input data — this data is in the typical lat/lng coordinates (EPSG=4326)
data = data.dropna(subset=['lat', 'lng'])
data["Coordinates"] = gpd.points_from_xy(data["lng"], data["lat"])
data = gpd.GeoDataFrame(data, geometry="Coordinates", crs="EPSG:4326")
len(data)
Use the sjoin() function. Make sure your CRSs match before doing the sjoin!
zillow.crs
data = data.to_crs(epsg=3857)
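A small, optional guard before the join:
# Verify both layers share a CRS before the spatial join
assert data.crs == zillow.crs, "CRS mismatch: reproject one layer first"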
gdata = gpd.sjoin(data, zillow, op='within', how='left')
gdata.head()
Hints: group by the ZillowName column and take the median of the market_value values.
grouped = gdata.groupby('ZillowName', as_index=False)
median_values = grouped['market_value'].median()
median_values.head()
median_values = zillow.merge(median_values, on='ZillowName')
median_values['market_value'] /= 1e3 # in thousands
median_values.head()
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
median_values.plot(
ax=ax, cax=cax, column="market_value", edgecolor="white", linewidth=0.5, legend=True
)
# Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = median_values.total_bounds
# Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
# Format cax labels
cax.set_yticklabels([f"${val:.0f}k" for val in cax.get_yticks()]);
Hints:
- Use the C and reduce_C_function keywords of the hexbin() function
- Run plt.hexbin? for more help
- Try bins='log' on the resulting map

Note: you should pass in the raw point data rather than any aggregated data to the hexbin() function.
# Create the axes
fig, ax = plt.subplots(figsize=(10, 8))
# Use the .x and .y attributes
# NOTE: we are passing in the raw point values here!
# Matplotlib is doing the binning and aggregation work for us!
xcoords = gdata.geometry.x
ycoords = gdata.geometry.y
hex_vals = ax.hexbin(
xcoords,
ycoords,
C=gdata.market_value / 1e3,
reduce_C_function=np.median,
bins="log",
gridsize=30,
)
# Add the zillow geometry boundaries
zillow.plot(ax=ax, facecolor="none", edgecolor="white", linewidth=1, alpha=0.5)
# Add a colorbar and format
cbar = fig.colorbar(hex_vals, ax=ax)
ax.set_axis_off()
ax.set_aspect("equal")
# Format cbar labels
cbar.set_ticks([100, 1000])
cbar.set_ticklabels(["$100k", "$1M"]);
# Convert median values to EPSG=4326
median_values_4326 = median_values.to_crs(epsg=4326)
# Plot the map
alt.Chart(median_values_4326).mark_geoshape(stroke="white").encode(
tooltip=["market_value:Q", "ZillowName:N"],
color=alt.Color("market_value:Q", scale=alt.Scale(scheme='viridis'))
).properties(
width=500, height=400
)
They are all located in Center City
sorted_properties = gdata.sort_values(by='market_value', ascending=False)
top20 = sorted_properties.head(n=20)
top20
# Create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot the top 20 properties by market value
top20.plot(ax=ax, marker=".", color="crimson", markersize=200)
# Add the city limits (with the same CRS!)
city_limits = city_limits.to_crs(top20.crs)
city_limits.plot(
ax=ax, edgecolor="black", linewidth=3, facecolor="none"
)
# Plot the basemap underneath
ctx.add_basemap(ax=ax, crs=top20.crs, source=ctx.providers.CartoDB.Positron)
# remove axis lines
ax.set_axis_off()
We can use the total_bounds of the top 20 properties dataframe.
# Create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot the top 20 properties by market value
top20.plot(ax=ax, marker=".", color="crimson", markersize=200)
# Add the city limits (with the same CRS!)
city_limits = city_limits.to_crs(top20.crs)
city_limits.plot(
ax=ax, edgecolor="black", linewidth=3, facecolor="none"
)
# Set the xlims and ylims using the boundaries
# of the top 20 properties and a padding (in meters)
xmin, ymin, xmax, ymax = top20.total_bounds
PAD = 1000
ax.set_xlim(xmin-PAD, xmax+PAD)
ax.set_ylim(ymin-PAD, ymax+PAD)
# Plot the basemap underneath
ctx.add_basemap(ax=ax, crs=top20.crs, source=ctx.providers.CartoDB.Positron)
# Remove axis lines
ax.set_axis_off()
The axes limits must be set before the call to ctx.add_basemap(). This enables contextily to load a basemap with the proper zoom level!
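If the automatic choice ever misbehaves, contextily also accepts an explicit zoom level (a sketch reusing the axes from above):
# Optionally pin the tile zoom level instead of relying on auto-detection
ctx.add_basemap(ax=ax, crs=top20.crs, source=ctx.providers.CartoDB.Positron, zoom=14)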
The top 20 properties are actually all contained within just 6 buildings in Center City.