Week 3: Geospatial Data Analysis and GeoPandas¶

Sep 15, 2020

Housekeeping¶

Homework #2 due a week of Thursday (9/24)
Choose a dataset to visualize and explore
- OpenDataPhilly or one your choosing
- Email me if you want to analyze one that's not on OpenDataPhilly

Agenda for Week #3¶

Vector data and introduction to GeoPandas
Spatial relationships and joins
Visualization for geospatial data
Demo: 311 requests by neighborhood in Philadelphia
Exercise: Property assessments by neighborhood

Now, on to geospatial analysis...¶

# Let's setup the imports we'll need first
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

%matplotlib inline

Vector Data¶

Vector refers to discrete geometric entities
The Open Geospatial Consortium has standardized a set of simple features
Includes points, lines, and polygons

Source

A couple of terminology notes¶

A feature refers to both the geometry and attributes of specific piece of vector data
A feature collection is a list, or collection, of features

Both terms are very common in Python geospatial software.

Common formats for vector datasets¶

A shapefile¶

Actually several files with the same common prefix

Mandatory files:

.shp: the file containing the geometries
.shx: the file that indexes the geometry
.dbf: tabular data format storing the attributes for each geometry

And many optional files for documentation, projection information, etc.

Let's take a look at an example shapefile:

We'll use the ls command to list out all of the files in an example shapefile in the data/ folder

%ls "data/ne_110m_admin_0_countries/"

ne_110m_admin_0_countries.cpg  ne_110m_admin_0_countries.shp
ne_110m_admin_0_countries.dbf  ne_110m_admin_0_countries.shx
ne_110m_admin_0_countries.prj

The GeoJSON file¶

Stores simple features in a JSON format
Arose due to the prevalence of the JSON format, especially on the web

Additional GeoJSON resources and tools:¶

GitHub lets you view GeoJSON files natively
http://geojson.io provides interactive creation and viewing of small GeoJSON files

GitHub example from the data/ directory: Philadelphia ZIP Codes

Working with vector data in Python: GeoPandas¶

geopandas provides a simple, intuitive for the main types of geospatial vector file formats

Example: Let's load a shape file of countries in the world...

Source: Natural Earth Data

import geopandas as gpd

We can use the read_file() function to read shapefiles and GeoJSON files.

# Read the shape file, giving the name of the directory
countries = gpd.read_file("./data/ne_110m_admin_0_countries")

countries.head()

type(countries)

geopandas.geodataframe.GeoDataFrame

What's a `GeoDataFrame`?¶

Just like a DataFrame but with a geometry column

# Print out the first 10 entires of the "geometry" column
countries.geometry.head(n=10)

0    POLYGON ((61.211 35.650, 62.231 35.271, 62.985...
1    MULTIPOLYGON (((23.904 -11.722, 24.080 -12.191...
2    POLYGON ((21.020 40.843, 21.000 40.580, 20.675...
3    POLYGON ((51.580 24.245, 51.757 24.294, 51.794...
4    MULTIPOLYGON (((-66.960 -54.897, -67.562 -54.8...
5    POLYGON ((43.583 41.092, 44.972 41.248, 45.179...
6    MULTIPOLYGON (((-59.572 -80.040, -59.866 -80.5...
7    POLYGON ((68.935 -48.625, 69.580 -48.940, 70.5...
8    MULTIPOLYGON (((145.398 -40.793, 146.364 -41.1...
9    POLYGON ((16.980 48.123, 16.904 47.715, 16.341...
Name: geometry, dtype: geometry

We can still leverage the power of `pandas`...¶

Calculate the total world population:

countries['pop_est'].sum()/1e9 # In billions

7.383089462

Calculate the total population on each continent:

grouped = countries.groupby('continent')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc0b6b0b790>

REMEMBER:
Remember: the groupby() does not return a DataFrame — you need to call sum(), mean(), etc, or apply() a function.

# Sum population on each continent
pop_by_continent = grouped['pop_est'].sum()

# Sort values
pop_by_continent.sort_values(ascending=False, inplace=True)

# Output sorted values from cell
pop_by_continent

continent
Asia                       4.389145e+09
Africa                     1.219176e+09
Europe                     7.463985e+08
North America              5.730421e+08
South America              4.185407e+08
Oceania                    3.678284e+07
Antarctica                 4.050000e+03
Seven seas (open ocean)    1.400000e+02
Name: pop_est, dtype: float64

Filter the data frame based on a boolean selection:

# Is the country name USA?
is_USA = countries['name']=='United States of America'

# Get the row with USA
USA = countries.loc[is_USA]

USA

An aside: the `squeeze()` function¶

It does just one it sounds like: if you have a DataFrame with only one row, it will "squeeze" the row dimension by removing it, returning just a Series object:

# Squeeze
USA = USA.squeeze()

# Print out the type
print("The type of USA is: ", type(USA))

# Output
USA

The type of USA is:  <class 'pandas.core.series.Series'>

iso_a3                                                      USA
name                                   United States of America
continent                                         North America
pop_est                                             3.26626e+08
gdp_md_est                                            1.856e+07
geometry      (POLYGON ((-122.84 49.00000000000014, -120 49....
Name: 168, dtype: object

The simple features (Lines, Points, Polygons) are implemented by the shapely library

type(USA.geometry)

shapely.geometry.multipolygon.MultiPolygon

Jupyter notebook renders shapely geometries automatically:

# a mini USA
USA.geometry

How does `geopandas` handle coordinate systems and map projections?¶

Coordinate Reference Systems¶

A coordinate reference system (CRS) relates the position of a geometry object on the spherical earth to its two-dimensional coordinates.

A GeoDataFrame or GeoSeries has a .crs attribute which specifies the coordinate reference system.

countries.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

EPSG 4326 is known as WGS 84 where x and y are longitude and latitude.
It is is the default coordinate system for GPS systems.
It's also known as Plate Carrée

How to plot all of the geometries at once?¶

Use the plot() function to get a quick and dirty plot of all of the geometry features.

Note: the plot() returns the current maplotlib axes, allowing you to format the chart after plotting.

# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the countries on our axes
ax = countries.plot(ax=ax, facecolor="none", edgecolor="black")

# Add a title
ax.set_title("Equirectangular Projection")

Text(0.5, 1.0, 'Equirectangular Projection')

What's going on under the hood?¶

matplotlib and cartopy are combined to make geo-aware plots

import cartopy.crs as ccrs

# Initialize the EPSG:4326 CRS object
wgs84 = ccrs.PlateCarree()

# Create a geo-aware axes using the "projetion" keyword
fig, ax = plt.subplots(figsize=(10, 6), subplot_kw={"projection": wgs84})

# Print out the type of the axes for info purposes
print("The type of the axes is: ", type(ax))

# Add the geometry shapes
ax.add_geometries(countries["geometry"], crs=wgs84, facecolor="none", edgecolor="black")

# Add a title
ax.set_title("Equirectangular Projection");

The type of the axes is:  <class 'cartopy.mpl.geoaxes.GeoAxesSubplot'>

See the Geopandas documentation for more examples: Plotting with CartoPy and GeoPandas

Can we convert to other geometries?¶

Use the to_crs() function. The most well-known projections can be specified by their EPSG code.

Geopandas documentation on re-projecting: Managing Projections

Let's convert to the Mercator projection¶

EPSG code: 3395

# Remove Antartica, as the Mercator projection 
# cannot deal with the poles
no_antartica = countries[(countries['name'] != "Antarctica")]

# Two ways to specify the EPSG code
countries_mercator = no_antartica.to_crs(epsg=3395)

# Alternatively:
# countries_mercator = no_antartica.to_crs("EPSG:3395")

countries_mercator.head()

Note: the magnitude of the values in the geometry column changed! A quick and easy way to tell if the re-projection worked properly!

Now let's plot it¶

The easy way...with geopandas built-in plot() function

# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Use built-in plot() of the GeoDataFrame
ax = countries_mercator.plot(ax=ax, facecolor="none", 
                             edgecolor="black")

# Add a title
ax.set_title("Mercator Projection");

The harder way...¶

With cartopy and matplotlib

# Initialize the CRS object
crs = ccrs.epsg(3395) # or crs = ccrs.Mercator()

# Create a geo-aware axes using the "projetion" keyword
fig, ax = plt.subplots(figsize=(10, 6), 
                       subplot_kw={"projection": crs})

# Add the geometry shapes
ax.add_geometries(countries_mercator['geometry'], 
                  crs=crs, facecolor='none', edgecolor='black')

# Add a title
ax.set_title('Mercator Projection');

When to use the 'harder' way?¶

When you need more customizable, advanced plots.

Nearly anything that matplotlib can do can be plotted on a cartopy GeoAxes. Plotting directly with matplotlib allows you to take full advantage of matplotlib's functionality.

So which CRS is best?¶

For city-based data, usually Web Mercator (EPSG=3857) is best
Can also use a CRS specific to individual states, e.g., PA State Plane EPSG=2272

Let's load the city limits for Philadelphia¶

We'll use the provided City_Limits shape file in the data/ folder

city_limits = gpd.read_file('./data/City_Limits')

city_limits

What's the CRS?¶

Use the .crs attribute to find out!

city_limits.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

# Create our figure and axes
fig, ax = plt.subplots(figsize=(5, 5))

# Plot
city_limits.plot(ax=ax, facecolor="none", edgecolor="black")

# Format
ax.set_title("Equirectangular")
ax.set_axis_off() # This will remove the axes completely
ax.set_aspect("equal") # This forces an equal aspect ratio

This is not what Philadelphia looks like..

Let's try EPSG=3857 instead:

# Create the figure
fig, ax = plt.subplots(figsize=(5, 5))

# Convert to EPSG:3857
city_limits_3857 = city_limits.to_crs(epsg=3857)

# Plot and format
city_limits_3857.plot(ax=ax, facecolor="none", edgecolor="black")

ax.set_title("Web Mercator")
ax.set_axis_off()
ax.set_aspect("equal");

Important: the equirectangular CRS (EPSG=4326) is often used by default and will make cities appear wider and flatter than they really are

Saving GeoDataFrames¶

Use the to_file() function and specify the driver.

# ESRI shape file
city_limits_3857.to_file("./data/city_limits_3857", 
                         driver='ESRI Shapefile')

# GeoJSON is also an option
city_limits_3857.to_file("./data/city_limits_3857.geojson", 
                         driver='GeoJSON')

How about as a CSV file?¶

Yes, but reading requires more work...

# save a csv file
city_limits_3857.to_csv("./data/city_limits_3857.csv", index=False)

df = pd.read_csv("./data/city_limits_3857.csv")
df.head()

Looks similar...¶

But, the "geometry" column is just stored as a string...it's not a shapely Polygon

type(df.geometry.iloc[0])

str

Use `shapely` to parse the string version of the polygons¶

from shapely import wkt

# wkt.loads will convert from string to Polygon object
df['geometry'] = df['geometry'].apply(wkt.loads)

type(df.geometry.iloc[0])

shapely.geometry.polygon.Polygon

Converting from a DataFrame to a GeoDataFrame¶

We can initialize the GeoDataFrame directly from a DataFrame but we need to specify two things:

The name of the "geometry" column
The CRS of the "geometry" column

In this case, the geometry column was saved in Web Mercator EPSG=3857

# Make specifying the name of the geometry column and CRS
gdf= gpd.GeoDataFrame(df, geometry='geometry', crs="EPSG:3857")

# Now plot
fig, ax = plt.subplots(figsize=(5,5))
ax = gdf.plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_axis_off()
ax.set_aspect("equal")

Let's convert back to 4326 and plot¶

The tilt should be a bit more obvious now...

ax = gdf.to_crs(epsg=4326).plot(facecolor='none', edgecolor='black')
 
ax.set_axis_off()
ax.set_aspect("equal")

Note¶

I didn't use plt.subplots() here to create a figure/axes – I let geopandas automatically make one
I've chained together the to_crs() and .plot() functions in one line
The .plot() function returns the axes object that geopandas used to plot — this lets you customizes the axes after plotting

So, when should you use GeoPandas?¶

For exploratory data analysis and visualization, including in Jupyter notebooks
Pre-processing data to be fed into a desktop GIS program
For compact, readable, and reproducible code
If you’re comfortable with Pandas and/or R data frames.

When it may not be the best tool:¶

For polished multilayer map creation — one option is to use a desktop GIS like QGIS.
If you need very high performance — geopandas can be slow compared to other GIS software.

Spatial Relationships and Joins¶

# Load some cities data
cities = gpd.read_file("./data/ne_110m_populated_places")

(Image by Krauss, CC BY-SA 3.0)

All of these operations are available as functions of a GeoDataFrame.

A quick example¶

What country is New York in?

Spoiler: the USA

# Select the Point representing New York City
new_york = cities.loc[cities['name'] == 'New York', 'geometry'].squeeze()
new_york

type(new_york)

shapely.geometry.point.Point

countries.contains(new_york)

0      False
1      False
2      False
3      False
4      False
       ...  
172    False
173    False
174    False
175    False
176    False
Length: 177, dtype: bool

# Find the country that contains New York
countries.loc[countries.contains(new_york)]

# Get the geometry column of the country containing NYC
USA = countries.loc[countries.contains(new_york), 'geometry'].squeeze()
USA

Note¶

The .loc[] function can take the index selector as the first argument, and the name of a column as a second argument (separated by a comma)

type(USA)

shapely.geometry.multipolygon.MultiPolygon

# Is New York within the USA?
new_york.within(USA)

True

Reference¶

The different functions for checking spatial relationships:

equals
contains
crosses
disjoint
intersects

overlaps
touches
within
covers

See the shapely documentation for an overview of these methods.

The spatial join¶

SPATIAL JOIN = merging attributes from two geometry layers based on their spatial relationship

Different parts of this operations:

The GeoDataFrame to which we want add information
The GeoDataFrame that contains the information we want to add
The spatial relationship we want to use to match both datasets (intersects, contains, within)
The type of join: left or inner join

In this case, we want to join the cities dataframe, containing Point geometries, with the information of the countries dataframe, containing Polygon geometries.

To match cities with countries, we'll use the within spatial relationship.

The geopandas.sjoin() function performs this operation:

joined = gpd.sjoin(cities, countries, op='within', how='left')

joined.head()

# Extract Italy
italy = countries.loc[countries['name']=='Italy']

# Plot
fig, ax = plt.subplots(figsize=(8,8))
italy.plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_axis_off()
ax.set_aspect("equal")

# Plot the first city in the joined data frame (Vatican City)
# Use the same axes by passing in the ax=ax keyword
ax = joined.iloc[:1].plot(ax=ax, color='red')

Spatial overlay operation¶

We can also perform the join() operation on the geometries rather than just combining attributes.

The overlay() function combines geometries, e.g. by taking the intersection of the geometries.

africa = countries[countries['continent'] == 'Africa']

## What crs?
africa.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

# Let's transform to a CRS that uses meters 
# instead of degrees (EPSG=3857)
africa = africa.to_crs(epsg=3857)

africa.crs

<Projected CRS: EPSG:3857>
Name: WGS 84 / Pseudo-Mercator
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: World - 85°S to 85°N
- bounds: (-180.0, -85.06, 180.0, 85.06)
Coordinate Operation:
- name: Popular Visualisation Pseudo-Mercator
- method: Popular Visualisation Pseudo Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

fig, ax = plt.subplots(figsize=(8,8))

africa.plot(ax=ax, facecolor='#b9f2b1')

ax.set_axis_off()
ax.set_aspect("equal")

# Important CRS needs to match!
cities = cities.to_crs(epsg=3857)

# Create a copy of the GeoDataFrame
buffered_cities = cities.copy()

# Add a buffer region of 250 km around all cities
buffered_cities['geometry'] = buffered_cities.buffer(250e3)

Plot the difference of the two geometries¶

fig, ax = plt.subplots(figsize=(8, 8))

# Calculate the difference of the geometry sets
diff = gpd.overlay(africa, buffered_cities, how='difference')

# Plot
diff.plot(facecolor="#b9f2b1", ax=ax)
ax.set_axis_off()
ax.set_aspect("equal")

# Data attributes are the same as the first data frame (africa) 
# with an updated geometry column
diff.head()

Plot the intersection of the two geometries¶

fig, ax = plt.subplots(figsize=(8, 8))

# The intersection of the geometry sets
intersection = gpd.overlay(africa, buffered_cities, how='intersection')

# Plot
intersection.plot(ax=ax, facecolor="#b9f2b1")
ax.set_axis_off()
ax.set_aspect("equal")

Recap:¶

Spatial join: merge attributes from one data frame to another based on the spatial relationship
Spatial overlay: creating new geometries based on spatial operation between both data frames (and not combining attributes of both data frames)

Putting it all together: 311 requests in 2020¶

Load 311 requests in Philadelphia from the data/ directory.

Source: OpenDataPhilly

# Load the data from a CSV file into a pandas DataFrame
requests = pd.read_csv('./data/public_cases_fc_2020.csv')

/Users/nhand/opt/miniconda3/envs/musa-550-fall-2020/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (12) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

print("number of requests = ", len(requests))

number of requests =  562549

requests.head()

First, convert to a GeoDataFrame¶

Remove the requests missing lat/lon coordinates

requests = requests.dropna(subset=['lat', 'lon'])

Create Point objects for each lat and lon combination.

We can use the helper utility function: geopandas.points_from_xy()

requests['Coordinates'] = gpd.points_from_xy(requests['lon'], requests['lat'])

requests['Coordinates'].head()

2     POINT (-75.27059 39.97706)
9     POINT (-75.24549 39.92376)
14    POINT (-75.16257 40.04816)
19    POINT (-75.18500 40.03733)
22    POINT (-75.20961 39.94040)
Name: Coordinates, dtype: geometry

Now, convert to a GeoDataFrame.

Important

Don't forget to set the CRS manually!
The CRS you specify when creating a GeoDataFrame should tell geopandas what the coordinate system the input data is in.
Usually you will be reading lat/lng coordinates, and will need to specify the crs as EPSG code 4326
You should specify the crs as a string using the syntax: ESPG:4326

Since we're only using a few EPSG codes in this course, you can usually tell what the CRS is by looking at the values in the Point() objects.

Philadelphia has a latitude of about 40 deg and longitude of about -75 deg.

Our data must be in the usual lat/lng EPSG=4326.

Screen%20Shot%202020-09-12%20at%204.15.11%20PM.png

requests = gpd.GeoDataFrame(requests, geometry="Coordinates", crs="EPSG:4326")

Next, identify the top 20 most common requests¶

Group by the service name and calculate the size of each group:

service_types = requests.groupby('service_name').size()

Sort by the number (in descending order):

service_types = service_types.sort_values(ascending=False)

Slice the data to take the first 20 elements:

top20 = service_types.iloc[:20]
top20

service_name
Rubbish/Recyclable Material Collection    40776
Illegal Dumping                           19246
Maintenance Complaint                     17969
Abandoned Vehicle                         15933
Information Request                       15240
Graffiti Removal                          11652
Street Light Outage                       10107
Street Defect                              7370
Fire Safety Complaint                      4874
Street Trees                               4450
Sanitation / Dumpster Violation            4244
Construction Complaints                    3859
Agency Receivables                         3451
Maintenance Residential or Commercial      3389
Other (Streets)                            2922
LI Escalation                              2807
Parks and Rec Safety and Maintenance       2228
Traffic Signal Emergency                   2157
Complaint (Streets)                        2100
Alley Light Outage                         1650
dtype: int64

Let's trim to only the trash-related requests¶

trash_requests = requests.loc[
    requests["service_name"] == "Rubbish/Recyclable Material Collection"
]

print("The nuumber of trash-related requests = ", len(trash_requests))

The nuumber of trash-related requests =  40776

Trash collection has been a BIG issue in Philadelphia recently¶

See for example, this article in the Philadelphia Inquirer

Let's plot the monthly totals for 2020¶

# Convert the requested datetime to a column of Datetime objects
trash_requests['requested_datetime'] = pd.to_datetime(trash_requests['requested_datetime'])

# Use the .dt attribute to extract out the month name
trash_requests['month'] = trash_requests['requested_datetime'].dt.month_name()

/Users/nhand/opt/miniconda3/envs/musa-550-fall-2020/lib/python3.7/site-packages/geopandas/geodataframe.py:853: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super(GeoDataFrame, self).__setitem__(key, value)

totals_by_week = trash_requests.groupby("month", as_index=False).size()

totals_by_week.head()

Note: I've used the `as_index=False` syntax here¶

This will force the size() function to return a DataFrame instead of having the month column as the index of the resulted groupby operation.

It saves us from having to do the .reset_index() function call after running the .size() function.

Plot a bar chart with seaborn¶

For making static bar charts with Python, seaborn's sns.barplot() is the best option

import seaborn as sns

# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))

# Plot!
sns.barplot(
    x="month",
    y="size",
    data=totals_by_week,
    color="#2176d2",
    ax=ax,
    order=["January", "February", "March", "April", "May", "June", "July", "August", "September"],
);

Example: Improving the aesthetics of matplotlib¶

The trend is clear in the previous chart, but can we do a better job with the aesthetics? Yes!

For reference, here is a common way to clean up charts in matplotlib:

# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))

# Plot!
sns.barplot(
    x="month",
    y="size",
    data=totals_by_week,
    color="#2176d2",
    ax=ax,
    order=["January", "February", "March", "April", "May", "June", "July", "August", "September"],
    zorder=999 # Make sure the bar charts are on top of the grid
)

# Remove x/y axis labels
ax.set_xlabel("")
ax.set_ylabel("")

# Format the ytick labels to use a comma and no decimal places
ax.set_yticklabels([f"{yval:,.0f}" for yval in ax.get_yticks()] )

# Add a grid backgrou d
ax.grid(True, axis='y')

# Remove the top and right axes lines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add a title
ax.set_title("Philadelphia's Trash-Related 311 Requests in 2020", weight='bold', fontsize=16);

/Users/nhand/opt/miniconda3/envs/musa-550-fall-2020/lib/python3.7/site-packages/ipykernel_launcher.py:20: UserWarning: FixedFormatter should only be used together with FixedLocator

	iso_a3	name	continent	pop_est	gdp_md_est	geometry
0	AFG	Afghanistan	Asia	34124811.0	64080.0	POLYGON ((61.21082 35.65007, 62.23065 35.27066...
1	AGO	Angola	Africa	29310273.0	189000.0	MULTIPOLYGON (((23.90415 -11.72228, 24.07991 -...
2	ALB	Albania	Europe	3047987.0	33900.0	POLYGON ((21.02004 40.84273, 20.99999 40.58000...
3	ARE	United Arab Emirates	Asia	6072475.0	667200.0	POLYGON ((51.57952 24.24550, 51.75744 24.29407...
4	ARG	Argentina	South America	44293293.0	879400.0	MULTIPOLYGON (((-66.95992 -54.89681, -67.56244...

	iso_a3	name	continent	pop_est	gdp_md_est	geometry
0	AFG	Afghanistan	Asia	34124811.0	64080.0	POLYGON ((6813956.990 4227673.562, 6927484.435...
1	AGO	Angola	Africa	29310273.0	189000.0	MULTIPOLYGON (((2660998.216 -1305442.810, 2680...
2	ALB	Albania	Europe	3047987.0	33900.0	POLYGON ((2339940.185 4961221.199, 2337708.178...
3	ARE	United Arab Emirates	Asia	6072475.0	667200.0	POLYGON ((5741805.754 2765811.385, 5761611.935...
4	ARG	Argentina	South America	44293293.0	879400.0	MULTIPOLYGON (((-7453944.198 -7306880.704, -75...

	name_left	geometry	index_right	iso_a3	name_right	continent	pop_est	gdp_md_est
0	Vatican City	POINT (12.45339 41.90328)	79.0	ITA	Italy	Europe	62137802.0	2221000.0
1	San Marino	POINT (12.44177 43.93610)	79.0	ITA	Italy	Europe	62137802.0	2221000.0
2	Vaduz	POINT (9.51667 47.13372)	9.0	AUT	Austria	Europe	8754413.0	416600.0
3	Lobamba	POINT (31.20000 -26.46667)	152.0	SWZ	Swaziland	Africa	1467152.0	11060.0
4	Luxembourg	POINT (6.13000 49.61166)	97.0	LUX	Luxembourg	Europe	594130.0	58740.0

	iso_a3	name	continent	pop_est	gdp_md_est	geometry
1	AGO	Angola	Africa	29310273.0	189000.0	MULTIPOLYGON (((2660998.216 -1314118.413, 2680...
13	BEN	Benin	Africa	11038805.0	24310.0	POLYGON ((183878.692 948685.104, 185288.796 10...
14	BFA	Burkina Faso	Africa	20107509.0	32990.0	MULTIPOLYGON (((155663.935 1282381.422, 138422...
25	BWA	Botswana	Africa	2214858.0	35900.0	POLYGON ((3276376.220 -2522491.957, 3118864.44...
26	CAF	Central African Rep.	Africa	5625118.0	3206.0	POLYGON ((1700901.761 828525.199, 1792937.514 ...

	objectid	service_request_id	status	status_notes	service_name	service_code	agency_responsible	service_notice	requested_datetime	updated_datetime	expected_datetime	address	zipcode	media_url	lat	lon
0	7890359	13127945	Closed	Question Answered	Information Request	SR-IR01	Police Department	NaN	2020-02-05	2020-02-05	2020-02-05	NaN	NaN	NaN	NaN	NaN
1	8433329	13376073	Closed	NaN	Information Request	SR-IR01	License & Inspections	NaN	2020-05-22	2020-05-22	2020-05-22	NaN	NaN	NaN	NaN	NaN
2	8421006	13370944	Open	NaN	Abandoned Vehicle	SR-PD01	Police Department	60 Business Days	2020-05-20	2020-05-20	2020-08-20	1826 PENNINGTON RD	NaN	NaN	39.977058	-75.270591
3	8433331	13376078	Closed	NaN	Information Request	SR-IR01	Streets Department	NaN	2020-05-22	2020-05-22	2020-05-22	NaN	NaN	NaN	NaN	NaN
4	8288384	13325114	Closed	Question Answered	Information Request	SR-IR01	Department of Records	NaN	2020-04-30	2020-04-30	2020-04-30	NaN	NaN	NaN	NaN	NaN

	month	size
0	April	5773
1	August	4414
2	February	2067
3	January	2710
4	July	9622

Week 3: Geospatial Data Analysis and GeoPandas¶

Housekeeping¶

Agenda for Week #3¶

Now, on to geospatial analysis...¶

Vector Data¶

A couple of terminology notes¶

Common formats for vector datasets¶

A shapefile¶

The GeoJSON file¶

Additional GeoJSON resources and tools:¶

Working with vector data in Python: GeoPandas¶

What's a GeoDataFrame?¶

We can still leverage the power of pandas...¶

An aside: the squeeze() function¶

How does geopandas handle coordinate systems and map projections?¶

Coordinate Reference Systems¶

How to plot all of the geometries at once?¶

What's going on under the hood?¶

Can we convert to other geometries?¶

Let's convert to the Mercator projection¶

Now let's plot it¶

The harder way...¶

When to use the 'harder' way?¶

So which CRS is best?¶

Let's load the city limits for Philadelphia¶

What's the CRS?¶

Saving GeoDataFrames¶

How about as a CSV file?¶

Looks similar...¶

Use shapely to parse the string version of the polygons¶

Converting from a DataFrame to a GeoDataFrame¶

Let's convert back to 4326 and plot¶

Note¶

So, when should you use GeoPandas?¶

When it may not be the best tool:¶

Spatial Relationships and Joins¶

A quick example¶

Note¶

Reference¶

The spatial join¶

Spatial overlay operation¶

Plot the difference of the two geometries¶

Plot the intersection of the two geometries¶

Recap:¶

Putting it all together: 311 requests in 2020¶

First, convert to a GeoDataFrame¶

Next, identify the top 20 most common requests¶

Let's trim to only the trash-related requests¶

Trash collection has been a BIG issue in Philadelphia recently¶

Let's plot the monthly totals for 2020¶

Note: I've used the as_index=False syntax here¶

Plot a bar chart with seaborn¶

Example: Improving the aesthetics of matplotlib¶

What's a `GeoDataFrame`?¶

We can still leverage the power of `pandas`...¶

An aside: the `squeeze()` function¶

How does `geopandas` handle coordinate systems and map projections?¶

Use `shapely` to parse the string version of the polygons¶

Note: I've used the `as_index=False` syntax here¶