Lecture 12B: Predictive Modeling Part 2

Nov 19, 2020

Housekeeping

Schedule next week

Thanksgiving holiday will alter our schedule a bit:

The Canvas calendar and course home page schedule should reflect these changes.

Picking up where we left off

We'll start with an exercise to add additional distance-based features to our housing price model...

Spatial amenity/disamenity features

The strategy

Examples of new possible features...

Distance from each sale to:

Load and clean the data:

Add new distance-based features:

Example #1: 311 Graffiti Calls

Source: https://www.opendataphilly.org/dataset/311-service-and-information-requests

Download the data:

Get the x/y coordinates for the graffiti calls:

Run the neighbors algorithm to calculate the new feature:
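A minimal sketch of this whole example, assuming the carto2gpd package, the city's Carto table name and service_name filter for graffiti calls, and a `sales` GeoDataFrame of home sales already projected to EPSG:3857 from the earlier loading step:

```python
import carto2gpd
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Query 311 graffiti calls from the city's Carto API
# (table name and service_name filter are assumptions)
url = "https://phl.carto.com/api/v2/sql"
graffiti = carto2gpd.get(url, "public_cases_fc",
                         where="service_name = 'Graffiti Removal'")

# Project to a CRS in meters and get the x/y coordinates
graffiti = graffiti.dropna(subset=["geometry"]).to_crs(epsg=3857)
graffitiXY = np.column_stack((graffiti.geometry.x, graffiti.geometry.y))

# Coordinates of each sale (assumes `sales` is already in EPSG:3857)
salesXY = np.column_stack((sales.geometry.x, sales.geometry.y))

# Fit the neighbors algorithm on the graffiti calls and query from the sales
k = 5  # number of neighbors (an assumption)
nbrs = NearestNeighbors(n_neighbors=k).fit(graffitiXY)
dists, _ = nbrs.kneighbors(salesXY)

# New feature: log of the mean distance to the k nearest graffiti calls
sales["logDistGraffiti"] = np.log10(dists.mean(axis=1))
```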

Example #2: Subway stops

Use the osmnx package to get subway stops in Philly via the ox.geometries_from_polygon() function.

Get the geometry polygon for the Philadelphia city limits:

Use osmnx to query OpenStreetMap to get all subway stations within the city limits:

Get the distance to the nearest subway station ($k=1$):
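A sketch covering these steps, assuming osmnx >= 0.16 (where geometries_from_polygon() is available), an assumed OSM tag for subway stations, and the `salesXY` coordinates from the graffiti sketch:

```python
import numpy as np
import osmnx as ox
from sklearn.neighbors import NearestNeighbors

# Get the Philadelphia city limits polygon (in lat/lng)
city_limits = ox.geocode_to_gdf("Philadelphia, PA, USA")
polygon = city_limits.geometry.iloc[0]

# Query OSM for subway stations within the city limits (tag is an assumption)
subway = ox.geometries_from_polygon(polygon, tags={"station": "subway"})

# Project and get coordinates (centroids handle any non-point geometries)
subway = subway.to_crs(epsg=3857)
subwayXY = np.column_stack((subway.geometry.centroid.x,
                            subway.geometry.centroid.y))

# Distance from each sale to the single nearest station (k=1)
nbrs = NearestNeighbors(n_neighbors=1).fit(subwayXY)
dists, _ = nbrs.kneighbors(salesXY)
sales["logDistSubway"] = np.log10(dists.squeeze())
```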

Now, let's run the model...

Can we improve on this?

Exercise: How about other spatial features?

Example 1: Universities

New feature: Distance to the nearest university/college

Example 2: Parks

New feature: Distance to the nearest park centroid

Notes

Example 3: City Hall

New feature: Distance to City Hall.

Notes

Example 4: New Construction Permits

New feature: Distance to the 5 nearest new construction permits from 2019

Notes

Example 5: Aggravated Assaults

New feature: Distance to the 5 nearest aggravated assaults in 2019

Notes

Example 6: Abandoned Vehicle 311 Calls

New feature: Distance to the 5 nearest abandoned vehicle 311 calls in 2019

Notes

Fit the updated model

More improvement!

Feature importances:
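A sketch of pulling these out of a fitted pipeline; the pipeline variable, step name, and feature list are assumptions, and this assumes no pipeline step reorders the columns:

```python
import pandas as pd

# `pipe` is the fitted pipeline; the final step is the random forest
regressor = pipe["randomforestregressor"]

# Pair each importance with its feature column and sort
importances = pd.Series(regressor.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```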

Part 2: Predicting bikeshare demand in Philadelphia

The technical problem: predict bikeshare trip counts for the Indego bikeshare in Philadelphia

The policy question: how to best expand a bikeshare program?

For more info, see this blog post from Pew

Using predictive modeling as a policy tool

What are the key assumptions here?

Most important: adding new stations in new areas will not affect the demand for existing stations.

This allows the results of a demand model trained on existing stations to translate to new stations.

The key assumption is that the bikeshare is not yet at full capacity, and riders in new areas will not decrease the demand in other areas.

Is this a good assumption?

Typically, this is a pretty safe assumption. But I encourage you to use historical data to verify it!

Getting trip data for the Indego bike share

The data page also includes the live status of all of the stations in the system.

API endpoint: http://www.rideindego.com/stations/json

Two important columns:

Let's plot the stations, colored by the number of docks
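A sketch, assuming the endpoint returns GeoJSON that geopandas can read directly and that the dock count column is named "totalDocks":

```python
import geopandas as gpd

# Read the live station status straight from the API endpoint
stations = gpd.read_file("http://www.rideindego.com/stations/json")

# Plot the stations, colored by the number of docks
ax = stations.to_crs(epsg=3857).plot(
    column="totalDocks", legend=True, markersize=20
)
ax.set_axis_off()
```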

Load all trips from 2018 and 2019

Dependent variable: total trips by starting station
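A sketch of the groupby, assuming `trips` holds the combined 2018-2019 trip records with a "start_station" column and that the stations join key is "kioskId" (both column names are assumptions):

```python
# Count the trips that start at each station
trip_totals = (
    trips.groupby("start_station")
         .size()
         .reset_index(name="total_trips")
)

# Merge onto the stations GeoDataFrame (join keys are assumptions)
stations = stations.merge(
    trip_totals, left_on="kioskId", right_on="start_station", how="left"
)
```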

Let's plot it...

Trips are clearly concentrated in Center City...

What features to use?

There are lots of possible options. Generally speaking, the possible features fall into a few different categories:

Let's add a few from each category...

1. Internal characteristics

Let's use the number of docks per station:

2. Census demographic data

We'll try out percent commuting by car first.

Merge with block group geometries for Philadelphia:

Finally, let's merge the census data into our dataframe of features by spatially joining the stations and the block groups:

Each station gets the census data associated with the block group the station is within.
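A sketch of that join, assuming geopandas >= 0.10 (for the predicate keyword) and a `block_groups` GeoDataFrame carrying an assumed "percent_car" column:

```python
import geopandas as gpd

# Match coordinate systems, then join each station to the block group
# that contains it
stations = gpd.sjoin(
    stations.to_crs(block_groups.crs),
    block_groups[["geometry", "percent_car"]],
    how="left",
    predicate="within",
)
```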

"Impute" missing values with the median value

Note: scikit-learn contains preprocessing transformers for more advanced imputation methods. The documentation contains a good description of the options.

In particular, the SimpleImputer object (see the docs) can fill values based on the median, mean, most frequent value, or a constant value.
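For example, a median-imputation sketch for the assumed "percent_car" column:

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# Fill missing percent_car values (stations that fell outside any
# block group) with the median across all stations
stations["percent_car"] = imputer.fit_transform(stations[["percent_car"]])
```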

3. Amenities/disamenities

Let's add two new features:

  1. Distances to the nearest 10 restaurants from Open Street Map
  2. Whether the station is located within Center City

Restaurants

Search https://wiki.openstreetmap.org/wiki/Map_Features for the OSM identifier for restaurants

Get x/y values for the stations:
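A sketch of the restaurant feature; the amenity=restaurant tag comes from the Map Features wiki, while the variable names and CRS choice are assumptions:

```python
import numpy as np
import osmnx as ox
from sklearn.neighbors import NearestNeighbors

# City limits polygon (in lat/lng), as in the subway example
city_limits = ox.geocode_to_gdf("Philadelphia, PA, USA")
polygon = city_limits.geometry.iloc[0]

# Query OSM for restaurants within the city limits
restaurants = ox.geometries_from_polygon(polygon,
                                         tags={"amenity": "restaurant"})
restaurants = restaurants.to_crs(epsg=3857)
restXY = np.column_stack((restaurants.geometry.centroid.x,
                          restaurants.geometry.centroid.y))

# x/y values for the stations in the same CRS
stations_3857 = stations.to_crs(epsg=3857)
stationXY = np.column_stack((stations_3857.geometry.x,
                             stations_3857.geometry.y))

# New feature: log of the mean distance to the 10 nearest restaurants
nbrs = NearestNeighbors(n_neighbors=10).fit(restXY)
dists, _ = nbrs.kneighbors(stationXY)
stations["logDistRestaurants"] = np.log10(dists.mean(axis=1))
```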

Within the Center City business district

Available from OpenDataPhilly
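A sketch of the flag, assuming `center_city` is a one-polygon GeoDataFrame downloaded from OpenDataPhilly:

```python
# Match coordinate systems and extract the district polygon
center_city = center_city.to_crs(stations.crs)
cc_polygon = center_city.geometry.iloc[0]

# Boolean feature, cast to int for modeling
stations["within_center_city"] = stations.within(cc_polygon).astype(int)
```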

4. Transportation Network

Create the graph object from the bounding box:

Let's plot the graph using the built-in plotting from osmnx:

Now extract the nodes (intersections) from the graph into a GeoDataFrame:

Now, compute the average distance to the 10 nearest intersections:
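A sketch covering these steps, assuming osmnx ~0.16 (the graph_from_bbox() signature changed in later releases) and reusing `stationXY` from the restaurant sketch:

```python
import numpy as np
import osmnx as ox
from sklearn.neighbors import NearestNeighbors

# Bounding box of the stations, in lat/lng
west, south, east, north = stations.to_crs(epsg=4326).total_bounds

# Create the street network graph from the bounding box and plot it
G = ox.graph_from_bbox(north, south, east, west, network_type="drive")
ox.plot_graph(G)

# Extract the nodes (intersections) into a GeoDataFrame and project
nodes = ox.graph_to_gdfs(G, edges=False).to_crs(epsg=3857)
nodeXY = np.column_stack((nodes.geometry.x, nodes.geometry.y))

# Average distance from each station to its 10 nearest intersections
nbrs = NearestNeighbors(n_neighbors=10).fit(nodeXY)
dists, _ = nbrs.kneighbors(stationXY)
stations["logDistIntersections"] = np.log10(dists.mean(axis=1))
```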

Let's plot the stations, coloring by the new feature (distance to intersections):

5. Neighboring Stations

We will add two new features:

  1. The average distance to the nearest 5 stations
  2. The average trip total for the nearest 5 stations

First, find the nearest 5 stations:

Notes

The log of the distances to the 5 nearest stations:

Now, let's add the trip counts of the 5 neighboring stations:

Use the indices returned by the NearestNeighbors() algorithm to identify which stations are the 5 neighbors in the original data
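A sketch of both features. Note n_neighbors=6: when a dataset is queried against itself, each station's nearest neighbor is itself at distance zero, so we drop the first column of the results:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Query the stations against themselves (stationXY from earlier sketches)
nbrs = NearestNeighbors(n_neighbors=6).fit(stationXY)
dists, indices = nbrs.kneighbors(stationXY)

# Feature 1: log of the average distance to the 5 nearest stations
stations["logDistStations"] = np.log10(dists[:, 1:].mean(axis=1))

# Feature 2: the "spatial lag" — average trip total of the 5 nearest
# stations, looked up via the returned indices (column name assumed)
trip_counts = stations["total_trips"].values
stations["laggedTrips"] = trip_counts[indices[:, 1:]].mean(axis=1)
```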

Is there a correlation between trip counts and the spatial lag?

Yes!

We can use seaborn to make a quick plot:
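For example, with seaborn's regplot, comparing the log of both quantities (column names are from the spatial-lag sketch above):

```python
import numpy as np
import seaborn as sns

# Scatter plot plus a linear regression fit
sns.regplot(
    x=np.log10(stations["laggedTrips"]),
    y=np.log10(stations["total_trips"]),
)
```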

Let's look at the correlations of all of our features

Again, use seaborn to investigate.

Remember: we don't want to include multiple features that are highly correlated in our model.
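A heatmap sketch; the feature list is an assumption carried over from the earlier sketches:

```python
import seaborn as sns

# Assumed feature column names from the preceding sketches
feature_cols = [
    "totalDocks", "percent_car", "logDistRestaurants",
    "within_center_city", "logDistIntersections",
    "logDistStations", "laggedTrips",
]

# Correlation matrix of the features and the target, as a heatmap
corr = stations[feature_cols + ["total_trips"]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```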

Let's fit a model!

Just as before, the plan is to:

Perform our test/train split

We'll use a 60%/40% split, due to the relatively small number of stations.
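A sketch of the split, assuming `bike_data` is the assembled features data frame and using a fixed seed for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 60% train / 40% test
train_set, test_set = train_test_split(
    bike_data, test_size=0.4, random_state=42
)
```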

Random forest results

Let's run a simple grid search to try to optimize our hyperparameters.

Try out a few different values for two of the main parameters for random forests: n_estimators and max_depth:

Important: just like last week, we will need to prefix the parameter name with the name of the pipeline step, in this case, "randomforestregressor".

Evaluate the best estimator on the test set:
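A sketch of the grid search and evaluation; the pipeline, the parameter values, and the use of log trip counts as the target are assumptions (feature_cols comes from the correlation sketch above). Note the "randomforestregressor__" prefix on the parameter names:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(random_state=42))

# Parameter names prefixed by the pipeline step name
param_grid = {
    "randomforestregressor__n_estimators": [5, 10, 50, 100, 200],
    "randomforestregressor__max_depth": [2, 5, 7, 9, 13, 21],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(train_set[feature_cols], np.log(train_set["total_trips"]))

# Score (R^2) of the best estimator on the test set
print(grid.best_estimator_.score(
    test_set[feature_cols], np.log(test_set["total_trips"])
))
```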

Evaluate a linear model (baseline) on the test set
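A sketch of the baseline, fit and scored on the same split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

linear = make_pipeline(StandardScaler(), LinearRegression())
linear.fit(train_set[feature_cols], np.log(train_set["total_trips"]))

# R^2 on the test set, for comparison with the random forest
print(linear.score(test_set[feature_cols],
                   np.log(test_set["total_trips"])))
```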

Which features were the most important?

From our earlier correlation analysis, we should expect the most important features to be:

Let's analyze the spatial structure of the predictions visually

We'll plot the predicted and actual trip values

Use the test set index (test_set.index) to get the data from the original data frame (bike_data).

This ensures we have geometry info (not used in the modeling) for our test data set:

The data frame indices line up!

Now, make our predictions, and convert them from log to raw counts:
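A sketch, assuming the model was trained on the natural log of the trip totals, as in the grid-search sketch:

```python
import numpy as np

# Use the test set index to pull the rows (with geometry) from bike_data
test_gdf = bike_data.loc[test_set.index]

# Predict in log space, then convert back to raw counts with exp
test_gdf["prediction"] = np.exp(
    grid.best_estimator_.predict(test_set[feature_cols])
)
test_gdf["actual"] = test_set["total_trips"].values
```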

Let's make the plot with side-by-side panels for actual and predicted:
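A matplotlib sketch of the side-by-side maps (column names from the prediction sketch above):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)

# One panel each for the actual and predicted trip counts
for ax, col in zip(axes, ["actual", "prediction"]):
    test_gdf.plot(column=col, ax=ax, legend=True, markersize=30)
    ax.set_title(col.capitalize())
    ax.set_axis_off()
```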

The results are... not great

The good

We are capturing the general difference between stations within Center City and those outside it

The bad

The values of the trip counts for those stations within Center City do not seem to be well-represented

Can we improve the model?

Yes!

This is a classic example of underfitting, for a few reasons:

Features to investigate:

Other options for bikeshare data

When trying to improve the accuracy of the model, another option is incorporating additional data. In this case, we can look to other cities and include trip data for these cities. Some good places to start: