Lecture 12A: Predictive Modeling Part 2

Nov 17, 2020

Housekeeping

Schedule Changes

Thanksgiving holiday will alter our schedule a bit:

The Canvas calendar and course home page schedule should reflect these changes.

This week: Predictive modeling continued

Focus: much more hands-on experience with feature engineering and adding spatial features

Last time

First, let's set up all of the imports we'll need from scikit-learn:
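The original import cell is not reproduced here. As a sketch, assuming we need the tools used later in this lecture (linear regression, random forests, pipelines, column transformers, cross-validation, and nearest neighbors), the imports might look something like this:

```python
# Assumed set of imports; the exact list used in the lecture is not shown.
import numpy as np
import pandas as pd

# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Preprocessing and pipelines
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score

# Nearest neighbor searches (used for the spatial features)
from sklearn.neighbors import NearestNeighbors
```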

Review: Predicting housing prices in Philadelphia

Load data from the Office of Property Assessment

Let's download data for single-family properties in Philadelphia that had their last sale during 2019.

Sources:

Let's focus on numerical features only first

Run a linear regression model as a baseline:
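As a rough illustration of the baseline step, here is a minimal sketch on synthetic data. The feature matrix, the target, and the 70/30 split are all stand-ins, not the actual property data or the settings used in class:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features and the (log) sale price
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
y = 12.0 + X @ np.array([0.5, 0.2, -0.1]) + rng.normal(scale=0.3, size=n)

# Hold out 30% of the data as a test set (an assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale the features, then fit an ordinary least squares model
linear = make_pipeline(StandardScaler(), LinearRegression())
linear.fit(X_train, y_train)

# .score() returns R^2 for regressors
train_score = linear.score(X_train, y_train)
test_score = linear.score(X_test, y_test)
print("Training R^2:", train_score)
print("Test R^2:", test_score)
```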

Run cross-validation on a random forest model:
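A minimal sketch of this step, again on synthetic data. The number of trees and the number of folds here are arbitrary choices for illustration, not the values used in class:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a non-linear relationship
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# 3-fold cross-validation; for regressors the default score is R^2
scores = cross_val_score(forest, X, y, cv=3)
print("R^2 for each fold:", scores)
print("Mean R^2:", scores.mean())
```

If the per-fold scores are similar to one another, that is some evidence the model is not overly sensitive to which rows it was trained on.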

Test score improved!

The model appears to generalize reasonably well

Note: we should also be optimizing hyperparameters to see if we can find additional improvements!

Which variables were most important?
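One way to answer this is to read the `feature_importances_` attribute of the fitted random forest and pair it with the column names. The sketch below uses synthetic data, and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; the column names are hypothetical stand-ins
rng = np.random.default_rng(0)
feature_names = ["living_area", "lot_area", "num_rooms"]
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_names)
y = 5.0 * X["living_area"] + 0.5 * X["num_rooms"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; sort so the most important feature comes first
importance = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importance)
```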

On to new material...

How to handle categorical data?

We can use a technique called one-hot encoding

Steps:

One-hot encoding in scikit-learn

Let's try out the example data of colors:
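The original color data is not shown here, so the sketch below assumes a tiny made-up column of three colors. Each unique category gets its own 0/1 column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up example data: one categorical column with three unique values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; convert to dense to view it
encoded = encoder.fit_transform(colors).toarray()

# Categories are stored in sorted order: blue, green, red
print(encoder.categories_[0])
print(encoded)
```

Each row has exactly one 1, in the column that matches its color.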

Let's apply separate transformers for our numerical and categorical columns:

Note: the handle_unknown='ignore' parameter ensures that if a category shows up in our test set but not in our training set, no error will be raised. The unseen category is simply encoded as all zeros.
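Here is a small sketch of a column transformer with one transformer per column type. The column names (`area`, `zip_code`) and the values are made up. The test row contains a ZIP code that never appears in the training rows, so its one-hot columns come back as all zeros rather than raising an error:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training and test data; the test ZIP code is not in the training set
train = pd.DataFrame(
    {"area": [1000.0, 1500.0, 2000.0], "zip_code": ["19103", "19104", "19103"]}
)
test = pd.DataFrame({"area": [1200.0], "zip_code": ["19147"]})

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["zip_code"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy printing
)

train_features = transformer.fit_transform(train)
test_features = transformer.transform(test)

# Columns: scaled area, zip_code=19103, zip_code=19104
print(train_features)
print(test_features)  # the two ZIP code columns are both 0
```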

Initialize the pipeline object, using the column transformer and the random forest regressor

Now, let's fit the model.
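Putting the last two steps together, a minimal sketch might look like the following. The data is synthetic (a made-up area column plus a ZIP code that shifts the price), and the column names and model settings are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: log price depends on area and on a per-ZIP offset
rng = np.random.default_rng(0)
n = 400
zips = rng.choice(["19103", "19104", "19147", "19125"], size=n)
area = rng.uniform(800, 3000, size=n)
zip_effect = pd.Series(zips).map(
    {"19103": 0.9, "19104": 0.3, "19147": 0.6, "19125": 0.0}
).to_numpy()
log_price = 11 + 0.0003 * area + zip_effect + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({"area": area, "zip_code": zips})

train_X, test_X, train_y, test_y = train_test_split(
    data, log_price, test_size=0.3, random_state=0
)

# Separate transformers for the numerical and categorical columns
transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["zip_code"]),
    ]
)

# The pipeline runs the transformer first, then the regressor
pipe = make_pipeline(
    transformer, RandomForestRegressor(n_estimators=50, random_state=0)
)
pipe.fit(train_X, train_y)

test_r2 = pipe.score(test_X, test_y)
print("Test R^2:", test_r2)
```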

Important

Substantial improvement on test set when including ZIP codes

$R^2$ of ~0.30 improved to $R^2$ of ~0.56!

Takeaway: neighborhood-based effects play a crucial role in determining housing prices.

Side Note: to fully validate the model, we should run $k$-fold cross-validation and optimize the hyperparameters of the model as well...

This will be part of assignment #7

But how crucial? Let's plot the importances

But first, we need to know the column names! The one-hot encoder created a column for each category type...

Takeaways

Why is feature engineering so important?

Garbage in, garbage out

Takeaway: If your input features are poorly designed (for example, completely unrelated to the thing you want to predict), then no matter how good your machine learning model is or how well you "train" it, the model will never be able to translate the features into an accurate predicted value.

Adding spatial features to the housing price model

Yes, let's add distance-based features

Spatial amenity/disamenity features

The strategy

Examples of new possible features...

Distance from each sale to:

Example #1: 311 Graffiti Calls

Source: https://www.opendataphilly.org/dataset/311-service-and-information-requests

Step 1: Download the data from the CARTO database

We'll only pull data from 2019.

Step 2: Get the x/y coordinates of both datasets

We will need to:

Step 3: Calculate the nearest neighbor distances

For this, we will use the $k$ nearest neighbors algorithm from scikit-learn.

For each sale:

Note: I am using k=5 here without any real justification. In practice, you would want to try several values of k and keep the one that gives the best model performance.
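A minimal sketch of the neighbor search, using random points as stand-ins for the sale and 311 call locations. The variable names are assumptions, and the coordinates are assumed to be in a projected CRS so that distances come out in meters:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-ins for the x/y coordinates of each dataset
rng = np.random.default_rng(1)
sales_xy = rng.uniform(0, 1000, size=(10, 2))   # 10 sales
calls_xy = rng.uniform(0, 1000, size=(200, 2))  # 200 graffiti calls

k = 5

# Fit on the locations we want to measure distance TO...
nbrs = NearestNeighbors(n_neighbors=k)
nbrs.fit(calls_xy)

# ...and query with the locations we measure distance FROM
dists, indices = nbrs.kneighbors(sales_xy)

print(dists.shape)    # one row per sale, one column per neighbor
print(indices.shape)  # row numbers in calls_xy of each neighbor
```

Each row of `dists` is sorted from nearest to farthest, and `indices` gives the matching rows in `calls_xy`.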

What did we just calculate?

Can we reproduce these distances?
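As a sanity check, we can recompute the distances for a single sale by hand using the Euclidean distance formula and the neighbor indices. The sketch below repeats the setup with random stand-in points so that it runs on its own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-ins for the sale and 311 call coordinates
rng = np.random.default_rng(1)
sales_xy = rng.uniform(0, 1000, size=(10, 2))
calls_xy = rng.uniform(0, 1000, size=(200, 2))

nbrs = NearestNeighbors(n_neighbors=5).fit(calls_xy)
dists, indices = nbrs.kneighbors(sales_xy)

# Coordinates of the first sale and of its 5 nearest calls
first_sale = sales_xy[0]
neighbors = calls_xy[indices[0]]

# Euclidean distance: sqrt(dx^2 + dy^2) for each neighbor
manual = np.sqrt(((neighbors - first_sale) ** 2).sum(axis=1))

matches = np.allclose(manual, dists[0])
print(manual)
print(dists[0])
print("Match:", matches)
```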

Use the log of the average distance as the new feature

We'll average over the column axis: axis=1
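A tiny worked example with a made-up distance matrix. The base of the logarithm is an assumption here (base 10 is used below); the natural log would work just as well for a tree-based model:

```python
import numpy as np

# Made-up distances: 2 sales (rows), 3 nearest neighbors each (columns)
dists = np.array(
    [
        [10.0, 20.0, 30.0],
        [100.0, 100.0, 100.0],
    ]
)

# axis=1 averages across the columns, giving one value per sale
avg_dist = dists.mean(axis=1)

# Log-transform to compress the long tail of large distances
log_avg_dist = np.log10(avg_dist)

print(avg_dist)      # average distances of 20 and 100
print(log_avg_dist)
```

Note that an average distance of exactly zero would give negative infinity, so in practice you may want to check for zeros before taking the log.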

Let's plot a hex map of the new feature!

Example #2: Subway stops

Use the osmnx package to get subway stops in Philly. We can use the ox.geometries_from_polygon() function (renamed ox.features_from_polygon() in newer osmnx releases).

The stops on the Market-Frankford and Broad St. subway lines!

Now, get the distances to the nearest subway stop

We'll use $k=1$ to get the distance to the nearest stop.
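With $k=1$ the distance array has a single column, so there is nothing to average. A small sketch with hand-picked coordinates (two hypothetical stops and three hypothetical sales) makes the result easy to check by eye:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hand-picked stand-in coordinates
stops_xy = np.array([[0.0, 0.0], [10.0, 0.0]])
sales_xy = np.array([[1.0, 0.0], [8.0, 0.0], [0.0, 5.0]])

# k=1: only the single nearest stop for each sale
nbrs = NearestNeighbors(n_neighbors=1).fit(stops_xy)
dists, indices = nbrs.kneighbors(sales_xy)

# dists has shape (3, 1); take the only column to get a 1D array
nearest_dist = dists[:, 0]
nearest_stop = indices[:, 0]

print(nearest_dist)  # distances of 1, 2, and 5
print(nearest_stop)  # stops 0, 1, and 0
```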

Let's plot a hex map again!

Looks like it worked!

What about correlations?

Let's have a look at the correlations of numerical columns:
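A minimal sketch with pandas, using made-up columns. Selecting the numerical columns first keeps text columns such as ZIP codes out of the correlation matrix:

```python
import numpy as np
import pandas as pd

# Made-up data: one text column and two numerical columns
rng = np.random.default_rng(3)
area = rng.uniform(800, 3000, size=200)
df = pd.DataFrame(
    {
        "area": area,
        "log_price": 11 + 0.0004 * area + rng.normal(scale=0.1, size=200),
        "zip_code": ["19103"] * 200,
    }
)

# Keep only numerical columns, then compute pairwise Pearson correlations
corr = df.select_dtypes(include="number").corr()
print(corr)
```

The diagonal is always 1 (each column is perfectly correlated with itself), so the off-diagonal entries are the ones to look at.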

Now, let's re-run our model...did it help?

A small improvement!

$R^2$ of ~0.58 improved to $R^2$ of ~0.62

How about the top 30 feature importances now?

Both new spatial features are in the top 5 in terms of importance!

Exercise: How about other spatial features?

Modify the get_xy_from_geometry() function to use the "centroid" of the geometry column.

Note: you can take the centroid of a Point() or Polygon() object. For a Point(), you just get the x/y coordinates back.
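The original get_xy_from_geometry() function is not shown here, so the version below is a hypothetical sketch of what the centroid-based variant could look like. It assumes a GeoDataFrame in a projected CRS:

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import Point, Polygon


def get_xy_from_geometry(df):
    """
    Return an (N, 2) array of x/y coordinates, taken from the
    centroid of each geometry in the GeoDataFrame.

    Hypothetical version; the function used in class may differ.
    """
    centroids = df.geometry.centroid
    return np.column_stack((centroids.x, centroids.y))


# Mixed geometries: a point and a 2x2 square, in a projected CRS
gdf = gpd.GeoDataFrame(
    geometry=[Point(1, 2), Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
    crs="EPSG:3857",
)

xy = get_xy_from_geometry(gdf)
print(xy)  # the point is returned as-is; the square gives its center
```

Because the centroid of a point is the point itself, the same function works for datasets of points (such as subway stops) and polygons (such as parks).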

Universities

New feature: Distance to the nearest university/college

Parks

New feature: Distance to the nearest park centroid

Notes

City Hall

New feature: Distance to City Hall.

Notes

New Construction Permits

New feature: Distance to the 5 nearest new construction permits from 2019

Notes

Aggravated Assaults

New feature: Distance to the 5 nearest aggravated assaults in 2019

Notes

Abandoned Vehicle 311 Calls

New feature: Distance to the 5 nearest abandoned vehicle 311 calls in 2019

Notes