Week 11B: Supervised Learning with Scikit-Learn

Nov 12, 2020

Reminder

Final Project Groups and Topic Ideas

https://canvas.upenn.edu/courses/1533812/discussion_topics/6292804

The plan for today

Supervised learning with scikit-learn

Example: does money make people happier?

We'll load data compiled from two data sources:

First step: set up the test/train split of input data:
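A minimal sketch of this step, using a synthetic stand-in for the GDP / life-satisfaction data (the variable names and value ranges here are assumptions, not the lecture's actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the GDP-per-capita / life-satisfaction data
rng = np.random.default_rng(42)
gdp = rng.uniform(10_000, 60_000, size=(100, 1))               # feature matrix
satisfaction = 4 + 5e-5 * gdp[:, 0] + rng.normal(0, 0.5, 100)  # target values

# Reserve 30% of the rows as a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    gdp, satisfaction, test_size=0.3, random_state=42
)
```

Fixing `random_state` makes the split reproducible from run to run.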

Where we left off: overfitting

As we increase the polynomial degree (add more and more polynomial features) two things happen:

  1. Training accuracy goes way up
  2. Test accuracy goes way down

This is the classic case of overfitting — our model does not generalize well at all.
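The effect is easy to reproduce on toy data (synthetic here, since the lecture's dataset isn't included): as the degree passed to `PolynomialFeatures` grows, the training R² climbs while the test R² falls behind.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # R^2 on the data the model saw vs. data it did not see
    scores[degree] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))
```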

Regularization to the rescue?

Remember, regularization penalizes large parameter values and complex fits

Let's gain some intuition:

Important

Set up a grid of GDP per capita points to make predictions for:
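For example (the value range here is an assumption; the real grid would span the observed GDP values):

```python
import numpy as np

# 100 evenly spaced GDP-per-capita values, reshaped to the 2-D
# (n_samples, n_features) layout that scikit-learn's predict() expects
gdp_grid = np.linspace(10_000, 60_000, num=100).reshape(-1, 1)
```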

Takeaways

Recap: what we learned so far

How can we improve?

More feature engineering!

In this case, I've done the hard work for you and pulled additional country properties from the OECD's website.

Decision trees: a more sophisticated modeling algorithm

We'll move beyond simple linear regression and see if we can get a better fit.

A decision tree learns decision rules from the input features:

A decision tree classifier for the Iris data set

More info on the iris data set
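A minimal version of that classifier, using the copy of the iris data that ships with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# A shallow tree: at most two levels of if/else decision rules
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

accuracy = tree.score(iris.data, iris.target)
```

Even with only two levels of splits, the tree separates the three species quite well.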

Regression with decision trees is similar

For a specific corner of the input feature space, the decision tree predicts an output target value
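A small sketch of that behavior: a depth-2 regression tree partitions the input axis into at most four regions and predicts a single constant value within each one.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 5, 50).reshape(-1, 1)
y = np.sin(X[:, 0])

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

# Piecewise-constant predictions: one value per leaf region
n_distinct = len(np.unique(tree.predict(X)))
```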

Decision trees suffer from overfitting

Decision trees can be very deep with many nodes — this will lead to overfitting your dataset!

Random forests: an ensemble solution to overfitting

This is an example of ensemble learning: combining multiple estimators to achieve better overall accuracy than any one model could achieve
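A sketch of the idea on noisy synthetic data: averaging 100 trees typically generalizes better than a single fully grown tree, which memorizes the noise (exact scores depend on the random seed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# One unpruned tree vs. an ensemble of 100 trees
single = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

single_r2 = single.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
```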

Let's split our data into training and test sets again:

Let's check for correlations in our input data

Let's do some fitting...

New: Pipelines support models as the last step!

Establish a baseline with a linear model:

Now fit a random forest:
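Both fits can be set up as pipelines that end in a model (a sketch with synthetic data; the lecture uses the OECD features instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Pipelines whose last step is an estimator: calling .fit()/.score() on
# the pipeline scales the features and then runs the model
linear = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
forest = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=1)
).fit(X, y)

linear_r2 = linear.score(X, y)
forest_r2 = forest.score(X, y)
```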

Which variables matter the most?

Because a random forest tracks how much each feature reduces the prediction error at the splits of its many trees, the algorithm can estimate which features improve the fit the most.
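A sketch of the `feature_importances_` attribute on synthetic data where only the first feature actually drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first column matters; the other two are pure noise
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # one value per feature, sums to 1
```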

Let's improve our fitting with k-fold cross validation

  1. Break the data into a training set and test set
  2. Split the training set into k subsets (or folds)
  3. Run the learning algorithm k times, each time training on k-1 of the folds and validating on the held-out fold; average the scores across the runs to evaluate the model and choose the best-fitting parameters

For more information, see the scikit-learn docs

The cross_val_score() function will automatically partition the training set into k folds, fit the model on each combination of k-1 folds, and return the score on each held-out fold.

It takes a Pipeline object, the training features, and the training labels as arguments

Let's do 3-fold cross validation
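A minimal 3-fold example (synthetic data again; passing `cv=3` yields one score per fold):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=150)

forest = RandomForestRegressor(n_estimators=100, random_state=0)

# Each entry is the R^2 score on one held-out fold
scores = cross_val_score(forest, X, y, cv=3)
mean_score = scores.mean()
```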

Takeaway: the random forest model is clearly more accurate

Question: why did I choose to use 100 estimators in the RF model?

This is when cross validation becomes very important

Enter GridSearchCV

A utility function that will:

  1. Loop over every combination of hyperparameters in a user-supplied grid
  2. Evaluate each combination with k-fold cross validation
  3. Identify the combination with the best cross-validated score

More info

Let's do a search over the n_estimators parameter and the max_depth parameter:

Make the grid of parameters to search

When the model lives inside a Pipeline, parameter names take the form "[step name]__[parameter name]" (note the double underscore).
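Put together, the search might look like this (a sketch; the step name "forest" and the candidate values are assumptions, not the lecture's exact grid):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] + rng.normal(0, 0.1, size=100)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(random_state=0)),
])

# Keys follow "[step name]__[parameter name]" (double underscore)
param_grid = {
    "forest__n_estimators": [10, 50],
    "forest__max_depth": [2, None],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)  # after fitting, best_params_ holds the winning combination
```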

Now let's evaluate!

We'll define a helper utility function to calculate the accuracy in terms of the mean absolute percent error
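A sketch of such a helper (the function name here is an assumption):

```python
import numpy as np

def mean_absolute_percent_error(y_true, y_pred):
    """Average of |actual - predicted| / |actual|, as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape = mean_absolute_percent_error([100, 200], [110, 180])  # ~10%
```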

Linear model results

Random forest results with default parameters

The random forest model with the optimal hyperparameters

Small improvement!

Recap

Part 2: Modeling residential sales in Philadelphia

In this part, we'll use a random forest model and housing data from the Office of Property Assessment to predict residential sale prices in Philadelphia

Machine learning models are increasingly common in the real estate industry

The hedonic approach to housing prices

What contributes to the price of a house?

Note: We'll focus on the first two components in this analysis (and in assignment #7)

Why are these kinds of models important?

Too often, these models perpetuate inequality: low-value homes are over-assessed and high-value homes are under-assessed

Philadelphia's assessments are...not good

Data from the Office of Property Assessment

Let's download data for properties in Philadelphia that had their last sale during 2019.

Sources:

The OPA is messy

Lots of missing data.

We can use the missingno package to visualize the missing data easily.
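A toy illustration (hypothetical column names; the real OPA table has many more fields). The pandas summary below gives the per-column missing fraction, and missingno's msno.matrix(df) renders the same information graphically:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sale_price": [100_000, 250_000, np.nan, 175_000],
    "total_livable_area": [1200, np.nan, np.nan, 1500],
})

# Fraction of missing values in each column; missingno's msno.matrix(df)
# draws this same information as a chart
missing_fraction = df.isna().mean()
```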

Let's focus on numerical features only first

Test score slightly better

Model appears to generalize reasonably well!

Note: we should also be optimizing hyperparameters to see if we can find additional improvements!

Which variables were most important?

More next lecture...

That's it!