Week 11A: Supervised Learning with Scikit-Learn

Nov 10, 2020

Housekeeping

Last week

Where we left off: Extending DBSCAN beyond just spatial coordinates

DBSCAN can identify high-density clusters using more than just spatial coordinates, as long as the features are properly normalized

Exercise: Extracting patterns from NYC taxi rides

I've extracted data for taxi pickups and dropoffs occurring in the Williamsburg neighborhood of NYC from the NYC taxi open data.

Includes data for:

Goal: identify clusters of similar taxi rides that are not only clustered spatially, but also clustered for features like hour of day and trip distance

Inspired by this CARTO blog post

Step 1: Load the data

Step 2: Extract and normalize several features

We will focus on the following columns:

Use the StandardScaler to normalize these features.
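A minimal sketch of this step, assuming the taxi data has been loaded into a DataFrame named `taxi`; the column names here are assumptions and should be adjusted to match your data:

```python
from sklearn.preprocessing import StandardScaler

# Feature columns to cluster on (hypothetical names; match them to your DataFrame)
feature_columns = ["pickup_x", "pickup_y", "trip_distance", "pickup_hour"]

# Scale each feature to zero mean and unit variance
scaler = StandardScaler()
scaled_features = scaler.fit_transform(taxi[feature_columns])
```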

Step 3: Run DBSCAN to extract high-density clusters

Hint: If the algorithm is taking a long time to run (more than a few minutes), the eps is probably too big!
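A sketch of the clustering step, continuing from the scaled features above (the eps and min_samples values are illustrative only, not the values used in class):

```python
from sklearn.cluster import DBSCAN

# Illustrative parameter values; they need tuning for your data
cores = DBSCAN(eps=0.25, min_samples=25)
cores.fit(scaled_features)

# Save the cluster labels back to the DataFrame (-1 indicates noise)
taxi["label"] = cores.labels_
```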

Step 4: Identify the 5 largest clusters

Group by the cluster label, then calculate and sort the cluster sizes to find the label numbers of the 5 largest clusters
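One way to do this with pandas, continuing the sketch above:

```python
# Count the size of each cluster, ignoring the noise label (-1),
# and sort from largest to smallest
cluster_sizes = (
    taxi.query("label != -1")
    .groupby("label")
    .size()
    .sort_values(ascending=False)
)

# Label numbers of the 5 largest clusters
top5_labels = cluster_sizes.index[:5]
```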

Step 5: Get mean statistics for the top 5 largest clusters

To better identify trends in the top 5 clusters, calculate the mean trip distance and pickup_hour for each of the clusters.
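For example:

```python
# Mean trip distance and pickup hour for each of the 5 largest clusters
top5 = taxi.loc[taxi["label"].isin(top5_labels)]
top5.groupby("label")[["trip_distance", "pickup_hour"]].mean()
```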

Step 6a: Visualize the top 5 largest clusters

Now visualize the top 5 largest clusters:

Hints:

Step 6b: Visualizing one cluster at a time

Another good way to visualize the results is to explore the clusters one at a time, plotting both the pickups and dropoffs to identify the trends.

Use different colors for pickups/dropoffs to easily identify them.

Make it a function so we can repeat it easily:
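A minimal sketch of such a function, using the same hypothetical coordinate column names as above:

```python
import matplotlib.pyplot as plt

def plot_cluster(df, label):
    """Plot pickups and dropoffs for a single cluster label."""
    cluster = df.loc[df["label"] == label]

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(cluster["pickup_x"], cluster["pickup_y"], s=5, color="red", label="Pickups")
    ax.scatter(cluster["dropoff_x"], cluster["dropoff_y"], s=5, color="blue", label="Dropoffs")
    ax.legend(loc="upper right")
    ax.set_aspect("equal")
    return ax

# Examine the largest cluster first; repeat for the others
plot_cluster(taxi, top5_labels[0])
```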

Interpreting clustering results: the perils of algorithms

Algorithmic bias

An example from the CARTO analysis:

We wanted to explore how we can use data to better understand and define communities of people, going beyond spatial borders like zip code and neighborhood boundaries.

However, their clusters include groupings by class and religion, e.g., "working class" and "Orthodox Jewish" residents. While the intention of this analysis may have been benign, the results could have easily been misused to target residents in a potentially discriminatory way.

We'll see more examples of algorithmic fairness on assignment #7 when modeling housing prices in Philadelphia.

Now onto new material...

Reminder: clustering is an example of unsupervised learning

Today: an example of supervised learning

Examples

Today, we'll walk through an end-to-end regression example to predict Philadelphia's housing prices

Model-based learning

Machine learning is really just an optimization problem

Given your training set of data, which model parameters best represent the observed data?

1. Choose a model

2. The model has an associated cost function

The cost function measures the difference between the model's predictions and the observed data
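For example, a common cost function for a regression model is the mean squared error between the predictions and the observed values:

$$ \mathrm{MSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i(\theta) \right)^2 $$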

3. "Learn" the best model parameters

In scikit-learn, you will call the fit() method on your algorithm.

Recap: the steps involved

  1. Wrangle and understand data.
  2. Select a model that would be best for your dataset.
  3. Train the model on the training data — the learning algorithm searches for the best model parameters
  4. Apply the model to new data to make predictions.

Key goal: how can we do this in a way that ensures the model is as generalizable as possible?

What could go wrong?

Mistake #1: "bad data"

Or: "garbage in, garbage out"

Common issues:

Mistake #2: "bad algorithm"

Regularization: keeping it simple

Key question: How do we know if a model will perform well on new data?

Option #1: a train/test split

Common to use 80% of data for your training set and 20% for your test set

Option #2: k-fold cross-validation

  1. Break the data into a training set and test set
  2. Split the training set into k subsets (or folds); in each iteration, hold out one fold as the validation set
  3. Run the learning algorithm on each combination of folds, using the average score across all of the runs to find the best-fitting model parameters

For more information, see the scikit-learn docs
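A minimal sketch using scikit-learn's cross_val_score, assuming hypothetical training arrays X_train and y_train:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Score a linear regression with 5 folds: fit on 4 folds, evaluate on the held-out fold
model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)

# The average score across the 5 runs
print(scores.mean())
```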

Let's try out a simple example: does money make people happier?

We'll load data compiled from two data sources:

Make a quick plot
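A minimal sketch, assuming the combined data is in a DataFrame named data with hypothetical columns gdp_per_capita and life_satisfaction:

```python
# Quick scatter plot: GDP per capita vs. life satisfaction
data.plot(kind="scatter", x="gdp_per_capita", y="life_satisfaction")
```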

There's a roughly linear trend here...let's start there

A simple model with only two parameters: $\theta_1$ and $\theta_2$
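For example, a straight line relating GDP per capita ($x$) to predicted life satisfaction ($\hat{y}$):

$$ \hat{y} = \theta_1 + \theta_2 x $$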

Use the LinearRegression model object from scikit-learn.

This is not really machine learning — it simply finds the Ordinary Least Squares fit to the data.

Note: scikit-learn expects the features to be a 2D array with shape (number of observations, number of features).

We are explicitly adding a second axis with np.newaxis.

Now, fit the model using the model.fit(X, y) syntax.

This will "train" our model, using an optimization algorithm to identify the best-fit parameters.
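A sketch of these steps, using the hypothetical data and column names from above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The feature matrix must be 2D: add a second axis with np.newaxis
X = data["gdp_per_capita"].values[:, np.newaxis]  # shape: (n_observations, 1)
y = data["life_satisfaction"].values

# Fit the model (ordinary least squares)
model = LinearRegression()
model.fit(X, y)

# Estimated parameters, marked by the trailing underscore
print(model.intercept_, model.coef_)
```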

Reminder: What's with the "_" at the end of variable names?

These represent "estimated" properties of the model — this is how scikit learn signals to the user that these attributes depend on the fit() function being called beforehand.

More info here.

Note: In this case, our model is the same as ordinary least squares, and no actual optimization is performed since an exact solution exists.

How good is the fit?

Note: you must call the fit() function before calling the score() function.

Let's plot the data and the predicted values

Use the predict() function to predict new values.
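Continuing the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# R^2 of the fit (note: fit() must be called first)
print(model.score(X, y))

# Predict over a grid of GDP values and overlay the fitted line on the data
gdp_grid = np.linspace(X.min(), X.max(), 100)[:, np.newaxis]
y_pred = model.predict(gdp_grid)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, label="Data")
ax.plot(gdp_grid[:, 0], y_pred, color="red", label="Model prediction")
ax.legend()
```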

Not bad....but what did we do wrong?

1. We also fit and evaluated our model on the same training set!

2. We didn't scale our input data features!

Scikit-learn provides a utility function to split our input data:

These are new DataFrame objects, with lengths determined by the split percentage:
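A minimal sketch, using the hypothetical data DataFrame from above (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Randomly split the DataFrame: 80% training, 20% test
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

print(len(train_set), len(test_set))
```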

Now, make our feature and label arrays:

Use the StandardScaler to scale the GDP per capita:

Now, let's fit on the training set and evaluate on the test set
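A sketch of these steps under the same column-name assumptions as before:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Feature and label arrays for each split
X_train = train_set["gdp_per_capita"].values[:, np.newaxis]
y_train = train_set["life_satisfaction"].values
X_test = test_set["gdp_per_capita"].values[:, np.newaxis]
y_test = test_set["life_satisfaction"].values

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on the training set, evaluate on the unseen test set
model = LinearRegression().fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))
```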

Unsurprisingly, our fit gets worse when we test on unseen data

Our accuracy was artificially inflated the first time, since we trained and tested on the same data.

Can we do better? Let's do some feature engineering...

We'll use scikit-learn's PolynomialFeatures to add new polynomial features from the GDP per capita.

Let's try up to degree 3 polynomials ($x^3$)

Now we have two transformations to make:

  1. Scale our features
  2. Create the polynomial features
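A sketch of both transformations, applied manually before the model fit (continuing the training/test arrays from above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# 1. Scale the GDP per capita, 2. add polynomial features up to degree 3
scaler = StandardScaler()
poly = PolynomialFeatures(degree=3)

X_train_poly = poly.fit_transform(scaler.fit_transform(X_train))
X_test_poly = poly.transform(scaler.transform(X_test))

# Re-fit and evaluate on the test set
model = LinearRegression().fit(X_train_poly, y_train)
print(model.score(X_test_poly, y_test))
```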

The accuracy improved!

Pipelines: making multiple transformations much easier

We can turn our preprocessing steps into a Pipeline object using the make_pipeline() function.

Individual steps can be accessed via their names in a dict-like fashion:

Let's apply this pipeline to our predicted GDP values for our plot:
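A sketch of the pipeline version, assuming the training array and GDP grid from the earlier sketches:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Chain the two preprocessing steps into a single pipeline
pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3))

# Steps are named after their lower-cased class names and accessed dict-style
print(pipeline.named_steps["polynomialfeatures"])

# Apply both transformations in one call, e.g. to the training data
# and to the grid of GDP values used for plotting
X_train_poly = pipeline.fit_transform(X_train)
gdp_grid_poly = pipeline.transform(gdp_grid)
```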

The additional polynomial features introduced some curvature and improved the fit!

How about large polynomial degrees?

Overfitting alert!

As we increase the polynomial degree, two things happen:

  1. Training accuracy goes way up
  2. Test accuracy goes way down

This is the classic case of overfitting — our model does not generalize well at all.
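A sketch of how this behavior could be checked, looping over a few illustrative degrees and comparing training vs. test accuracy:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

for degree in [1, 3, 5, 10]:
    pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(degree=degree))
    Xtr = pipeline.fit_transform(X_train)
    Xte = pipeline.transform(X_test)

    model = LinearRegression().fit(Xtr, y_train)
    print(degree, model.score(Xtr, y_train), model.score(Xte, y_test))
```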

Regularization to the rescue?

Remember, regularization penalizes large parameter values and complex fits
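For example, Ridge regression adds an L2 penalty on the coefficient values; the choice of Ridge and the alpha value here are assumptions for illustration:

```python
from sklearn.linear_model import Ridge

# alpha controls the strength of the penalty; larger values favor simpler fits
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_poly, y_train)
print(ridge.score(X_test_poly, y_test))
```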