Lecture 10B: Clustering Analysis in Python

Nov 5, 2020

Housekeeping

Last lecture (10A)

Today

Part 1: Non-spatial clustering

The goal

Partition a dataset into groups such that observations within a group share a similar set of attributes, or features, while observations in different groups have dissimilar features.

Minimize the intra-cluster variance and maximize the inter-cluster variance of features.

Some intuition

K-Means clustering
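K-Means makes the intra-cluster variance idea concrete: it chooses cluster assignments that minimize the total within-cluster sum of squared distances to the cluster centroids (this is the standard formulation, not taken from the slides):

```latex
\min_{C_1, \dots, C_k} \; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

Here $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is that cluster's centroid.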

Example: Clustering countries by health and income

Read the data from a URL:
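`pd.read_csv` accepts a URL just like a local file path. The lecture's actual URL isn't reproduced here, so this minimal sketch reads an inline CSV instead; the column names and values are assumptions standing in for the real health/income data:

```python
import pandas as pd
from io import StringIO

# In the lecture, the data is read directly from a URL, e.g.:
#   df = pd.read_csv("https://example.com/data.csv")  # hypothetical URL
# read_csv treats URLs and local paths interchangeably.

# Minimal illustration with an inline CSV (column names are assumptions):
csv = StringIO(
    """country,income,life_expectancy
Afghanistan,1925,57.6
Albania,10620,76.0
Algeria,13434,76.5
"""
)
df = pd.read_csv(csv)
print(df.head())
```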

K-Means with scikit-learn

Normalizing features with the pre-processing module

Use the fit_transform() function to scale your features

Now fit the scaled features
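A hedged sketch of the scale-then-fit workflow above, using made-up numbers in place of the real country income/health features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy feature matrix standing in for (income, life expectancy); values are made up
X = np.array([
    [1000, 55.0],
    [1200, 57.0],
    [40000, 80.0],
    [42000, 81.0],
    [15000, 70.0],
    [16000, 71.0],
])

# Features on very different scales would dominate the Euclidean distance,
# so standardize each column to zero mean and unit variance first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means on the *scaled* features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_scaled)

print(kmeans.labels_)           # cluster label for each row
print(kmeans.cluster_centers_)  # centroids, in scaled units
```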

Exercise: Clustering neighborhoods by Airbnb stats

I've extracted neighborhood Airbnb statistics for Philadelphia neighborhoods from Tom Slee's website.

The data includes average price per person, overall satisfaction, and number of listings.

Two good references for Airbnb data

Original research study: How Airbnb's Data Hid the Facts in New York City

Step 1: Load the data with pandas

The data is available in CSV format ("philly_airbnb_by_neighborhoods.csv") in the "data/" folder of the repository.

Step 2: Perform the K-Means fit

Step 3: Calculate average features per cluster

To gain some insight into our clusters, after calculating the K-Means labels:
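Steps 1–3 can be sketched end-to-end. The column names and numbers below are assumptions standing in for the real CSV, which you would load with `pd.read_csv("data/philly_airbnb_by_neighborhoods.csv")`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for the real neighborhood data (column names are assumptions)
df = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D", "E", "F"],
    "price_per_person": [30.0, 35.0, 120.0, 110.0, 60.0, 65.0],
    "overall_satisfaction": [4.5, 4.4, 4.9, 4.8, 3.9, 4.0],
    "N": [200, 180, 40, 55, 90, 100],
})

features = ["price_per_person", "overall_satisfaction", "N"]

# Step 2: scale the features, then fit K-Means
X = StandardScaler().fit_transform(df[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
df["label"] = kmeans.labels_

# Step 3: average features per cluster give each cluster a "profile"
profile = df.groupby("label")[features].mean()
print(profile)
```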

Step 4: Plot a choropleth, coloring neighborhoods by their cluster label
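Once the cluster labels are merged onto the neighborhood polygons, geopandas can draw the choropleth directly. A minimal sketch with two made-up square "neighborhoods":

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs in scripts too
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Polygon

# Tiny stand-in for the neighborhoods GeoDataFrame; in the exercise you
# would merge the K-Means labels onto the real polygons first
gdf = gpd.GeoDataFrame({
    "neighborhood": ["A", "B"],
    "label": [0, 1],
    "geometry": [
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
})

# Color each polygon by its (categorical) cluster label
ax = gdf.plot(column="label", categorical=True, legend=True, cmap="Set1")
ax.set_axis_off()
plt.savefig("clusters.png")
```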

Step 5: Plot an interactive map

Use altair to plot the clustering results, with a tooltip showing the neighborhood name and cluster label.

Hint: See week 3B's lecture on interactive choropleths with altair

Based on these results, where would you want to stay?

Cluster #3 seems like the best bang for your buck!

Part 2: Spatial clustering

Now on to the more traditional view of "clustering"...

DBSCAN

"Density-Based Spatial Clustering of Applications with Noise"

Two key parameters

  1. eps: The maximum distance between two samples for them to be considered as in the same neighborhood.
  2. min_samples: The number of samples in a neighborhood for a point to be considered as a core point (including the point itself).
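A tiny sketch of these two parameters with scikit-learn's `DBSCAN` estimator, using made-up points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one isolated point (made-up coordinates)
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0],  # far from everything -> noise
])

# eps: neighborhood radius; min_samples: points (including the point itself)
# needed within eps for a point to count as a core point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # [0 0 0 1 1 1 -1]; -1 marks noise
```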

Example Scenario

[Figure: DBSCAN-Illustration.svg]


Importance of parameter choices

A higher min_samples or a lower eps means a higher density is necessary to form a cluster.

Example: OpenStreetMap GPS traces in Philadelphia

DBSCAN basics

The function returns two objects, which we call cores and labels. cores contains the indices of the points classified as core samples.

The length of cores tells you how many core samples we have:

The labels array tells you the cluster number each point belongs to. Points classified as noise receive a cluster label of -1:

The labels array is the same length as our input data, so we can add it as a column in our original data frame

The number of clusters is the number of unique labels minus one (subtracting one because all noise points share the label -1)

We can group by the label column to get the size of each cluster:

The number of noise points is the size of the cluster with label "-1":

If points aren't noise or core samples, they must be edges:
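The bookkeeping above can be sketched with the functional API, `sklearn.cluster.dbscan`, which returns the `(cores, labels)` pair. The coordinates here are made up:

```python
import numpy as np
from sklearn.cluster import dbscan

# Synthetic stand-in for the GPS coordinates: two dense blobs,
# one border point, and one isolated noise point
coords = np.array([
    # dense blob A
    [0.00, 0.00], [0.05, 0.00], [0.00, 0.05], [0.05, 0.05], [0.02, 0.02],
    # dense blob B
    [1.00, 1.00], [1.05, 1.00], [1.00, 1.05], [1.05, 1.05], [1.02, 1.02],
    # within eps of blob A's edge, but too few neighbors to be a core
    [0.22, 0.00],
    # isolated -> noise
    [3.00, 3.00],
])

# The functional API returns (indices of core samples, labels per point)
cores, labels = dbscan(coords, eps=0.2, min_samples=5)

# Number of clusters: unique labels, minus one for the noise label -1
num_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)

# Noise points carry the label -1
num_noise = (labels == -1).sum()

# Points that are neither cores nor noise are edge (border) points
num_edges = len(coords) - len(cores) - num_noise

print(num_clusters, num_noise, num_edges)
```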

Now let's plot the noise and clusters

Extending DBSCAN beyond just spatial coordinates

DBSCAN can find high-density clusters using more than just spatial coordinates, as long as the features are properly normalized
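For example, adding a (normalized) hour-of-day feature lets DBSCAN separate rides at the same location that happen at different times. A hedged sketch with made-up records:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Made-up records: (x, y, hour of day) for two locations x two times of day
X = np.array([
    [0.0, 0.0, 8], [0.1, 0.0, 8], [0.0, 0.1, 8],          # location A, morning
    [0.0, 0.0, 22], [0.1, 0.1, 22], [0.1, 0.0, 22],       # location A, night
    [10.0, 10.0, 8], [10.1, 10.0, 8], [10.0, 10.1, 8],    # location B, morning
    [10.0, 10.0, 22], [10.1, 10.1, 22], [10.1, 10.0, 22], # location B, night
])

# Normalize so spatial distance and "hour" distance are comparable
X_scaled = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.5, min_samples=3).fit(X_scaled)
print(db.labels_)  # four clusters: each location splits by time of day
```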

Exercise: Extracting patterns from NYC taxi rides

I've extracted data for taxi pickups or drop-offs occurring in the Williamsburg neighborhood of NYC from the NYC taxi open data.

Includes data for:

Goal: identify clusters of similar taxi rides that are clustered not only spatially, but also in features like hour of day and trip distance

Inspired by this CARTO blog post

See Lecture 11A for solutions!

Lecture 11A is available here