9/3/2020
To follow along today, go to https://github.com/MUSA-550-Fall-2020/week-1
The following line imports the pandas package:
import pandas as pd
The primary objects in pandas are the:
DataFrame, which is very similar to an Excel spreadsheet, and has rows and named columns
Series, which represents a single column of data
A DataFrame contains one or more Series and a name for each Series. The data frame is a commonly used abstraction for data manipulation; similar implementations exist in R.
You can think of Series objects as fancier versions of Python's built-in list data type.
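To illustrate that "fancier list" idea, here is a toy sketch (not from the lecture data): a Series supports vectorized arithmetic and a labeled index, which a plain list does not.

```python
import pandas as pd

# a plain list and a Series holding the same values
values = [1, 2, 3]
s = pd.Series(values, index=["a", "b", "c"])

# lists require a loop or comprehension to scale values...
doubled_list = [v * 2 for v in values]

# ...while a Series applies the operation to every element at once
doubled_series = s * 2

print(doubled_list)          # [2, 4, 6]
print(doubled_series["b"])   # 4, accessed by label rather than position
```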
To create a Series
object:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
city_names
DataFrame objects can be created by passing a dict mapping string column names to their respective Series.
cities_df = pd.DataFrame({ 'City Name': city_names, 'Population': population })
cities_df
Note: always try to give your variables meaningful names. It will help immensely when you are trying to debug problems or when you're trying to understand the code you've written months later.
# access columns with a dict-like syntax
cities_df['Population']
# list slicing: get the elements with indices 1 and 2 (but NOT 3)
cities_list = ['San Francisco', 'San Jose', 'Sacramento']
cities_list[1:3]
Unfortunately the functionality for slicing lists is not that powerful...but pandas will have many more features!
# slicing data frame rows is very similar!
cities_df[1:3]
pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
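For instance, a few of the most common selection patterns look like this (a sketch that re-creates the small cities frame from above so the snippet is self-contained):

```python
import pandas as pd

cities_df = pd.DataFrame({
    "City Name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
})

# select a column by name (dict-like syntax)
pop = cities_df["Population"]

# select a row by integer position with iloc
first_row = cities_df.iloc[0]

# select rows with a boolean mask via loc
big = cities_df.loc[cities_df["Population"] > 500000]

print(big["City Name"].tolist())  # ['San Francisco', 'San Jose']
```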
cities_df['Population'].min()
NumPy is a popular toolkit for scientific computing.
pandas Series objects can be used as arguments to most NumPy functions:
import numpy as np
# calculate the median population value
np.median(cities_df['Population'])
For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.
For example, we can find which cities have a population greater than a million:
# define our function
def get_large_cities(population):
    return population > 1e6
large_cities_sel = cities_df['Population'].apply(get_large_cities)
large_cities_sel
# add the new computed column to our original data frame
cities_df["Large Cities"] = large_cities_sel
# we can also use lambda (unnamed, inline) functions
cities_df["Small Cities"] = cities_df["Population"].apply(
    lambda population: population < 1e6
)
# print out
cities_df
We can select the "large" cities by passing the boolean values to the .loc[] function of a DataFrame:
cities_df['Large Cities']
cities_df.loc[cities_df['Large Cities']]
cities_df['Population'] > 1e6
# this is equivalent to doing
cities_df.loc[cities_df['Population'] > 1e6]
Use the Python tilde operator to do a logical not operation:
# reverse the large cities boolean selection
cities_df.loc[~large_cities_sel]
# or equivalently:
# NOTE: you need to put the whole expression in () and then apply the tilde!
cities_df.loc[~(cities_df['Population'] > 1e6)]
Alternatively, we can use the pandas query() function, which returns a subset of your dataframe based on a string version of the boolean expression.
cities_df.query("Population > 1e6")
The documentation is your best friend! Use the question mark operator!
# use the question mark
pd.DataFrame.loc?
Let's load census data on rental rates from the 2017 American Community Survey.
We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.
Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.
census_df = pd.read_csv("./data/census/ACS_17_1YR_B25003.csv")
census_df.head(n=5)
Select certain columns and rename them.
## FIRST STEP: let's trim to the columns we want
# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
census_df = census_df[columns]
# Peek at the dataframe with trimmed columns
census_df.head()
## STEP 2: Let's rename the columns!
# rename columns
census_df.columns = ['City', 'Total', 'Rentals']
# Return the head (first 5 rows) from the cell
census_df.head()
# Step 3: Add the rental rate as a new column
census_df['Rental Rate'] = census_df['Rentals'] / census_df['Total'] * 100
census_df.head()
# How many rows are in the dataframe? --> use the len() operator
len(census_df)
We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.
# Define the list of city names we want
top_cities_list = [
    "Philadelphia city, Pennsylvania",
    "New York city, New York",
    "Los Angeles city, California",
    "Houston city, Texas",
    "Chicago city, Illinois",
    "Phoenix city, Arizona",
]
# Use isin() to test whether each row's City value is in the list
census_df["City"].isin(top_cities_list)
# Now do the selection!
census_df.loc[census_df["City"].isin(top_cities_list)] # select the valid rows
Only 6 rows as expected!
Alternatively, we could use the query() function. The syntax is a bit different (and I always forget and have to look at the documentation).
In the query() string, you can reference existing variables by prefacing the variable name with an "@" symbol.
For example:
# Get the rows where the City column is in "top_cities_list"
census_df.query("City in @top_cities_list")
Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2017 is:
year_str = "17"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
See this guide on this type of string formatting for more info.
Python version 3.6 introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.
The variable name should be enclosed in curly braces inside the string:
year_str = "17"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path)
For more info: see this guide
Let's use a for loop to loop over each year value from 2005 to 2017:
all_rental_data = []
for year in range(2005, 2018):

    # ---------------------------------------------
    # Step 0: Print out year
    # The variable year changes for each iteration of the loop
    # ---------------------------------------------
    print(f"Loading data for year = {year}...")

    # ---------------------------------
    # Step 1: Read data for this year
    # ---------------------------------
    year_str = str(year)  # convert integer value of "year" to a string
    year_str = year_str[2:]  # extract the last two digits of the year string
    rental_df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")

    # ---------------------------------
    # Step 2: Columns we want
    # ---------------------------------
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    rental_df = rental_df[columns]

    # ---------------------------------
    # Step 3: Rename columns
    # ---------------------------------
    rental_df.columns = ['City', 'Total', 'Rentals']

    # ----------------------------------
    # Step 4: Calculate the rental rate
    # ----------------------------------
    rental_df['Rental Rate'] = rental_df['Rentals'] / rental_df['Total'] * 100

    # ----------------------------------
    # Step 5: Select the cities we want
    # ----------------------------------
    selection = rental_df['City'].isin(top_cities_list)
    top_cities_df = rental_df.loc[selection].copy()

    # ----------------------------------
    # Step 6: Add a column for the year
    # ----------------------------------
    top_cities_df['Year'] = year

    # -----------------------------------
    # Step 7: Save this dataframe in the list
    # -----------------------------------
    all_rental_data.append(top_cities_df)

# ---------------------------------------------------
# Step 8: After the for loop, combine all of the
# data frames into one along the row axis
# ---------------------------------------------------
all_rental_data = pd.concat(all_rental_data, axis=0)
# Note: the pd.concat() took a list of dataframes and converted them
# into a single dataframe!
type(all_rental_data)
all_rental_data.head()
# What about the tail?
all_rental_data.tail()
# How many rows? Hopefully (6 cities x 13 years = 78!)
len(all_rental_data)
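The pd.concat() step above can be sketched with two toy dataframes (hypothetical values, not the census data):

```python
import pandas as pd

df_2005 = pd.DataFrame({"City": ["Philadelphia"], "Year": [2005]})
df_2006 = pd.DataFrame({"City": ["Philadelphia"], "Year": [2006]})

# stack the frames along the row axis (axis=0), just like the loop above
combined = pd.concat([df_2005, df_2006], axis=0)

print(len(combined))              # 2
print(combined["Year"].tolist())  # [2005, 2006]
```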
matplotlib
Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library, matplotlib.
Import the library we'll use from matplotlib:
from matplotlib import pyplot as plt
We'll use the plot() function, which plots a simple x vs. y line, to plot the sub-selection data frame.
To find the unique city names in the 'City' column, we can use the unique() function:
all_rental_data['City'].unique()
with plt.style.context('ggplot'):

    # Step 1: Create a figure and axes to plot on
    fig, ax = plt.subplots(figsize=(10, 6))

    # Step 2: Loop over each city and plot
    for city in all_rental_data['City'].unique():

        # Print out the city in each loop iteration
        print(city)

        # Select data for this city
        this_city = all_rental_data['City'] == city
        city_df = all_rental_data.loc[this_city]

        # Plot Year vs Rental Rate
        ax.plot(city_df['Year'], city_df['Rental Rate'], label=city, linewidth=4)

    # Add a legend
    ax.legend(loc=0, ncol=3, fontsize=12)

    # Set the y limit
    ax.set_ylim(30, 72)

    # Add a y label
    ax.set_ylabel("Rental Rate in Percent");
Note: we used the "ggplot" theme to style our plots. Matplotlib comes with a number of built in styles: see all of the examples here.
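You can list the styles bundled with your matplotlib installation directly (the exact set varies by version, but "ggplot" is a long-standing built-in):

```python
import matplotlib.pyplot as plt

# every style name usable with plt.style.context(...)
print(plt.style.available)

# the ggplot theme used above should be among them
print("ggplot" in plt.style.available)  # True
```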
# trim to just Philadelphia first, using this boolean selection index
all_rental_data['City']=='Philadelphia city, Pennsylvania'
# select Philadelphia only
philly = all_rental_data.loc[all_rental_data['City']=='Philadelphia city, Pennsylvania']
philly
Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:
philly.loc[393]
This is different from the .iloc[] function, which indexes based on the integer position of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using:
philly.iloc[0]
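The label-vs-position distinction is easy to see on a toy dataframe whose index labels don't start at 0 (toy values, not the census frame):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[393, 457, 521])

# .loc indexes by label...
by_label = df.loc[393, "value"]

# ...while .iloc indexes by integer position, regardless of labels
by_position = df.iloc[0]["value"]

print(by_label, by_position)  # 10 10
```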
We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example:
philly.reset_index(drop=True)
Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2017:
# select 2005
philly_2005 = philly.loc[philly['Year']==2005]
value_2005 = philly_2005['Rental Rate'].iloc[0]
print("2005 value = ", value_2005)
philly_2005['Rental Rate']
# select 2017
philly_2017 = philly.loc[philly['Year']==2017]
value_2017 = philly_2017['Rental Rate'].iloc[0]
print("2017 value = ", value_2017)
# calculate percent change
percent_change = (value_2017 - value_2005)/ value_2005 * 100
print("change in rental rate in Philadelphia from 2005 to 2017 = ", percent_change)
For more details on the .iloc[] function, see the documentation on indexing by position.
Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.
Files were downloaded from https://www.zillow.com/research/data/
home_values = pd.read_csv("./data/zillow/Metro_Zhvi_AllHomes.csv", engine='python')
rent_values = pd.read_csv('./data/zillow/Metro_Zri_SingleFamilyResidenceRental.csv', engine='python')
Note: the engine='python' keyword is necessary here to load the Zillow data, otherwise you will get a UnicodeDecodeError exception.
Peek at the first few rows of the ZRI data:
rent_values.head()
And do the same for the ZHVI data:
home_values.head()
valid_cities = [
    "New York, NY",
    "Chicago, IL",
    "Los Angeles-Long Beach-Anaheim, CA",
    "Philadelphia, PA",
    "Houston, TX",
    "Phoenix, AZ",
]
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
rent_values_trimmed
Unwanted columns can be dropped from the data frame using the drop() function. Note that the column axis is the second axis (axis=1), and if you wanted to remove rows, you could use the first axis (axis=0).
# remember: give your variables meaningful names!
cols_to_drop = ['SizeRank', 'RegionID']
home_values_final = home_values_trimmed.drop(cols_to_drop, axis=1)
rent_values_final = rent_values_trimmed.drop(cols_to_drop, axis=1)
rent_values_final
Currently, our data is in wide format: each observation has its own column. This usually results in many columns but few rows.
home_values_final
Usually it's better to have data in tidy (also known as long) format.
Tidy datasets are arranged such that each variable is a column and each observation is a row.
In our case, we want to have a column called ZRI and one called ZHVI, and a row for each month that the indices were measured.
pandas provides the melt() function for converting from wide formats to tidy formats.
pd.melt?
Now, let's melt our datasets:
ZHVI = pd.melt(
    home_values_final, id_vars=["RegionName"], value_name="ZHVI", var_name="Date"
)
ZRI = pd.melt(
    rent_values_final, id_vars=["RegionName"], value_name="ZRI", var_name="Date"
)
and take a look:
ZRI.tail()
ZHVI.head()
Note that missing data values are represented as NaN ("not a number").
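For reference, NaN values can be detected (and dropped) with isna() and dropna() — a toy sketch, since these helpers aren't used in the lecture itself:

```python
import pandas as pd

# None is converted to NaN when the Series is created
s = pd.Series([852469.0, None, 485199.0])

# isna() flags missing entries; dropna() removes them
print(s.isna().tolist())    # [False, True, False]
print(s.dropna().tolist())  # [852469.0, 485199.0]
```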
Another common operation is merging, also known as joining, two datasets.
We can use the merge() function to merge observations that have the same Date and RegionName values.
zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
zillow_data.head()
Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
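The different join types can be sketched on toy frames (hypothetical values; only how='outer' is used above):

```python
import pandas as pd

left = pd.DataFrame({"Date": ["2019-01", "2019-02"], "ZRI": [1500, 1520]})
right = pd.DataFrame({"Date": ["2019-02", "2019-03"], "ZHVI": [230000, 231000]})

# inner: only Dates present in BOTH frames survive
inner = pd.merge(left, right, on="Date", how="inner")

# outer: all Dates from either frame, with NaN filling the gaps
outer = pd.merge(left, right, on="Date", how="outer")

print(len(inner), len(outer))  # 1 3
```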
Currently our Date column is stored as a string.
pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.
# convert the Date strings to datetime objects
zillow_data['Date'] = pd.to_datetime(zillow_data['Date'])
Quick trick: Series that hold datetime objects have a dt attribute that lets you grab parts of the date easily.
For example, we can easily add new columns for the month and year using:
# Note that the dtype is now datetime64[ns]
zillow_data['Date'].head()
zillow_data['Month'] = zillow_data['Date'].dt.month
zillow_data['Year'] = zillow_data['Date'].dt.year
zillow_data.head()
Year
pandas is especially useful for grouping and aggregating data via the groupby() function.
From the pandas documentation, groupby refers to a process of splitting the data into groups, applying a function to each group independently, and combining the results into a data structure.
The documentation is available here.
We can calculate annual averages for each year by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = zillow_data.groupby(['RegionName', 'Year'])['ZHVI'].mean()
annual_ZRI = zillow_data.groupby(['RegionName', 'Year'])['ZRI'].mean()
print(type(annual_ZHVI))
annual_ZHVI.head()
Note here that the result is indexed by the columns we grouped by (RegionName and Year).
We can reset the index so that the index values are listed as columns in the data frame again.
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
annual_ZHVI.head(n=50)
with plt.style.context("ggplot"):

    # Create figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot for each unique city
    for city in annual_ZHVI["RegionName"].unique():

        # select the data for this city
        selection = annual_ZHVI["RegionName"] == city
        df = annual_ZHVI.loc[selection]

        # plot
        ax.plot(df["Year"], df["ZHVI"] / 1e3, label=city, linewidth=4)

    # Format the axes
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2, fontsize=12)
    ax.set_ylabel("Zillow Home Value Index\n(in thousands of dollars)")
with plt.style.context('ggplot'):

    # Create the figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Loop over the cities to plot each one
    for city in annual_ZRI['RegionName'].unique():

        # Select the city data
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]

        # Plot
        ax.plot(df['Year'], df['ZRI'], label=city, linewidth=4)

    # Format
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')
Available on GitHub:
Two parts:
Due date: next Thursday 9/10 by the start of class (6pm)
Announcements & reminders will go out on Piazza!