Welcome to MUSA 550:
Geospatial Data Science in Python

Today¶

Course logistics
Using Jupyter Notebooks
Introduction to Python & Pandas

My day job¶

My name is Nick Hand
I am the Director of Finance, Data, and Policy for City Controller Rebecca Rhynhart
Goals of our team:
- Objective, data-driven analysis of financial policies impacting Philadelphia
- Increasing transparency through data releases and interactive reports

My team covers a range of policy issues in the city:

--> Check out https://controller.phila.gov/policy-analysis

Previously:
Astrophysics Ph.D. at Berkeley

How did I get here?¶

Astrophysics to data science is becoming increasingly common
Landing a job through Twitter: https://www.parkingjawn.com
- Dashboard visualization of monthly parking tickets in Philadelphia
- Data from OpenDataPhilly

Goal: exploring and extracting insight from complex datasets

Parking Jawn¶

https://www.parkingjawn.com
Cross-filtering
Different views
See drop in parking tickets over Jan 24-26, 2016 due to snowstorm

Class logistics¶

Lectures¶

Fully remote — all lectures will be recorded and posted to the course's Canvas site
3 hours of lecture material per week
Classes will be a mix of lecturing, interactive demos, and in-class lab time
Students learning from multiple time zones — trying to find best solution for everyone

Course design

Some weeks (mostly earlier in the course) will feature two synchronous 90-minute Zoom classes on Tues./Thurs.
Other weeks (mostly later in the course) will employ a "flipped" classroom model: a pre-recorded 90-minute lecture will be available at the beginning of the week, and the synchronous Thurs. class will be a lab session to work through the material

Note: there will be an introductory survey for all students and it will ask about the possibility of moving lectures and preferred time slots (this will only occur if a time that works for everyone can be found)

Office Hours¶

Office hours will also be remote and by appointment only
I will be available for a 2-hour time slot each week (but will also work with your schedules to find times that work)
Times are to be determined
Course has one teaching assistant: Eugene Chong
- Email: echong91@upenn.edu
- Office hours: TBD

Policies¶

Available at: https://musa-550-fall-2020.github.io/syllabus/policies/

Course Websites¶

Course has four websites (sorry!). They are:

Main Course: https://musa-550-fall-2020.github.io
Github: https://github.com/MUSA-550-Fall-2020
Piazza: https://piazza.com/upenn/fall2020/musa550/home
Canvas: https://canvas.upenn.edu/courses/1533812

Each will have its own purpose:

Main course website:¶

Course schedule with links to weekly slides
Resources for learning Python, setting up software, and dealing with common issues
General course info and policies

Github¶

Github organization set up for the course
Each week and assignment will have its own Github repository
Assignments will also be submitted through Github

Piazza¶

Will be used as a question-and-answer forum for course materials and assignments
Announcements will also be made here so make sure you check frequently or turn on your notifications!
Main method of communication will be through Piazza announcements
Participation grade will also be determined by user activity on the Piazza forum

Canvas¶

Will be used to host recorded lecture files
Will also contain Zoom information for remote lectures via the course calendar

Course Website: Highlights¶

Syllabus
- Course information
- Course policies
Python resources
Instructions for setting up your local software environment
Guides
Cheatsheets
- Python
- Conda
- Matplotlib
- Pandas
- Seaborn

Course Github¶

The goals of this course¶

Provide students with the knowledge and tools to turn data into meaningful insights and stories
Focus on the modern data science tools within the Python ecosystem
The pipeline approach to data science:
- gathering, storing, analyzing, and visualizing data to tell stories
Real-world applications of analysis techniques in the urban planning and public policy realm

What we'll cover¶

Module 1¶

Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python

Module 2¶

Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.

Module 3¶

Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.

Module 4¶

Geospatial Data Science in the Wild: Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.

Module 5¶

From Exploration to Storytelling: The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.

Assignments and grading¶

Grading:
- 50% homework
- 40% final project
- 10% participation (based on Piazza participation)
Late policy: assignments will be accepted late, but with a penalty

Homeworks will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments. Four of the assignments are required, and you are allowed to choose the last assignment to complete (out of the remaining three options).

Final Project¶

The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.

Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.

More info will be posted here: https://github.com/MUSA-550-Fall-2020/final-project

Any questions so far?¶

Initial survey¶

https://www.surveymonkey.com/r/TSVNMGP

Okay, let's get started...¶

The Incredible Growth of Python¶

A StackOverflow analysis

The rise of the Jupyter notebook¶

The engine of collaborative data science¶

First started by a physics grad student around 2001
Known as the IPython notebook originally
Started getting popular around 2011
First funding received in 2015 $\rightarrow$ the Jupyter notebook was born

Google searches for Jupyter notebook¶

Key features¶

Aimed at "computational narratives" — telling stories with data
interactive, reproducible, shareable, user-friendly, visualization-focused

Very versatile: good for both exploratory data analysis and polished finished products

Beyond the Jupyter notebook¶

Google's Colaboratory¶

A fancier notebook experience built on top of Jupyter notebook
Running in the cloud on Google's servers
An internal Google product that was recently released publicly
Very popular for Python-based machine learning
Won't need to use much in this course

See https://colab.research.google.com/notebooks/welcome.ipynb

Binder: https://mybinder.org ¶

Allows you to launch a repository of Jupyter notebooks on GitHub in the cloud¶

Note: as a free service, it can be a bit slow sometimes

Weekly lectures are available on Binder¶

Weekly Workflow¶

Set up local Python environment as part of first homework assignment (week 1, posted on Thurs. 9/3)
Each week, you will have two options to follow along with lectures:
1. Using Binder in the cloud, launching via the button on the week's repository
2. Download the week's repository to your laptop and launch the notebook locally
Work on homeworks locally on your laptop — Binder is only a temporary environment (no save features)

To follow along today, go to https://github.com/MUSA-550-Fall-2020/week-1

Now to the fun stuff...¶

These slides are a Jupyter notebook.

A mix of code cells and text cells in Markdown. Change the type of cell in the top menu bar.

# Comments begin with a "#" character in Python
# A simple code cell
# SHIFT-ENTER to execute


x = 10
print(x)

10

Python data types¶

# integer
a = 10

# float
b = 10.5

# string
c = "this is a test string"

# lists
d = list(range(10))

# booleans 
e = True

# dictionaries
f = {'key1': 1, "key2": 2}

print(a)
print(b)
print(c)
print(d)
print(e)
print(f)

10
10.5
this is a test string
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
True
{'key1': 1, 'key2': 2}
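To check which type a value has, the built-in type() and isinstance() functions are handy. A minimal sketch, redefining the variables from the cell above so that it runs on its own:

```python
# Redefine the example variables so this block is self-contained
a = 10
b = 10.5
c = "this is a test string"
d = list(range(10))
e = True
f = {"key1": 1, "key2": 2}

# type() returns the class of an object; __name__ is its short name
for value in [a, b, c, d, e, f]:
    print(type(value).__name__)

# isinstance() tests whether a value belongs to a given type
is_float = isinstance(b, float)
print(is_float)
```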

Alternative method for creating a dictionary¶

f = dict(key1=1, key2=2, key3=3)

Accessing dictionary values¶

# access the value with key 'key1'
f['key1']

1
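A few other common dictionary operations, sketched with the same example dictionary (the key name "key4" is only for illustration):

```python
f = dict(key1=1, key2=2, key3=3)

# .get() returns a fallback value instead of raising a KeyError
# when the key is missing
missing = f.get("key4", 0)

# Assigning to a new key adds it to the dictionary
f["key4"] = 4

# Loop over the key/value pairs
for key, value in f.items():
    print(key, value)
```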

Accessing list values¶

d

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# access the last list entry (index 9, since 0 is the first index)
d[9]

9
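Because indexing starts at 0, d[1] is the second entry. Negative indices and slices are also worth knowing. A short sketch using the same list:

```python
d = list(range(10))

first = d[0]         # first entry
second = d[1]        # second entry
last = d[-1]         # negative indices count from the end
first_three = d[:3]  # slice with indices 0, 1, 2 (the end index is excluded)
every_other = d[::2] # every second entry, starting from index 0

print(first, second, last)
print(first_three)
print(every_other)
```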

Accessing characters of a string¶

c

'this is a test string'

# the first character
c[0]

't'
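Strings support the same slicing as lists, plus a number of built-in methods. A brief sketch with the same test string:

```python
c = "this is a test string"

first_word = c[:4]   # slicing works the same way as for lists
n_chars = len(c)     # number of characters, spaces included
words = c.split()    # split on whitespace into a list of words
upper = c.upper()    # strings are immutable, so this returns a new string

print(first_word)
print(n_chars)
print(words)
print(upper)
```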

Iterators and for loops¶

# Python code
result = 0
for i in range(100):
    result = result + i

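Note that indentation matters in Python. If the loop body on line 4 is not indented, the cell fails with an error like this:

  File "<ipython-input-30-64622c786bbd>", line 4
    result = result + i
         ^
IndentationError: expected an indented block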

print(result)

4950
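The same accumulation can be done with the built-in sum(), and enumerate() is useful when a loop needs both the position and the value. A small sketch (the list of names is a made-up example):

```python
# The explicit loop: add up the integers 0 through 99
result = 0
for i in range(100):
    result = result + i

# The built-in sum() performs the same accumulation
total = sum(range(100))
print(result, total)

# enumerate() yields (index, value) pairs while looping
names = ["a", "b", "c"]  # hypothetical example list
pairs = []
for index, name in enumerate(names):
    pairs.append((index, name))
    print(index, name)
```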

Python's inline syntax¶

a = range(10) # this is a lazy range object (an iterable), not a list

print(a)

range(0, 10)

# convert it to a list explicitly
a = list(range(10))
print(a)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# or use the INLINE syntax (known as a "list comprehension"); the result is the SAME
a = [i for i in range(10)]
print(a)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
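The inline syntax can also transform or filter values as it builds the list, and a similar form exists for dictionaries. A minimal sketch:

```python
# Transform each value: squares of 0 through 9
squares = [i * i for i in range(10)]

# Filter values with a trailing "if": keep only the even numbers
evens = [i for i in range(10) if i % 2 == 0]

# The same idea with curly braces builds a dictionary
square_lookup = {i: i * i for i in range(5)}

print(squares)
print(evens)
print(square_lookup)
```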

Python functions¶

def function_name(arg1, arg2, arg3):
    # ...
    # code lines (indented)
    # ...
    result = ...  # placeholder for the computed value
    return result

def compute_square(x):
    return x * x

sq = compute_square(5)
print(sq)

25

Keywords: arguments with a default!¶

def compute_product(x, y=5):
    return x * y

# use the default value for y
print(compute_product(5))

25

# specify a y value other than the default
print(compute_product(5, 10))

50

# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(y=2, x=5))

10
10

# argument names must match the function signature though!
print(compute_product(5, z=5))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-5db83a932315> in <module>
      1 # argument names must match the function signature though!
----> 2 print(compute_product(5, z=5))

TypeError: compute_product() got an unexpected keyword argument 'z'
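An uncaught error like this one stops the cell. One way to handle it, sketched here as an illustration rather than something the lecture covers in depth, is a try/except block that catches the TypeError and inspects its message:

```python
def compute_product(x, y=5):
    return x * y

message = None
try:
    # "z" is not a parameter of compute_product, so this call raises TypeError
    compute_product(5, z=5)
except TypeError as err:
    message = str(err)
    print("Caught an error:", message)
```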

Getting help in the notebook¶

Use tab auto-completion and the ? and ?? operators

this_variable_has_a_long_name = 5

# try hitting tab after typing this_ 
this_variable_has_a_long_name

5

# try typing "r" and then tab
list(range(5, 10, 2))

[5, 7, 9]

# Forget how to create a range? --> use the help message
range?

Peeking at the source code for a function¶

Use the ?? operator

# Let's re-define compute_product() and add a docstring between """ """
def compute_product(x, y=5):
    """This computes the product of x and y"""
    return x * y

compute_product?

The question mark operator gives you access to the help message for any variable or function. I use this frequently, and it is the primary way I learn what functions do.
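The ? and ?? operators only work in IPython and Jupyter. In plain Python, similar information is available from the built-in help() function and the standard-library inspect module. A minimal sketch:

```python
import inspect

def compute_product(x, y=5):
    """This computes the product of x and y"""
    return x * y

# The docstring, similar to what "?" displays
doc = inspect.getdoc(compute_product)

# The call signature, including default values
sig = str(inspect.signature(compute_product))

print(sig)
print(doc)
```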

Getting more Python help¶

This was a very brief introduction. Additional Python tutorials are listed on our course website under "Resources"

https://musa-550-fall-2020.github.io/resources/python/

Recommended tutorial for students with little Python background:

Practical Python Programming

There are also a few good resources from the Berkeley Institute for Data Science (BIDS):

https://bids.github.io/2016-01-14-berkeley/python/00-python-intro.html (notebook version)
Python for Social Science, a free online book
Many more resources are listed here: http://python.berkeley.edu/resources/

The Data Science Handbook¶

The Python Data Science Handbook is a free, online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent.

Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using Google Colab.

One more thing: working outside the notebook¶

In this class, we will almost exclusively work inside Jupyter notebooks — you'll be writing Python code and doing data analysis directly in the notebook.

The more traditional method of using Python is to put your code into a .py file and execute it via the command line (known as the Anaconda Prompt on Windows or the Terminal app on macOS).

See this section of the Practical Python Programming tutorial for more info.

There is a file called hello_world.py in the repository for week 1. If we execute it, it should print out "Hello World!" to the command line.

Let's try it out.

Notebook tip¶

You can run terminal commands directly in the Jupyter notebook's "code" cell by starting the line with a "!"

To list all of the files in the current folder (the "current working directory"), use the ls command:

! ls

README.md                 hello_world.py            lecture-1.ipynb
__pycache__               joining_infographic.jpg   outline.md
data                      lecture-1-solutions.ipynb
environment.yml           lecture-1.html
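The same listing can be produced in pure Python, which behaves the same way on Windows, where the ls command may not be available. A minimal sketch using the standard-library pathlib module:

```python
from pathlib import Path

# The current working directory (the folder the notebook is running in)
cwd = Path.cwd()

# Collect and sort the names of the files and folders it contains
names = sorted(p.name for p in cwd.iterdir())

for name in names:
    print(name)
```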

We see the hello_world.py file listed. Now let's execute it on the command line by using the python command:

! python hello_world.py

Hello World!

# We can run the same code right in the browser!
print("Hello World!")

Hello World!

Success!
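The "!" shortcut is a notebook feature. From an ordinary Python script, a command can be launched with the standard-library subprocess module. The sketch below passes the code inline with the -c flag instead of reading hello_world.py, so that it runs from any folder:

```python
import subprocess
import sys

# sys.executable is the path to the Python interpreter that is currently running
completed = subprocess.run(
    [sys.executable, "-c", 'print("Hello World!")'],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # return the output as str rather than bytes
    check=True,           # raise an error if the command fails
)

output = completed.stdout.strip()
print(output)
```

To run the script itself, replace the -c flag and the inline code with the path to hello_world.py.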

Code editors¶

When writing software outside the notebook, it's useful to have an application known as a "code editor". This will provide a nice interface for writing Python code and some even have fancy features, like real-time syntax checking and syntax highlighting.

My recommended option is Visual Studio Code.
