Week 5B
Getting Data Part 1: Working with APIs

Oct 1, 2020

Week #5 Agenda

Last time:

  • Introduction to APIs
  • Pulling census data and shape files using Python

Today:

  • API Example: Lead poisoning in Philadelphia
  • Using the Twitter API
    • Plotting geocoded tweets
    • Word frequencies
    • Sentiment analysis
In [1]:
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Point

from matplotlib import pyplot as plt
import seaborn as sns

import hvplot.pandas
import holoviews as hv

hv.extension("bokeh")
%matplotlib inline
In [2]:
import carto2gpd
import cenpy

Last time: Querying the census with cenpy

First, let's initialize a connection to the 2018 5-year ACS dataset

In [3]:
acs = cenpy.remote.APIConnection("ACSDT5Y2018")
In [4]:
# Set the map service for pulling geometries
acs.set_mapservice("tigerWMS_ACS2018")
Out[4]:
Connection to American Community Survey: 1-Year Estimates: Detailed Tables 5-Year(ID: https://api.census.gov/data/id/ACSDT5Y2018)
With MapServer: Census Current (2018) WMS

Exercise: lead poisoning in Philadelphia

Let's pull demographic census data for Philadelphia by census tract and compare it to a dataset of childhood lead poisoning.

Step 1. Download the demographic data for Philadelphia

  • We are going to be examining the percent of the population that identifies as black in each census tract, so we will need:
    • Total population: 'B03002_001E'
    • Non-Hispanic, Black population: 'B03002_004E'
  • You'll want to use the state --> county --> tract hierarchy, using the * operator to get all tracts in Philadelphia County
  • Remember PA has a FIPS code of "42" and Philadelphia County is "101"
In [5]:
philly_demo_tract = acs.query(
    cols=["NAME", "B03002_001E", "B03002_004E"],
    geo_unit="tract:*",
    geo_filter={"state" : "42", 
                "county" : "101"},
)

Step 2. Download and merge in the census tract geometries

In [6]:
# Census tracts are the 9th layer (index 8 starting from 0)
acs.mapservice.layers[8]
Out[6]:
(ESRILayer) Census Tracts
In [7]:
# Use SQL to return geometries only for Philadelphia County in PA
where_clause = "STATE = 42 AND COUNTY = 101"

# Query for census tracts
philly_census_tracts = acs.mapservice.layers[8].query(where=where_clause)
In [8]:
philly_census_tracts.head(n=1)
Out[8]:
MTFCC OID GEOID STATE COUNTY TRACT BASENAME NAME LSADC FUNCSTAT AREALAND AREAWATER CENTLAT CENTLON INTPTLAT INTPTLON OBJECTID STGEOMETRY.AREA STGEOMETRY.LEN geometry
0 G5020 207583717001577 42101036501 42 101 036501 365.01 Census Tract 365.01 CT S 1816210 7169 +40.1293202 -075.0131963 +40.1288937 -075.0122761 411 3.122595e+06 7152.056313 POLYGON ((-8351601.195 4884859.264, -8351523.0...
In [9]:
philly_demo_tract.head(n=1)
Out[9]:
NAME B03002_001E B03002_004E state county tract
0 Census Tract 359, Philadelphia County, Pennsyl... 5745 144 42 101 035900
In [10]:
philly_demo_tract = philly_census_tracts.merge(
    philly_demo_tract,
    left_on=["STATE", "COUNTY", "TRACT"],
    right_on=["state", "county", "tract"],
)

Step 3. Calculate the black percentage

Add a new column to your data called percent_black.

Important: the census API returns values as strings, so make sure you convert the data to floats!

In [11]:
for col in ['B03002_001E', 'B03002_004E']:
    philly_demo_tract[col] = philly_demo_tract[col].astype(float)
In [12]:
philly_demo_tract['percent_black'] = 100 * philly_demo_tract['B03002_004E'] / philly_demo_tract['B03002_001E']

Step 4. Query CARTO to get the childhood lead levels by census tract

In [13]:
table_name = 'child_blood_lead_levels_by_ct'
lead_levels = carto2gpd.get("https://phl.carto.com/api/v2/sql", table_name)
In [14]:
lead_levels.head()
Out[14]:
geometry cartodb_id census_tract data_redacted num_bll_5plus num_screen perc_5plus
0 POLYGON ((-75.14147 39.95171, -75.14150 39.951... 1 42101000100 False 0.0 100.0 0.0
1 POLYGON ((-75.16238 39.95766, -75.16236 39.957... 2 42101000200 True NaN 109.0 NaN
2 POLYGON ((-75.17821 39.95981, -75.17743 39.959... 3 42101000300 True NaN 110.0 NaN
3 POLYGON ((-75.17299 39.95464, -75.17301 39.954... 4 42101000401 True NaN 61.0 NaN
4 POLYGON ((-75.16333 39.95334, -75.16340 39.953... 5 42101000402 False 0.0 41.0 0.0

Step 5. Remove census tracts with missing lead measurements

See the .dropna() function and the subset= keyword.

In [15]:
lead_levels = lead_levels.dropna(subset=['perc_5plus'])

Step 6. Merge the demographic and lead level data frames

  • From the lead data, we only need the 'census_tract' and 'perc_5plus'. Before merging, trim your data to only these columns.
  • You can perform the merge by comparing the census_tract and GEOID fields
  • Remember: when merging, the left data frame should be the GeoDataFrame — use GeoDataFrame.merge(...)
In [16]:
# Trim the lead levels data
lead_levels_trimmed = lead_levels[['census_tract', 'perc_5plus']]

# Merge into the demographic data
merged = philly_demo_tract.merge(lead_levels_trimmed, 
                                 how='left', 
                                 left_on='GEOID', 
                                 right_on='census_tract')

Step 7. Trim to the columns we need

We only need the 'NAME', 'geometry', 'percent_black', and 'perc_5plus' columns. (After the merge, the census name column is labeled 'NAME_x'.)

In [17]:
merged = merged[['NAME_x', 'geometry', 'percent_black', 'perc_5plus']]
In [19]:
merged.head()
Out[19]:
NAME_x geometry percent_black perc_5plus
0 Census Tract 365.01 POLYGON ((-8351601.195 4884859.264, -8351523.0... 10.528316 NaN
1 Census Tract 8.01 POLYGON ((-8369365.336 4858608.211, -8369184.4... 4.828551 NaN
2 Census Tract 1 POLYGON ((-8365905.972 4858674.427, -8365885.3... 7.513202 0.0
3 Census Tract 2 POLYGON ((-8367094.196 4859452.933, -8367072.1... 5.628882 NaN
4 Census Tract 3 POLYGON ((-8368992.305 4860135.356, -8368934.1... 6.416957 NaN

Step 8. Plot the results

Make two plots:

  1. A two panel, side-by-side chart showing a choropleth of the lead levels and the percent black
  2. A scatter plot comparing the lead levels and the percent black

You can make these using hvplot or geopandas/matplotlib — whichever you prefer!

In [20]:
# Lead levels plot
img1 = merged.hvplot(geo=True, 
                     crs=3857, 
                     c='perc_5plus', 
                     width=500, 
                     height=400, 
                     cmap='viridis', 
                     title='Lead Levels')

# Percent black 
img2 = merged.hvplot(geo=True,
                     crs=3857,
                     c='percent_black', 
                     width=500, 
                     height=400, 
                     cmap='viridis', 
                     title='% Black')

img1 + img2
Out[20]:
In [21]:
cols = ['perc_5plus', 'percent_black']
merged[cols].hvplot.scatter(x=cols[0], y=cols[1])
Out[21]:

Step 9. Use seaborn to plot a 2d density map

In the previous plots, it's still hard to see the relationship. Use the kdeplot() function in seaborn to visualize it more clearly.

You will need to remove any NaN entries first.

You should see two peaks in the distribution clearly now!

In [22]:
fig, ax = plt.subplots(figsize=(8,6))

X = merged.dropna()
sns.kdeplot(X['perc_5plus'], X['percent_black'], ax=ax)
Out[22]:
<AxesSubplot:xlabel='perc_5plus', ylabel='percent_black'>

API Example #2: the Twitter API

Twitter provides a rich source of information, but the challenge is how to extract the information from semi-structured data.

Semi-structured data

Data that contains some elements that cannot be easily consumed by computers

Examples: human-readable text, audio, images, etc

Key challenges

  • Text mining: analyzing blocks of text to extract the relevant pieces of information
  • Natural language processing (NLP): programming computers to process and analyze human languages
  • Sentiment analysis: analyzing blocks of text to derive the attitude or emotional state of the person

First: Getting an API key

Step 1: Make a Twitter account

Step 2: Apply for Developer access

See: https://developer.twitter.com/apps

You will need to apply for a Developer Account, answer a few questions, and then confirm your email address.

Once you submit your application, you'll need to wait for approval. Usually this happens immediately, but there can sometimes be a short delay.

Sample answer

Needs to be at least 100 characters

  1. I'm using Twitter's API to perform a sentiment analysis as part of a class teaching Python. I will be interacting with the API using the Python package tweepy.
  2. I plan to analyze tweets to understand topic sentiments.
  3. I will not be interacting with Twitter users as part of this project.
  4. I will not be displaying Twitter content off of Twitter.

Step 4: Create your API keys

In the "Keys and Tokens" section, generate new access tokens.

You will need the Consumer API keys and access tokens to use the Twitter API.

We'll be using tweepy to search recent tweets

The standard, free API lets you search tweets from the last 7 days

For more information, see the Twitter Developer docs

Tweepy: a Python interface to Twitter

https://tweepy.readthedocs.io

In [23]:
import tweepy as tw

Define your API keys

In [96]:
# INPUT YOUR API AND ACCESS KEYS HERE
api_key = ""
api_key_secret = ''
access_token = ''
access_token_secret = '' 

Initialize an API object

We need to:

  • set up authentication
  • initialize a tweepy.API object
In [25]:
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

Rate Limits

Be careful: with the free API, requests are rate-limited in 15-minute windows. The search endpoint allows 180 calls per window, while many other endpoints allow only 15.

See the Rate Limits documentation for more details.

What does wait_on_rate_limit do?

If you run into a rate limit while pulling tweets, this tells tweepy to wait until the limit resets (up to 15 minutes) before continuing.

Unfortunately, you need to sign up (and pay) for the premium API to avoid these rate limits.

How to find out how many requests you've made?

In [26]:
data = api.rate_limit_status() 
In [27]:
data
Out[27]:
{'rate_limit_context': {'access_token': '706239336-17wuVyuLZQ3Be80GuB4V8vWAz5PdfVloASHFleVH'},
 'resources': {'lists': {'/lists/list': {'limit': 15,
     'remaining': 15,
     'reset': 1601599777},
    '/lists/memberships': {'limit': 75, 'remaining': 75, 'reset': 1601599777},
    ...},
   'application': {'/application/rate_limit_status': {'limit': 180,
     'remaining': 179,
     'reset': 1601599777}},
   'search': {'/search/tweets': {'limit': 180,
     'remaining': 180,
     'reset': 1601599777}},
   ...}}

(output truncated: the full dictionary contains a 'limit'/'remaining'/'reset' entry for every API endpoint, grouped by resource)
In [28]:
data['resources']['search']
Out[28]:
{'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1601599777}}

Tip: converting a time stamp to a date

In [29]:
import datetime
datetime.datetime.fromtimestamp(1601594736)
Out[29]:
datetime.datetime(2020, 10, 1, 19, 25, 36)
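
Putting these two together, a quick sketch (reusing the api object created above) that checks how many search calls remain and when the window resets:

data = api.rate_limit_status()

# The search endpoint's limits live under resources -> search -> /search/tweets
search_limits = data['resources']['search']['/search/tweets']

print("remaining calls:", search_limits['remaining'])
print("window resets at:", datetime.datetime.fromtimestamp(search_limits['reset']))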

Several different APIs available

Including user tweets, mentions, searching keywords, favoriting, direct messages, and more...

See the Tweepy API documentation

You can also stream tweets

You can set up a listener to listen for new tweets and download them in real time (subject to rate limits).

We won't focus on this, but there is a nice tutorial on the Tweepy documentation.
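
For reference, a minimal sketch of what a listener looks like with the tweepy 3.x streaming interface (the class name here is hypothetical):

class PhilliesListener(tw.StreamListener):
    # Called once for each new tweet matching the filter
    def on_status(self, status):
        print(status.text)

stream = tw.Stream(auth=auth, listener=PhilliesListener())
stream.filter(track=["#phillies"], is_async=True)  # stream in the background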

You can also tweet (if you want!)

You can post tweets using the update_status() function. For example:

tweet = 'Hello World! @PennMusa'
api.update_status(tweet)

We'll focus on the search API

We'll use a tweepy.Cursor object to query the API.

In [31]:
# collect tweets related to the phillies
search_words = "#phillies"
In [32]:
# initialize the cursor
cursor = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   tweet_mode='extended')
cursor
Out[32]:
<tweepy.cursor.Cursor at 0x7fd0897f8690>

Next, specify how many tweets we want

Use the Cursor.items() function:

In [33]:
# select 5 tweets
tweets = cursor.items(5)
tweets
Out[33]:
<tweepy.cursor.ItemIterator at 0x7fd0898012d0>

Python iterators

As the name suggests, iterators need to be iterated over to return objects, and each item is returned only once. In our case, we can use a for loop to iterate through the tweets that we pulled from the API.
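
To see the key behavior in plain Python: an iterator yields each item once and is then exhausted, which is why we create a fresh cursor (or request more items) when we want another batch of tweets.

numbers = iter([1, 2, 3])
next(numbers)   # returns 1
next(numbers)   # returns 2
list(numbers)   # [3] -- only the items not yet consumed
list(numbers)   # [] -- the iterator is now exhausted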

In [34]:
# Iterate on tweets
for tweet in tweets:
    print(tweet.full_text)
RT @MLBcathedrals: The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.c…
RT @MLBcathedrals: The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.c…
RT @MLBcathedrals: The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.c…
RT @MLBcathedrals: The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.c…
RT @MLBcathedrals: The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.c…

The concept of "paging"

Unfortunately, there is no way to search for tweets between specific dates.

The API pulls tweets from the most recent page of the search result, and then grabs tweets from the previous page, and so on, to return the requested number of tweets.
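
You can also work with the pages directly: a tweepy Cursor provides a pages() iterator, where each page is a list of tweets. A quick sketch:

# Iterate page by page instead of tweet by tweet
cursor = tw.Cursor(api.search, q=search_words, lang="en", tweet_mode="extended")
for page in cursor.pages(2):  # limit to the first 2 pages of results
    print(len(page), "tweets on this page")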

Customizing our search query

The API documentation has examples of different query string use cases

Examples

Let's remove retweets

In [35]:
new_search = search_words + " -filter:retweets"

Get a new set of tweets using our new search query:

In [36]:
cursor = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   tweet_mode='extended')
tweets = cursor.items(5)

Did it work?

In [37]:
for tweet in tweets:
    print(tweet.full_text)
The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.co/ZwGHQtzFXr
This weeks minisode is live. 

Connie Mack Stadium/Shibe Park’s Final Game &amp; the Night Philly Fans Ripped the Park Apart

#Phillies #Eagles 

https://t.co/1SfJWDzpGQ
#Phillies Bryce Harper on Instagram with a welcome message to new #Sixers coach Doc Rivers
#PhilaUnite https://t.co/wki7vWNCF4
@BauerOutage @Reds #Phillies fan would embrace you! Not all things that happen here are bad!
Just checking in- does Klentak still have a job? Did we sign JT? #FireKlentak #signJT #Phillies

How to save the tweets?

Create a list of the tweets using Python's list comprehension syntax

In [38]:
# select the next 10 tweets
tweets = [t for t in cursor.items(10)]

print(len(tweets))
10

A wealth of information is available

Beyond the text of the tweet, things like favorite and retweet counts are available. You can also extract info on the user behind the tweet.

In [39]:
first_tweet = tweets[0]
In [40]:
first_tweet.full_text
Out[40]:
'Listen to "White Supremacist Groups are Selling “Stand by” Tee Shirts. How Trump Doesn’t Leave the White House!" by Grandpa Jim. ⚓ https://t.co/HJxFaypiLw #meditation #meditations #phillies #pony #horses #derby #football #nyu #boston #nationals #nats #michigan #kansas'
In [41]:
first_tweet.favorite_count
Out[41]:
1
In [42]:
first_tweet.created_at
Out[42]:
datetime.datetime(2020, 10, 1, 20, 55, 1)
In [43]:
first_tweet.retweet_count
Out[43]:
0
In [44]:
first_tweet.user.description
Out[44]:
'Comedy College began in 1999, is home to "Standup Comedy 101" & improv classes. Students have appeared Jimmy Fallon, Conan, HBO, Comedy Central & more!'
In [45]:
first_tweet.user.followers_count
Out[45]:
10659

Let's extract screen names and locations

A fraction of tweets have locations associated with user profiles, giving (very rough) location data.

In [46]:
users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
Out[46]:
[['LaughOutNOW', 'Milwaukee-Chicago'],
 ['Philly__Nation', 'Will re-evaluate in 2-3 weeks'],
 ['mmaratea22', ''],
 ['PABlaylockTX', 'Plano, TX'],
 ['PaulaYWolf', 'Lancaster, Pa.'],
 ['Fumbleruski20', ''],
 ['blovepodcast', 'Philly'],
 ['BirdsBurner0', ''],
 ['bgluckma', 'Richmond, VA'],
 ['philliesbell', 'Philadelphia, PA, USA']]

Note: only about 1% of tweets have an exact latitude/longitude.

This makes it difficult to extract geographic trends without pulling a large number of tweets, which requires the premium API.
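
As a sketch of what using true geotags would look like (assuming tweepy Status objects expose the v1.1 coordinates field as GeoJSON, which stores longitude/latitude order):

# Keep only tweets with an exact latitude/longitude (rare!)
geotagged = [t for t in tweets if t.coordinates is not None]

# Build shapely Points for mapping; GeoJSON stores (longitude, latitude)
points = [Point(t.coordinates["coordinates"]) for t in geotagged]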

Use case #1: calculating word frequencies

An example of text mining

Load the most recent 1,000 tweets

Save the text of 1,000 tweets after querying our cursor object.

In [47]:
cursor = tw.Cursor(api.search,
                   q="#phillies -filter:retweets",
                   lang="en",
                   tweet_mode='extended')
tweets = [tweet for tweet in cursor.items(1000)]
In [48]:
# get the text of the tweets
tweets_text = [tweet.full_text for tweet in tweets]
In [49]:
# the first five tweets
tweets_text[:5]
Out[49]:
['The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics https://t.co/ZwGHQtzFXr',
 'This weeks minisode is live. \n\nConnie Mack Stadium/Shibe Park’s Final Game &amp; the Night Philly Fans Ripped the Park Apart\n\n#Phillies #Eagles \n\nhttps://t.co/1SfJWDzpGQ',
 '#Phillies Bryce Harper on Instagram with a welcome message to new #Sixers coach Doc Rivers\n#PhilaUnite https://t.co/wki7vWNCF4',
 '@BauerOutage @Reds #Phillies fan would embrace you! Not all things that happen here are bad!',
 'Just checking in- does Klentak still have a job? Did we sign JT? #FireKlentak #signJT #Phillies']

Text mining and dealing with messy data

  1. Remove URLs $\rightarrow$ regular expressions
  2. Remove stop words
  3. Remove the search terms

Step 1: removing URLs

Regular expressions

The pattern we'll use: https://t.co/[A-Za-z\d]+|&amp

This will identify "t.co" in URLs, e.g. https://t.co/Sp1Qtf5Fnl

Don't worry about mastering regular expression syntax...

StackOverflow is your friend
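
As a quick illustration, re.sub(pattern, replacement, text) returns the text with every match of the pattern replaced. With the URL part of the pattern above:

import re

# Replace any t.co URL with an empty string
re.sub("https://t.co/[A-Za-z\\d]+", "", "Great win! https://t.co/Sp1Qtf5Fnl")
# returns 'Great win! '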

In [50]:
def remove_url(txt):
    """
    Replace URLs found in a text string with nothing 
    (i.e. it will remove the URL from the string).

    Parameters
    ----------
    txt : string
        A text string that you want to parse and remove urls.

    Returns
    -------
    The same txt string with URLs removed.
    """
    import re
    return " ".join(re.sub("https://t.co/[A-Za-z\\d]+|&amp", "", txt).split())

Remove any URLs

In [51]:
tweets_no_urls = [remove_url(tweet) for tweet in tweets_text]
tweets_no_urls[:5]
Out[51]:
['The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics',
 'This weeks minisode is live. Connie Mack Stadium/Shibe Park’s Final Game ; the Night Philly Fans Ripped the Park Apart #Phillies #Eagles',
 '#Phillies Bryce Harper on Instagram with a welcome message to new #Sixers coach Doc Rivers #PhilaUnite',
 '@BauerOutage @Reds #Phillies fan would embrace you! Not all things that happen here are bad!',
 'Just checking in- does Klentak still have a job? Did we sign JT? #FireKlentak #signJT #Phillies']

Extract a list of lower-cased words in a tweet

  • .lower() makes all words lower cased
  • .split() splits a string into the individual words
In [52]:
"This is an Example".lower()
Out[52]:
'this is an example'
In [53]:
"This is an Example".lower().split()
Out[53]:
['this', 'is', 'an', 'example']

Apply these functions to all tweets:

In [54]:
words_in_tweet = [tweet.lower().split() for tweet in tweets_no_urls]
words_in_tweet[:2]
Out[54]:
[['the',
  'last',
  '@mlb',
  'game',
  'at',
  'connie',
  'mack',
  'stadium/shibe',
  'park',
  'took',
  'place',
  'fifty',
  'years',
  'ago,',
  'today.',
  '#phillies',
  '#athletics'],
 ['this',
  'weeks',
  'minisode',
  'is',
  'live.',
  'connie',
  'mack',
  'stadium/shibe',
  'park’s',
  'final',
  'game',
  ';',
  'the',
  'night',
  'philly',
  'fans',
  'ripped',
  'the',
  'park',
  'apart',
  '#phillies',
  '#eagles']]

Counting word frequencies

We'll define a helper function to calculate word frequencies from our lists of words.

In [55]:
def count_word_frequencies(words_in_tweet, top=15):
    """
    Given a list of all words for every tweet, count
    word frequencies across all tweets.
    
    By default, this returns the top 15 words, but you 
    can specify a different value for `top`.
    """
    import itertools, collections

    # List of all words across tweets
    all_words = list(itertools.chain(*words_in_tweet))

    # Create counter
    counter = collections.Counter(all_words)
    
    return pd.DataFrame(counter.most_common(top),
                        columns=['words', 'count'])
In [56]:
counts_no_urls = count_word_frequencies(words_in_tweet, top=15)
counts_no_urls.head(n=15)
Out[56]:
words count
0 the 1331
1 #phillies 934
2 to 527
3 a 479
4 and 441
5 in 424
6 of 351
7 for 230
8 is 196
9 i 178
10 with 169
11 on 161
12 this 158
13 that 133
14 klentak 124

Now let's plot the frequencies

Use seaborn to plot our DataFrame of word counts...

In [57]:
fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
sns.barplot(
    y="words",
    x="count",
    data=counts_no_urls.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Including All Words)", fontsize=16)
Out[57]:
Text(0.5, 1.0, 'Common Words Found in Tweets (Including All Words)')

Step 2: remove stop words and punctuation

Stop words are common words that do not carry much significance and are often ignored in text analysis.

We can use the nltk package.

The "Natural Language Toolkit" https://www.nltk.org/

Import and download the stop words

In [58]:
import nltk
nltk.download('stopwords');
[nltk_data] Downloading package stopwords to /Users/nhand/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Get the list of common stop words

In [59]:
stop_words = list(set(nltk.corpus.stopwords.words('english')))

stop_words[:10]
Out[59]:
['been',
 'there',
 'as',
 'in',
 'where',
 'yourselves',
 'mightn',
 "don't",
 'ourselves',
 'don']
In [61]:
len(stop_words)
Out[61]:
179

Get the list of common punctuation

In [62]:
import string
In [63]:
punctuation = list(string.punctuation)
In [64]:
punctuation[:5]
Out[64]:
['!', '"', '#', '$', '%']

Remove stop words from our tweets

In [65]:
ignored = stop_words + punctuation
In [67]:
ignored[:10]
Out[67]:
['been',
 'there',
 'as',
 'in',
 'where',
 'yourselves',
 'mightn',
 "don't",
 'ourselves',
 'don']
In [68]:
# Remove stop words from each tweet list of words
tweets_nsw = [[word for word in tweet_words if word not in ignored]
              for tweet_words in words_in_tweet]

tweets_nsw[0]
Out[68]:
['last',
 '@mlb',
 'game',
 'connie',
 'mack',
 'stadium/shibe',
 'park',
 'took',
 'place',
 'fifty',
 'years',
 'ago,',
 'today.',
 '#phillies',
 '#athletics']

Get our DataFrame of frequencies

In [69]:
counts_nsw = count_word_frequencies(tweets_nsw)
counts_nsw.head()
Out[69]:
words count
0 #phillies 934
1 klentak 124
2 @phillies 111
3 season 110
4 #eagles 108

And plot...

In [70]:
fig, ax = plt.subplots(figsize=(8, 8))

sns.barplot(
    y="words",
    x="count",
    data=counts_nsw.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Without Stop Words)", fontsize=16);

Step 3: remove our query terms

Now, we'll be left with only the meaningful words...

In [71]:
search_terms = ['#phillies', "phillies", "@phillies"]
tweets_final = [[w for w in word if w not in search_terms]
                 for word in tweets_nsw]
In [72]:
# frequency counts
counts_final = count_word_frequencies(tweets_final)

And now, plot the cleaned tweets...

In [73]:
fig, ax = plt.subplots(figsize=(8, 8))

sns.barplot(
    y="words",
    x="count",
    data=counts_final.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Cleaned)", fontsize=16)
Out[73]:
Text(0.5, 1.0, 'Common Words Found in Tweets (Cleaned)')

At home exercise

Get 1,000 tweets using a query string of your choice and plot the word frequencies.

Be sure to:

  • remove URLs
  • remove stop words / punctuation
  • remove your search query terms

Note: if you try to pull more than 1,000 tweets you will likely run into the rate limit and have to wait 15 minutes.

Remember: the API documentation describes how to customize a query string.

In [ ]:
 

Use case #2: sentiment analysis

The goal of a sentiment analysis is to determine the attitude or emotional state of the person who sent a particular tweet.

Often used by brands to evaluate public opinion about a product.

The goal:

Determine the "sentiment" of every word in the English language

The hard way

Train a machine learning algorithm to classify words as positive vs. negative, given an input training sample of words.
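
For intuition only, here is a toy sketch of that approach with scikit-learn, trained on a tiny made-up sample (this is not the method we use below):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training sample: 1 = positive, 0 = negative
train_text = ["love this team", "great win today", "terrible loss", "worst bullpen ever"]
train_labels = [1, 1, 0, 0]

# Bag-of-words features piped into a logistic regression classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_text, train_labels)

model.predict(["what a great team"])  # array([1])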

The easy way

Luckily, this is a very common task in NLP and there are several packages available that have done the hard work for you.

They provide out-of-the-box sentiment analysis using pre-trained machine learning algorithms.

We'll be using textblob

In [74]:
import textblob

Let's analyze our set of 1,000 #phillies tweets

Create our "text blobs"

Simply pass the tweet text to the TextBlob() object.

Note: it's best to remove any URLs first!

In [75]:
blobs = [textblob.TextBlob(remove_url(t.full_text)) for t in tweets]
In [76]:
blobs[0]
Out[76]:
TextBlob("The last @MLB game at Connie Mack Stadium/Shibe Park took place fifty years ago, today. #Phillies #Athletics")
In [77]:
blobs[0].sentiment
Out[77]:
Sentiment(polarity=-0.2, subjectivity=0.23333333333333334)

Combine the data into a DataFrame

Track the polarity, subjectivity, and date of each tweet. TextBlob's polarity ranges from -1 (most negative) to +1 (most positive), and its subjectivity ranges from 0 (most objective) to 1 (most subjective).

In [78]:
data = {}
data['date'] = [t.created_at for t in tweets]
data['polarity'] = [b.sentiment.polarity for b in blobs]
data['subjectivity'] = [b.sentiment.subjectivity for b in blobs]
data['text'] = [remove_url(t.full_text) for t in tweets]
data = pd.DataFrame(data)
In [79]:
data.head()
Out[79]:
date polarity subjectivity text
0 2020-10-02 00:25:46 -0.200000 0.233333 The last @MLB game at Connie Mack Stadium/Shib...
1 2020-10-01 23:42:17 -0.087879 0.633333 This weeks minisode is live. Connie Mack Stadi...
2 2020-10-01 23:17:33 0.468182 0.677273 #Phillies Bryce Harper on Instagram with a wel...
3 2020-10-01 23:13:55 -0.875000 0.666667 @BauerOutage @Reds #Phillies fan would embrace...
4 2020-10-01 22:59:07 0.000000 0.000000 Just checking in- does Klentak still have a jo...

How many are unbiased?

We can remove tweets with a polarity of zero (neutral) to get a better sense of the emotions present.

In [80]:
zero = (data['polarity']==0).sum()
print("number of unbiased tweets = ", zero)
number of unbiased tweets =  360
In [81]:
# remove unbiased tweets
biased = data.loc[ data['polarity'] != 0 ].copy()

What does a polarized tweet look like?

We can find the tweets with the most positive and most negative polarity scores.

The most negative

Use the idxmin() function:

In [82]:
biased.loc[biased['polarity'].idxmin(), 'text']
Out[82]:
'The #indians Bullpen is Horrendous Its Probably worst Than the #Phillies'

The most positive

Use the idxmax() function

In [83]:
biased.loc[biased['polarity'].idxmax(), 'text']
Out[83]:
'Great Hire by the @sixers! Now @Phillies fire Matt Klentak #Phillies #Sixers #WIP'

Plot a histogram of polarity

We can use matplotlib's hist() function:

In [84]:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# histogram
ax.hist(biased['polarity'], bins='auto')
ax.axvline(x=0, c='k', lw=2)

# format
ax.set_xlabel("Polarity")
ax.set_title("Polarity of #phillies Tweets", fontsize=16);

And subjectivity too...

The most objective

In [85]:
biased.loc[biased['subjectivity'].idxmin(), 'text']
Out[85]:
'Literally don’t know what was harder to watch the #Phillies bullpen this season or the #Debates2020 #PresidentialDebate2020 JESUS. I can’t believe this is what it has come to, can’t even vote but WHERES BERNIE?!?'

The most subjective

In [86]:
biased.loc[biased['subjectivity'].idxmax(), 'text']
Out[86]:
'Happy Birthday Luis Garcia! #Phillies'

The distribution of subjectivity

In [87]:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# histogram
ax.hist(biased['subjectivity'], bins='auto')
ax.axvline(x=0.5, c='k', lw=2)

# format
ax.set_xlabel("Subjectivity")
ax.set_title("Subjectivity of #phillies Tweets", fontsize=16);

How does polarity influence subjectivity?

Are positive/negative tweets more or less objective?

Seaborn's regplot() function

Is there a linear trend?

In [88]:
ax = sns.regplot(x=biased['subjectivity'], y=biased['polarity'])

Seaborn's kdeplot()

Shade the bivariate relationship

In [89]:
ax = sns.kdeplot(data=biased['subjectivity'], data2=biased['polarity'])

Insight: the most subjective tweets tend to be most polarized as well...

We can plot the distribution of polarity by the tweet's hour

First, we'll add a new column that gives the day and hour of the tweet.

We can use the built-in strftime() function.

In [90]:
# this is month/day hour AM/PM
biased['date_string'] = biased['date'].dt.strftime("%-m/%d %I %p")
In [91]:
biased.head()
Out[91]:
date polarity subjectivity text date_string
0 2020-10-02 00:25:46 -0.200000 0.233333 The last @MLB game at Connie Mack Stadium/Shib... 10/02 12 AM
1 2020-10-01 23:42:17 -0.087879 0.633333 This weeks minisode is live. Connie Mack Stadi... 10/01 11 PM
2 2020-10-01 23:17:33 0.468182 0.677273 #Phillies Bryce Harper on Instagram with a wel... 10/01 11 PM
3 2020-10-01 23:13:55 -0.875000 0.666667 @BauerOutage @Reds #Phillies fan would embrace... 10/01 11 PM
6 2020-10-01 22:38:30 0.383333 0.525000 To go full SAT-analogy on this, Rivers is to t... 10/01 10 PM

Sort the tweets in chronological order...

In [92]:
biased = biased.sort_values(by='date', ascending=True)

Make a box and whiskers plot of the polarity

Use Seaborn's boxplot() function

In [93]:
ax = sns.boxplot(y='date_string', x='polarity', data=biased)
ax.axvline(x=0, c='k', lw=2) # neutral

# format
plt.setp(ax.get_yticklabels(), fontsize=10)
ax.figure.set_size_inches((8,12))

And subjectivity over time...

In [94]:
ax = sns.boxplot(y='date_string', x='subjectivity', data=biased)
ax.axvline(x=0.5, c='k', lw=2) # neutral

# format
plt.setp(ax.get_yticklabels(), fontsize=10)
ax.figure.set_size_inches((8,14))

At home exercise: sentiment analysis

Analyze your set of tweets from the last exercise (or get a new set), and explore the sentiments by:

  • plotting histograms of the subjectivity and polarity
  • finding the most/least subjective and polarized tweets
  • plotting the relationship between polarity and subjectivity
  • showing hourly trends in polarity/subjectivity

Or explore trends in some new way!

In [ ]:
 

That's it!

  • Next week: creating your own datasets through web scraping
  • Pre-recorded lecture will be posted on Sunday/Monday
  • See you next Thursday!