A talk at Social Media Lab, Ryerson University on April 25, 2018 discussing the 2016 U.S. election Twitter dataset collected at George Washington University Libraries.
1. What do you do with 280 million tweets from the 2016 U.S. election?
Justin Littman
April 25, 2018
2. Overview
● Outline of the dataset
● Collecting the dataset
○ Social Feed Manager
● Sharing the dataset
○ TweetSets
● Uses of the dataset
● Plans for 2018 U.S. election
15. Social Feed Manager (SFM)
● Open source software by GW Libraries.
● User interface for collecting, managing & exporting social
media data.
● Goal: Lower the technical barriers for collecting social
media data for academic research and archiving.
● Supports Twitter, Tumblr, Flickr & Sina Weibo.
● Intended for organizations to run for their users.
go.gwu.edu/sfm
27. Sharing the dataset
● Twitter’s developer policies require sharing tweet ids only.
● Complete tweets can be “hydrated” from Twitter API.
○ Hydrating complete dataset takes about a month.
● Tweets that are deleted or from accounts that are
protected, deleted, or suspended are not available.
● Provides a “right to be forgotten” but also:
○ Complicates reproducible research
○ Makes it difficult to hold politicians accountable or to research bots.
● However, complete tweets may be shared within the university.
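Hydration, mentioned above, means turning shared tweet ids back into full tweets via the Twitter API. A minimal sketch of why it is slow: the lookup endpoint accepts at most 100 ids per request, so 280 million tweets means millions of rate-limited requests (tools such as twarc handle the auth and rate limiting). Here `fetch_tweets` is a hypothetical stand-in for the authenticated API call, not a real library function.

```python
# Sketch of hydrating tweet ids into full tweets.
# fetch_tweets is a hypothetical callable standing in for an
# authenticated statuses/lookup request (max 100 ids per call).

def batched(ids, size=100):
    """Yield lists of at most `size` ids, matching the lookup endpoint limit."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def hydrate(ids, fetch_tweets):
    """Yield full tweets batch by batch; deleted, protected, or suspended
    accounts' tweets simply come back missing."""
    for batch in batched(ids):
        for tweet in fetch_tweets(batch):
            yield tweet
```

Note that tweets the API no longer returns are silently absent, which is exactly the "right to be forgotten" behavior described above.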
29. Sharing the dataset: Harvard’s Dataverse
● Almost 3,000 downloads (as of mid-2018).
● Each collection has a README.
→ Interested in collaborating on best practices for sharing
datasets.
30. Sharing the dataset: TweetSets
● Open source software by GW Libraries.
● Basic idea: Reuse existing datasets, but allow filtering /
querying for only the tweets that are needed.
● Conforms with Twitter policies.
○ Within university: Complete tweets
○ Public: Tweet ids only
tweetsets.library.gwu.edu
33. TweetSets step 2a: Query the tweets in datasets
● Tweet text
● Hashtags
● Mentions
● Posted by
● In reply to
● Tweet type
● Created at
Also, query by:
● URL
● Has image
● Is geotagged
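As an illustration of the kinds of filters listed above (not TweetSets internals, which sit behind a search index), here is a sketch of applying hashtag, poster, and media filters to tweets represented as dicts in the Twitter v1.1 API's JSON shape:

```python
# Illustrative filter over tweet dicts in the Twitter v1.1 JSON shape.
# This only mimics a few of the query facets listed on the slide.

def matches(tweet, hashtag=None, posted_by=None, has_image=False):
    """Return True if the tweet passes every requested filter."""
    if hashtag is not None:
        tags = [h["text"].lower() for h in tweet["entities"]["hashtags"]]
        if hashtag.lower() not in tags:
            return False
    if posted_by is not None and tweet["user"]["screen_name"] != posted_by:
        return False
    if has_image and not tweet["entities"].get("media"):
        return False
    return True
```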
39. Academic research
● Clare H. Liu, “Applications of Twitter Emotion Detection for
Stock Market Prediction.” Master's thesis at MIT.
● David Anuta, Josh Churchin & Jiebo Luo, “Election Bias:
Comparing Polls and Twitter in the 2016 U.S. Election.”
● Sicheng Zhao, Yue Gao, Guiguang Ding & Tat-Seng
Chua. “Real-Time Multimedia Social Event Detection in
Microblog.” IEEE Transactions on Cybernetics.
● Ahsen J. Uppal & H. Howie Huang, “Event Prediction from
Dynamic Communities in Social Networks.”
40. Journalists
● Significant interest in dataset after release of list of IRA
accounts by Senate Intelligence Committee.
● We identified 36,210 tweets from these accounts.
● Sharing these deleted tweets violates Twitter policy.
● University weighed public interest vs. risk of losing access
to Twitter API for GW researchers.
● See nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
41. Deleted tweets research
● With Catie Bailard (School of Media & Public Affairs,
GWU) & Andy Hoagland (data scientist)
● Possible research questions:
○ What is the substantive content of deleted vs. extant tweets about
the candidate(s)?
○ What was the relative distribution of deleted / extant tweets in
terms of the proportion that were pro- / anti- Hillary / Trump?
○ Were tweets with a certain type of content more likely to be
deleted than those with other types of content?
42. Deleted tweets research
● Possible research questions:
○ What portion of tweets deleted by Twitter were likely-bots vs.
likely-humans? Were there differences in the substantive content
of deleted tweets generated by likely-humans versus likely-bots?
43. Deleted tweets research
● 92 million tweets from October 8th to November 8th,
2016 that contain “Clinton,” “Trump,” “Donald,” “Hillary,”
“@realDonaldTrump” or “@HillaryClinton”.
● Split deleted tweets from extant tweets.
○ 22 million tweets (24%) were deleted
● Created 10% sample of deleted tweets & 1.5% sample of
extant tweets.
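The split-and-sample step above can be sketched as follows. This is an illustrative version, not the study's actual code; the sampling rates match the slide, and `deleted_ids` is assumed to be a set of tweet ids known to be deleted.

```python
import random

def split_and_sample(tweets, deleted_ids,
                     deleted_rate=0.10, extant_rate=0.015, seed=42):
    """Split tweets into deleted vs. extant, then draw a random sample
    of each at the given rates (10% of deleted, 1.5% of extant)."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    deleted, extant = [], []
    for tweet in tweets:
        (deleted if tweet["id_str"] in deleted_ids else extant).append(tweet)
    return (rng.sample(deleted, round(len(deleted) * deleted_rate)),
            rng.sample(extant, round(len(extant) * extant_rate)))
```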
44. Deleted tweets research
● For each tweet in deleted tweets sample, determined
reason for deletion.
○ For example: user suspended, original user suspended, tweet
deleted
● For each user in each of the samples, ran bot detection.
○ Botometer, using API.
○ Used tweets from full dataset, rather than live Twitter.
○ Not all users had enough tweets.
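The last point implies a pre-filtering step: bot detection needs a minimum number of tweets per account, so accounts with too few tweets in the dataset were skipped. A sketch of that step (the threshold here is an assumption, not the study's actual cutoff, and the Botometer call itself is omitted):

```python
from collections import Counter

MIN_TWEETS = 20  # assumed threshold; the study's actual cutoff is not stated

def users_with_enough_tweets(tweets, minimum=MIN_TWEETS):
    """Return the users with at least `minimum` tweets in the dataset --
    the only accounts worth sending to a bot-detection service."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    return {user for user, n in counts.items() if n >= minimum}
```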
45. Deleted tweets research
● Performing content analysis of 3,000 tweets.
○ Coding for overall “gist” (anti-Trump, anti-Hillary, pro-Trump,
and/or pro-Hillary), specific subject matter (e.g., criticizes
candidate’s personal qualities or past actions, calls-to-action),
identity (e.g., race, gender), more.
○ Three humans code each tweet using DiscoverText.
○ Average Krippendorff’s Alpha score of 0.73.
● Will use neural-network machine learning to generalize to
the larger dataset.
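Krippendorff's Alpha, used above to report inter-coder reliability, measures agreement corrected for chance (1.0 = perfect agreement, 0 = chance level). For illustration, a minimal nominal-data implementation, not the tooling used in the study (real analyses typically use a vetted library):

```python
from collections import Counter

def krippendorff_alpha_nominal(data):
    """Krippendorff's alpha for nominal codes.
    data: one list per unit, holding each coder's code (None = missing)."""
    # Build the coincidence matrix from units coded by at least two coders.
    o = Counter()
    for unit in data:
        codes = [c for c in unit if c is not None]
        m = len(codes)
        if m < 2:
            continue
        for i, ci in enumerate(codes):
            for j, cj in enumerate(codes):
                if i != j:
                    o[(ci, cj)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: mismatch = 1).
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1 - d_o / d_e
```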
50. Plans for #election2018: Currently collecting
● Top accounts
○ 5,000+ accounts extracted from the neutral collection because
each was a top tweeter, retweeted account, or mentioned account.
○ Add new accounts every week from rolling 2 weeks of tweets.
○ Already seeing significant churn as accounts are suspended.
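The account-extraction step above can be sketched like this: count how often each account appears as the tweeter, the retweeted account, or a mentioned account, then keep the top N. An illustrative sketch over tweet dicts in the Twitter v1.1 JSON shape, not the actual SFM pipeline:

```python
from collections import Counter

def top_accounts(tweets, n=5000):
    """Rank accounts by how often they appear as tweeter,
    retweeted account, or mentioned account."""
    counts = Counter()
    for t in tweets:
        counts[t["user"]["screen_name"]] += 1
        if "retweeted_status" in t:
            counts[t["retweeted_status"]["user"]["screen_name"]] += 1
        for mention in t["entities"]["user_mentions"]:
            counts[mention["screen_name"]] += 1
    return [user for user, _ in counts.most_common(n)]
```

Re-running this weekly over the rolling two-week window is what surfaces the new accounts mentioned above.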
51. Plans for #election2018:
● Individual candidates
● Local parties
● Local hashtags
→ Currently in discussions with a news organization to
collaborate on identifying these accounts / hashtags.
→ Thinking about how to “cut through noise” to collect tweets
from citizens.
→ Working on contemporaneous web archiving of linked web
resources and media.