A talk at Social Media Lab, Ryerson University on April 25, 2018 discussing the 2016 U.S. election Twitter dataset collected at George Washington University Libraries.
1. What do you do with 280 million tweets from the 2016 U.S. election?
Justin Littman
April 25, 2018
2. Overview
● Outline of the dataset
● Collecting the dataset
○ Social Feed Manager
● Sharing the dataset
○ TweetSets
● Uses of the dataset
● Plans for 2018 U.S. election
15. Social Feed Manager (SFM)
● Open source software by GW Libraries.
● User interface for collecting, managing & exporting social
media data.
● Goal: Lower the technical barriers for collecting social
media data for academic research and archiving.
● Supports Twitter, Tumblr, Flickr & Sina Weibo.
● Intended for organizations to run for their users.
go.gwu.edu/sfm
27. Sharing the dataset
● Twitter’s developer policies require sharing tweet ids only.
● Complete tweets can be “hydrated” from Twitter API.
○ Hydrating complete dataset takes about a month.
● Tweets that are deleted or from accounts that are
protected, deleted, or suspended are not available.
● Provides a “right to be forgotten” but also:
○ Complicates reproducible research
○ Makes it difficult to hold politicians accountable or to research bots.
● However, complete tweets may be shared within the university.
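Hydration, mentioned above, means turning shared tweet ids back into full tweets via the Twitter API. A minimal sketch of why it is slow: the lookup endpoint accepts at most 100 ids per request, so 280 million tweets means millions of rate-limited requests (tools such as twarc handle the auth and rate limiting). Here `fetch_tweets` is a hypothetical stand-in for the authenticated API call, not a real library function.

```python
# Sketch of hydrating tweet ids into full tweets.
# fetch_tweets is a hypothetical callable standing in for an
# authenticated statuses/lookup request (max 100 ids per call).

def batched(ids, size=100):
    """Yield lists of at most `size` ids, matching the lookup endpoint limit."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def hydrate(ids, fetch_tweets):
    """Yield full tweets batch by batch; deleted, protected, or suspended
    accounts' tweets simply come back missing."""
    for batch in batched(ids):
        for tweet in fetch_tweets(batch):
            yield tweet
```

Note that tweets the API no longer returns are silently absent, which is exactly the "right to be forgotten" behavior described above.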
29. Sharing the dataset: Harvard’s Dataverse
● Almost 3,000 downloads (as of mid-2018).
● Each collection has a README.
→ Interested in collaborating on best practices for sharing
datasets.
30. Sharing the dataset: TweetSets
● Open source software by GW Libraries.
● Basic idea: Reuse existing datasets, but allow filtering /
querying for only the tweets that are needed.
● Conforms with Twitter policies.
○ Within university: Complete tweets
○ Public: Tweet ids only
tweetsets.library.gwu.edu
33. TweetSets step 2a: Query the tweets in datasets
● Tweet text
● Hashtags
● Mentions
● Posted by
● In reply to
● Tweet type
● Created at
Also, query by:
● URL
● Has image
● Is geotagged
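As an illustration of the kinds of filters listed above (not TweetSets internals, which sit behind a search index), here is a sketch of applying hashtag, poster, and media filters to tweets represented as dicts in the Twitter v1.1 API's JSON shape:

```python
# Illustrative filter over tweet dicts in the Twitter v1.1 JSON shape.
# This only mimics a few of the query facets listed on the slide.

def matches(tweet, hashtag=None, posted_by=None, has_image=False):
    """Return True if the tweet passes every requested filter."""
    if hashtag is not None:
        tags = [h["text"].lower() for h in tweet["entities"]["hashtags"]]
        if hashtag.lower() not in tags:
            return False
    if posted_by is not None and tweet["user"]["screen_name"] != posted_by:
        return False
    if has_image and not tweet["entities"].get("media"):
        return False
    return True
```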
39. Academic research
● Clare H. Liu, “Applications of Twitter Emotion Detection for
Stock Market Prediction.” Master's thesis at MIT.
● David Anuta, Josh Churchin & Jiebo Luo, “Election Bias:
Comparing Polls and Twitter in the 2016 U.S. Election.”
● Sicheng Zhao, Yue Gao, Guiguang Ding & Tat-Seng
Chua. “Real-Time Multimedia Social Event Detection in
Microblog.” IEEE Transactions on Cybernetics.
● Ahsen J. Uppal & H. Howie Huang, “Event Prediction from
Dynamic Communities in Social Networks.”
40. Journalists
● Significant interest in dataset after release of list of IRA
accounts by Senate Intelligence Committee.
● We identified 36,210 tweets from these accounts.
● Sharing these deleted tweets violates Twitter policy.
● University weighed public interest vs. risk of losing access
to Twitter API for GW researchers.
● See nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
41. Deleted tweets research
● With Catie Bailard (School of Media & Public Affairs,
GWU) & Andy Hoagland (data scientist)
● Possible research questions:
○ What is the substantive content of deleted vs. extant tweets about
the candidate(s)?
○ What was the relative distribution of deleted / extant tweets in
terms of the proportion that were pro- / anti- Hillary / Trump?
○ Were tweets with a certain type of content more likely to be
deleted than those with other types of content?
42. Deleted tweets research
● Possible research questions:
○ What portion of tweets deleted by Twitter were likely-bots vs.
likely-humans? Were there differences in the substantive content
of deleted tweets generated by likely-humans versus likely-bots?
43. Deleted tweets research
● 92 million tweets from October 8th to November 8th,
2016 that contain “Clinton,” “Trump,” “Donald,” “Hillary,”
“@realDonaldTrump” or “@HillaryClinton”.
● Split deleted tweets from extant tweets.
○ 22 million tweets (24%) were deleted
● Created 10% sample of deleted tweets & 1.5% sample of
extant tweets.
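The split-and-sample step above can be sketched as follows. This is an illustrative version, not the study's actual code; the sampling rates match the slide, and `deleted_ids` is assumed to be a set of tweet ids known to be deleted.

```python
import random

def split_and_sample(tweets, deleted_ids,
                     deleted_rate=0.10, extant_rate=0.015, seed=42):
    """Split tweets into deleted vs. extant, then draw a random sample
    of each at the given rates (10% of deleted, 1.5% of extant)."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    deleted, extant = [], []
    for tweet in tweets:
        (deleted if tweet["id_str"] in deleted_ids else extant).append(tweet)
    return (rng.sample(deleted, round(len(deleted) * deleted_rate)),
            rng.sample(extant, round(len(extant) * extant_rate)))
```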
44. Deleted tweets research
● For each tweet in deleted tweets sample, determined
reason for deletion.
○ For example: user suspended, original user suspended, tweet
deleted
● For each user in each of the samples, ran bot detection.
○ Botometer, using API.
○ Used tweets from full dataset, rather than live Twitter.
○ Not all users had enough tweets.
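The last point implies a pre-filtering step: bot detection needs a minimum number of tweets per account, so accounts with too few tweets in the dataset were skipped. A sketch of that step (the threshold here is an assumption, not the study's actual cutoff, and the Botometer call itself is omitted):

```python
from collections import Counter

MIN_TWEETS = 20  # assumed threshold; the study's actual cutoff is not stated

def users_with_enough_tweets(tweets, minimum=MIN_TWEETS):
    """Return the users with at least `minimum` tweets in the dataset --
    the only accounts worth sending to a bot-detection service."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    return {user for user, n in counts.items() if n >= minimum}
```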
45. Deleted tweets research
● Performing content analysis of 3,000 tweets.
○ Coding for overall “gist” (anti-Trump, anti-Hillary, pro-Trump,
and/or pro-Hillary), specific subject matter (e.g., criticizes
candidate’s personal qualities or past actions, calls-to-action),
identity (e.g., race, gender), more.
○ Three humans code each tweet using DiscoverText.
○ Average Krippendorff’s Alpha score of 0.73.
● Will use neural-network machine learning to generalize to
the larger dataset.
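Krippendorff's Alpha, used above to report inter-coder reliability, measures agreement corrected for chance (1.0 = perfect agreement, 0 = chance level). For illustration, a minimal nominal-data implementation, not the tooling used in the study (real analyses typically use a vetted library):

```python
from collections import Counter

def krippendorff_alpha_nominal(data):
    """Krippendorff's alpha for nominal codes.
    data: one list per unit, holding each coder's code (None = missing)."""
    # Build the coincidence matrix from units coded by at least two coders.
    o = Counter()
    for unit in data:
        codes = [c for c in unit if c is not None]
        m = len(codes)
        if m < 2:
            continue
        for i, ci in enumerate(codes):
            for j, cj in enumerate(codes):
                if i != j:
                    o[(ci, cj)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    # Observed vs. expected disagreement (nominal metric: mismatch = 1).
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1 - d_o / d_e
```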
50. Plans for #election2018: Currently collecting
● Top accounts
○ 5,000+ accounts extracted from the neutral collection because
each was a top tweeter, retweeted account, or mentioned account.
○ Add new accounts every week from rolling 2 weeks of tweets.
○ Already seeing significant churn as accounts are suspended.
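The account-extraction step above can be sketched like this: count how often each account appears as the tweeter, the retweeted account, or a mentioned account, then keep the top N. An illustrative sketch over tweet dicts in the Twitter v1.1 JSON shape, not the actual SFM pipeline:

```python
from collections import Counter

def top_accounts(tweets, n=5000):
    """Rank accounts by how often they appear as tweeter,
    retweeted account, or mentioned account."""
    counts = Counter()
    for t in tweets:
        counts[t["user"]["screen_name"]] += 1
        if "retweeted_status" in t:
            counts[t["retweeted_status"]["user"]["screen_name"]] += 1
        for mention in t["entities"]["user_mentions"]:
            counts[mention["screen_name"]] += 1
    return [user for user, _ in counts.most_common(n)]
```

Re-running this weekly over the rolling two-week window is what surfaces the new accounts mentioned above.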
51. Plans for #election2018:
● Individual candidates
● Local parties
● Local hashtags
→ Currently in discussions with a news organization to
collaborate on identifying these accounts / hashtags.
→ Thinking about how to “cut through noise” to collect tweets
from citizens.
→ Working on contemporaneous web archiving of linked web
resources and media.