What do you do with
280 million tweets from the
2016 U.S. election?
Justin Littman
April 25, 2018
Overview
● Outline of the dataset
● Collecting the dataset
○ Social Feed Manager
● Sharing the dataset
○ TweetSets
● Uses of the dataset
● Plans for 2018 U.S. election
Outline of the dataset
Datasets
Filter stream:
● Candidates and key election
hashtags
● Democratic Convention
● GOP Convention
● First presidential debate
● Second presidential debate
● Third presidential debate
● Vice-presidential debate
● Election Day
User timelines:
● Democratic candidates
● Democratic Party
● Republican candidates
● Republican Party
Candidates and key election hashtags
● Track: election2016, election, clinton, kaine, trump, pence
● Follow: @realDonaldTrump, @HillaryClinton, @timkaine,
@mike_pence
● 251,077,140 tweets
● July 13, 2016 - November 10, 2016
Democratic Convention
● Track: philly convention, philadelphia convention,
democratic convention, dnc convention, #demsinphilly,
#dnc, #philly, #demconvention
● Follow: @DemConvention, @TheDemocrats
● 8,340,668 tweets
● July 22, 2016 - July 30, 2016
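As a rough illustration of how Track terms like the ones above select tweets, the sketch below assumes the matching rules Twitter documented for its filter stream: separate phrases are ORed, the words within one phrase are ANDed, matching ignores case, and a plain keyword also matches its hashtag form. The function names and the simplified tokenizer are hypothetical and are not taken from SFM or Twitter code.

```python
import re

# Simplified tokenizer (an assumption): words, optionally prefixed by # or @.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokens(text):
    """Return the lowercase tokens of a tweet, with and without #/@ prefixes."""
    found = set()
    for tok in TOKEN_RE.findall(text.lower()):
        found.add(tok)               # "#dnc" stays matchable as "#dnc"
        found.add(tok.lstrip("#@"))  # and also as the plain keyword "dnc"
    return found


def matches_track(text, track):
    """True if any phrase has all of its words present in the tweet text."""
    toks = tokens(text)
    return any(
        all(term in toks for term in phrase.lower().split())
        for phrase in track
    )


track = ["philly convention", "democratic convention", "#dnc"]
print(matches_track("Heading to the convention in Philly!", track))
print(matches_track("Watching #DNC tonight", track))
print(matches_track("Best Philly cheesesteak ever", track))
```

Under these rules a broad term such as #philly matches any tweet with that hashtag, whether or not the tweet is about the convention.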
Democratic Candidates
● Accounts: @BernieSanders, @HillaryClinton,
@MartinOMalley, @SenSanders, @timkaine
● 22,251 tweets
● Collected every week
Tweet types
Most retweeted
Top tweeters
561k tweets, 15 followers
suspended
deleted
deleted
tweets primarily in Greek
577k tweets, last tweeted Nov 7, 2017
126k tweets, 5 followers
deleted
still tweeting (915k) at non-human rates
Top mentions
Where is @timkaine?
Top hashtags
Republicans clearly
out-hashtagged the
Democrats.
Top URLs
spam
spam
gone
gone
gone
gone
Collecting the dataset
Social Feed Manager (SFM)
● Open source software by GW Libraries.
● User interface for collecting, managing & exporting social
media data.
● Goal: Lower the technical barriers for collecting social
media data for academic research and archiving.
● Supports Twitter, Tumblr, Flickr & Sina Weibo.
● Intended for organizations to run for their users.
go.gwu.edu/sfm
Step 1a: Create a collection
Step 1b: Describe the collection
Step 1c: Specify what is to be collected
Step 2: Turn on collecting
Step 3: Monitor collecting
Step 4: Export
Collecting got
off to a rough start ...
Dataset caveats: Holes
Candidates and key election hashtags dataset by week
Family road trip to Michigan &
Canada. We loved Toronto!
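One simple way to look for holes like this is to compare each week's tweet count with a typical week. The sketch below is only an illustration: the weekly counts are invented, and the 10% threshold and the function name are assumptions rather than anything used for the real dataset.

```python
from statistics import median


def find_holes(weekly_counts, ratio=0.1):
    """Return the weeks whose count falls below ratio * the median week."""
    typical = median(weekly_counts.values())
    return [
        week
        for week, count in sorted(weekly_counts.items())
        if count < typical * ratio
    ]


# Invented counts, for illustration only.
weekly_counts = {
    "2016-W30": 14_000_000,
    "2016-W31": 15_000_000,
    "2016-W32": 200_000,
    "2016-W33": 13_500_000,
    "2016-W34": 16_000_000,
}
print(find_holes(weekly_counts))
```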
Dataset caveats: Rate limits
Tweet rate (by minute) from Democratic Convention
Rate limit plateau
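A plateau of this kind can be spotted in per-minute counts as a run of consecutive minutes that all sit at or near the maximum observed rate. The sketch below is a minimal, assumed approach; the tolerance, minimum run length, and sample counts are illustrative choices, not values from the talk.

```python
def plateau_runs(counts, tolerance=0.02, min_length=3):
    """Find runs of consecutive minutes within `tolerance` of the peak rate.

    Returns a list of (start_index, end_index) pairs, both inclusive.
    """
    if not counts:
        return []
    threshold = max(counts) * (1 - tolerance)
    runs, start = [], None
    for i, count in enumerate(counts):
        if count >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(counts) - start >= min_length:
        runs.append((start, len(counts) - 1))
    return runs


# Invented per-minute counts, for illustration only.
per_minute = [1200, 2500, 3000, 2990, 3000, 2995, 1800, 900]
print(plateau_runs(per_minute))
```

Minutes inside such a run are the ones where tweets were most likely dropped, so counts for those minutes are best read as lower bounds.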
Dataset caveats: Non-U.S. election tweets
Sharing the dataset
● Twitter’s developer policies require sharing tweet ids only.
● Complete tweets can be “hydrated” from Twitter API.
○ Hydrating complete dataset takes about a month.
● Tweets that are deleted or from accounts that are
protected, deleted, or suspended are not available.
● Provides a “right to be forgotten” but also:
○ Complicates reproducible research
○ Makes it difficult to hold politicians accountable or to research bots.
● However, complete tweets can be shared within the university.
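The "about a month" estimate can be checked with back-of-the-envelope arithmetic. The sketch below assumes limits that are not stated in the slides: 100 tweet ids per lookup request and 900 requests per 15-minute window. These figures are chosen to resemble Twitter's documented lookup limits of the period and should be treated as assumptions, as should the function names.

```python
from itertools import islice


def batches(ids, size=100):
    """Yield successive lists of at most `size` tweet ids."""
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def hydration_days(total_ids, ids_per_request=100,
                   requests_per_window=900, window_minutes=15):
    """Estimate the days needed to hydrate `total_ids` under a rate limit."""
    requests = -(-total_ids // ids_per_request)       # ceiling division
    windows = -(-requests // requests_per_window)     # ceiling division
    return windows * window_minutes / (60 * 24)


print(round(hydration_days(280_000_000), 1))   # estimate for the full dataset
print([len(b) for b in batches(range(250))])   # batch sizes for 250 ids
```

With a single set of credentials and these assumed limits, 280 million ids come to roughly 2.8 million requests, which lands in the same range as the month quoted above.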
Sharing the dataset: Harvard’s Dataverse
doi.org/10.7910/DVN/PDI7IN
● Almost 3,000 downloads (as of April 2018).
● Each collection has a README.
→ Interested in collaborating on best practices for sharing
datasets.
Sharing the dataset: TweetSets
● Open source software by GW Libraries.
● Basic idea: Reuse existing datasets, but allow users to filter /
query for only the tweets that are needed.
● Conforms with Twitter policies.
○ Within university: Complete tweets
○ Public: Tweet ids only
tweetsets.library.gwu.edu
TweetSets step 1: Select source datasets
TweetSets step 2a: Query the tweets in datasets
Also, query by:
● Tweet text
● Hashtags
● Mentions
● Posted by
● In reply to
● Tweet type
● Created at
● URL
● Has image
● Is geotagged
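To make the idea of filtering an existing dataset concrete, here is a minimal sketch that applies a few of the query fields listed above to tweets held in memory. It is not how TweetSets is implemented; the dictionary keys, the function name, and the sample tweets are all hypothetical.

```python
def matches(tweet, text=None, hashtag=None, posted_by=None,
            tweet_type=None, has_image=None):
    """Return True if the tweet satisfies every criterion that was given."""
    if text is not None and text.lower() not in tweet["text"].lower():
        return False
    if hashtag is not None and hashtag.lower() not in {
        h.lower() for h in tweet["hashtags"]
    }:
        return False
    if posted_by is not None and tweet["user"].lower() != posted_by.lower():
        return False
    if tweet_type is not None and tweet["type"] != tweet_type:
        return False
    if has_image is not None and tweet["has_image"] != has_image:
        return False
    return True


# Invented tweets, for illustration only.
tweets = [
    {"id": 1, "text": "Debate night! #debates", "hashtags": ["debates"],
     "user": "alice", "type": "original", "has_image": False},
    {"id": 2, "text": "RT so true #MAGA", "hashtags": ["MAGA"],
     "user": "bob", "type": "retweet", "has_image": True},
]

# A public export would contain tweet ids only.
selected_ids = [t["id"] for t in tweets
                if matches(t, hashtag="maga", tweet_type="retweet")]
print(selected_ids)
```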
TweetSets step 2b: View summary statistics
TweetSets step 2c: View sample tweets
TweetSets step 3: Create a dataset
TweetSets step 4: Export
Uses of the dataset
Academic research
● Clare H. Liu, “Applications of Twitter Emotion Detection for
Stock Market Prediction.” Master’s thesis at MIT.
● David Anuta, Josh Churchin & Jiebo Luo, “Election Bias:
Comparing Polls and Twitter in the 2016 U.S. Election.”
● Sicheng Zhao, Yue Gao, Guiguang Ding & Tat-Seng
Chua. “Real-Time Multimedia Social Event Detection in
Microblog.” IEEE Transactions on Cybernetics.
● Ahsen J. Uppal & H. Howie Huang, “Event Prediction from
Dynamic Communities in Social Networks.”
Journalists
● Significant interest in dataset after release of list of IRA
accounts by Senate Intelligence Committee.
● We identified 36,210 tweets from these accounts.
● Sharing these deleted tweets violates Twitter policy.
● University weighed public interest vs. risk of losing access
to Twitter API for GW researchers.
● See
nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
Deleted tweets research
● With Catie Bailard (School of Media & Public Affairs,
GWU) & Andy Hoagland (data scientist)
● Possible research questions:
○ What is the substantive content of deleted vs. extant tweets about
the candidate(s)?
○ What was the relative distribution of deleted / extant tweets in
terms of the proportion that were pro- / anti- Hillary / Trump?
○ Were tweets with a certain type of content more likely to be
deleted than those with other types of content?
○ What portion of tweets deleted by Twitter were likely-bots vs.
likely-humans? Were there differences in the substantive content
of deleted tweets generated by likely-humans versus likely-bots?
Deleted tweets research
● 92 million tweets posted between October 8th and November 8th,
2016 that contain “Clinton,” “Trump,” “Donald,” “Hillary,”
“@realDonaldTrump” or “@HillaryClinton”.
● Split deleted tweets from extant tweets.
○ 22 million tweets (24%) were deleted
● Created 10% sample of deleted tweets & 1.5% sample of
extant tweets.
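The figures above can be reproduced with a few lines of arithmetic, and the sampling step can be sketched with a seeded random sample. The slides do not say how the samples were drawn, so simple random sampling without replacement is an assumption here, as are the function name and the seed.

```python
import random

TOTAL = 92_000_000      # tweets matching the query (from the slide)
DELETED = 22_000_000    # tweets no longer available (from the slide)
EXTANT = TOTAL - DELETED

deleted_share = DELETED / TOTAL
deleted_sample_size = round(DELETED * 0.10)
extant_sample_size = round(EXTANT * 0.015)
print(f"{deleted_share:.0%} deleted")
print(deleted_sample_size, extant_sample_size)


def sample_fraction(items, fraction, seed=0):
    """Draw a reproducible simple random sample of `fraction` of the items."""
    rng = random.Random(seed)
    k = round(len(items) * fraction)
    return rng.sample(items, k)


# Small stand-in list of tweet ids, for illustration only.
toy_ids = list(range(1000))
print(len(sample_fraction(toy_ids, 0.10)))
```

Sampling the two groups at different rates means any combined estimate needs to weight each group by its sampling fraction.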
Deleted tweets research
● For each tweet in deleted tweets sample, determined
reason for deletion.
○ For example: user suspended, original user suspended, tweet
deleted
● For each user in each of the samples, ran bot detection.
○ Botometer, using API.
○ Used tweets from full dataset, rather than live Twitter.
○ Not all users had enough tweets.
Deleted tweets research
● Performing content analysis of 3,000 tweets.
○ Coding for overall “gist” (anti-Trump, anti-Hillary, pro-Trump,
and/or pro-Hillary), specific subject matter (e.g., criticizes
candidate’s personal qualities or past actions, calls-to-action),
identity (e.g., race, gender), more.
○ Three humans code each tweet using DiscoverText.
○ Average Krippendorff’s Alpha score 0.73.
● Will use neural network machine learning to generalize to
larger dataset.
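For readers unfamiliar with the agreement score mentioned above, the sketch below computes Krippendorff's alpha for nominal codes from a coincidence matrix. It illustrates the standard formula only; it is not the DiscoverText computation, and the function name and the toy codings are made up.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the codes that the
    coders assigned to one tweet (missing codes are simply left out).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded by fewer than two coders carries no information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one code is used")
    return 1 - (n - 1) * observed / expected


# Made-up codings by two coders for four tweets.
toy = [["pro", "pro"], ["pro", "anti"], ["anti", "anti"], ["anti", "anti"]]
print(round(krippendorff_alpha_nominal(toy), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.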
Delete reasons
Botometer scores for deleted tweets
Plans for 2018 election
Plans for #election2018: Currently collecting
● Neutral: #Nov2018, #Election2018, #Nov18, #Election18,
#Midterms2018, #Midterms18, #Midterm2018,
#Midterm18, #midtermelection, #election, #vote, 2018
election, election 2018, midterm election
● Partisan Republican: #trump, #maga, #gop, #republican,
#trumptrain, #kag
● Partisan Democrat: #bluewave2018, #bluewave18,
#bluewave, #democrats, #resist, #resistance
Plans for #election2018: Currently collecting
● Top accounts
○ 5,000+ accounts extracted from the neutral collection because each was a top
tweeter, retweeted account, or mentioned account.
○ Add new accounts every week from rolling 2 weeks of tweets.
○ Already seeing significant churn as accounts are suspended.
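The weekly account-extraction step described above could be sketched as follows. This is an assumed approach, not the actual process: the dictionary keys, the function name, the cut-off of top_n accounts per category, and the sample tweets are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_accounts(tweets, now, window_days=14, top_n=3):
    """Pick the top tweeters, retweeted accounts, and mentioned accounts
    from the tweets created in the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    tweeters, retweeted, mentioned = Counter(), Counter(), Counter()
    for tweet in tweets:
        if tweet["created_at"] < cutoff:
            continue  # outside the rolling window
        tweeters[tweet["user"]] += 1
        if tweet.get("retweet_of"):
            retweeted[tweet["retweet_of"]] += 1
        mentioned.update(tweet.get("mentions", []))

    selected = set()
    for counter in (tweeters, retweeted, mentioned):
        selected.update(name for name, _ in counter.most_common(top_n))
    return selected


# Invented tweets, for illustration only.
now = datetime(2018, 4, 25)
sample = [
    {"user": "a", "created_at": datetime(2018, 4, 20),
     "retweet_of": "c", "mentions": ["d"]},
    {"user": "a", "created_at": datetime(2018, 4, 21),
     "retweet_of": None, "mentions": []},
    {"user": "b", "created_at": datetime(2018, 3, 1),
     "retweet_of": "e", "mentions": ["f"]},
]
print(sorted(top_accounts(sample, now, top_n=1)))
```

Running this each week and adding only accounts not already collected would give the incremental growth the slide describes.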
Plans for #election2018:
● Individual candidates
● Local parties
● Local hashtags
→ Currently in discussions with a news organization to
collaborate on identifying these accounts / hashtags.
→ Thinking about how to “cut through noise” to collect tweets
from citizens.
→ Working on contemporaneous web archiving of linked web
resources and media.
#election2018: Topic Tracker
bit.ly/2J0EKFj
Questions?
More info:
● go.gwu.edu/gwsfm
● @SocialFeedMgr
● sfm@gwu.edu
Or:
● @justin_littman
● justinlittman@gwu.edu

More Related Content

What's hot

Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
Diana Maynard
 
Practicing Data Science Responsibly
Practicing Data Science ResponsiblyPracticing Data Science Responsibly
Practicing Data Science Responsibly
rahulbot
 
DMAP: Data Aggregation and Presentation Framework
DMAP: Data Aggregation and Presentation FrameworkDMAP: Data Aggregation and Presentation Framework
DMAP: Data Aggregation and Presentation Framework
Parang Saraf
 
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
CSCJournals
 
Social media data stewardship: The ethics of social media data use for research
Social media data stewardship: The ethics of social media data use for researchSocial media data stewardship: The ethics of social media data use for research
Social media data stewardship: The ethics of social media data use for research
Toronto Metropolitan University
 
The language of social media
The language of social mediaThe language of social media
The language of social media
Diana Maynard
 
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
The Innovative Data Intelligence Research (IDIR) Laboratory, University of Texas at Arlington
 
Grounded theory meets big data: One way to marry ethnography and digital methods
Grounded theory meets big data: One way to marry ethnography and digital methodsGrounded theory meets big data: One way to marry ethnography and digital methods
Grounded theory meets big data: One way to marry ethnography and digital methods
Citizens in the Making
 
Big Data @ CBS
Big Data @ CBSBig Data @ CBS
Big Data @ CBS
Piet J.H. Daas
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYCOpen Analytics
 
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
Rich Heimann
 
Guestlecture on #bigdata
Guestlecture on #bigdataGuestlecture on #bigdata
BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
Identifying Influencers on Social Media Using Social Network Analysis
Identifying Influencers on Social Media Using Social Network AnalysisIdentifying Influencers on Social Media Using Social Network Analysis
Identifying Influencers on Social Media Using Social Network Analysis
Felipe Bonow Soares
 
GeospatialDataAnalysis
GeospatialDataAnalysisGeospatialDataAnalysis
GeospatialDataAnalysisTaylor Graham
 
Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?
Rich Heimann
 
Term=machine+learning - Experiments in #textanalysis
Term=machine+learning - Experiments in #textanalysisTerm=machine+learning - Experiments in #textanalysis
Term=machine+learning - Experiments in #textanalysis
Suresh Manian
 
DIY basic Facebook data mining
DIY basic Facebook data miningDIY basic Facebook data mining
DIY basic Facebook data mining
STEM/MARK
 
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
Justin Littman
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Deepak K
 

What's hot (20)

Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
 
Practicing Data Science Responsibly
Practicing Data Science ResponsiblyPracticing Data Science Responsibly
Practicing Data Science Responsibly
 
DMAP: Data Aggregation and Presentation Framework
DMAP: Data Aggregation and Presentation FrameworkDMAP: Data Aggregation and Presentation Framework
DMAP: Data Aggregation and Presentation Framework
 
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
Twitter Based Sentiment Analysis of Each Presidential Candidate Using Long Sh...
 
Social media data stewardship: The ethics of social media data use for research
Social media data stewardship: The ethics of social media data use for researchSocial media data stewardship: The ethics of social media data use for research
Social media data stewardship: The ethics of social media data use for research
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
Toward Automated Fact-Checking: Detecting Check-worthy Factual Claims by Clai...
 
Grounded theory meets big data: One way to marry ethnography and digital methods
Grounded theory meets big data: One way to marry ethnography and digital methodsGrounded theory meets big data: One way to marry ethnography and digital methods
Grounded theory meets big data: One way to marry ethnography and digital methods
 
Big Data @ CBS
Big Data @ CBSBig Data @ CBS
Big Data @ CBS
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
 
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in ...
 
Guestlecture on #bigdata
Guestlecture on #bigdataGuestlecture on #bigdata
Guestlecture on #bigdata
 
BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
BD-ACA Week8a
 
Identifying Influencers on Social Media Using Social Network Analysis
Identifying Influencers on Social Media Using Social Network AnalysisIdentifying Influencers on Social Media Using Social Network Analysis
Identifying Influencers on Social Media Using Social Network Analysis
 
GeospatialDataAnalysis
GeospatialDataAnalysisGeospatialDataAnalysis
GeospatialDataAnalysis
 
Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?Why L-3 Data Tactics Data Science?
Why L-3 Data Tactics Data Science?
 
Term=machine+learning - Experiments in #textanalysis
Term=machine+learning - Experiments in #textanalysisTerm=machine+learning - Experiments in #textanalysis
Term=machine+learning - Experiments in #textanalysis
 
DIY basic Facebook data mining
DIY basic Facebook data miningDIY basic Facebook data mining
DIY basic Facebook data mining
 
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
Presentation at National Forum on Ethics & Archiving the Web (March 23, 2018)
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.
 

Similar to What do you do with 280 million tweets from the 2016 U.S. election?

Broker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on TwitterBroker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on Twitter
Cybersecurity Education and Research Centre
 
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
University of Groningen (The Netherlands)
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Farida Vis
 
Social Feed Manager: Developing Software and Offering Services to Support Soc...
Social Feed Manager: Developing Software and Offering Services to Support Soc...Social Feed Manager: Developing Software and Offering Services to Support Soc...
Social Feed Manager: Developing Software and Offering Services to Support Soc...
Laura Wrubel
 
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA ProjectA Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
Axel Bruns
 
Research with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical ConsiderationsResearch with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical Considerations
Toronto Metropolitan University
 
User Behaviour Pattern Recognition On Twitter Social Network
User Behaviour Pattern Recognition On Twitter Social NetworkUser Behaviour Pattern Recognition On Twitter Social Network
User Behaviour Pattern Recognition On Twitter Social Network
George Konstantakopoulos
 
Data augmented ethnography: 
using big data and ethnography to explore candi...
Data augmented ethnography: 
using big data and ethnography  to explore candi...Data augmented ethnography: 
using big data and ethnography  to explore candi...
Data augmented ethnography: 
using big data and ethnography to explore candi...
Salla-Maaria Laaksonen
 
Analyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-TweetsAnalyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-Tweets
RESHAN FARAZ
 
Who are We Studying: Humans or Bots?
Who are We Studying: Humans or Bots? Who are We Studying: Humans or Bots?
Who are We Studying: Humans or Bots?
Toronto Metropolitan University
 
Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?
Axel Bruns
 
Thesis oral defense 2015 elvis saravia
Thesis oral defense 2015  elvis saraviaThesis oral defense 2015  elvis saravia
Thesis oral defense 2015 elvis saravia
Elvis Saravia
 
Challenges in-archiving-twitter
Challenges in-archiving-twitterChallenges in-archiving-twitter
Challenges in-archiving-twitter
Katrin Weller
 
IRJET - Political Orientation Prediction using Social Media Activity
IRJET -  	  Political Orientation Prediction using Social Media ActivityIRJET -  	  Political Orientation Prediction using Social Media Activity
IRJET - Political Orientation Prediction using Social Media Activity
IRJET Journal
 
User Classification of Organization and Organization Affiliated Users during ...
User Classification of Organization and Organization Affiliated Users during ...User Classification of Organization and Organization Affiliated Users during ...
User Classification of Organization and Organization Affiliated Users during ...
Hemant Purohit
 
Social Media Data Analytics
Social Media Data AnalyticsSocial Media Data Analytics
Social Media Data Analytics
Dr.(Mrs).Gethsiyal Augasta
 
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITYFRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
cscpconf
 
Geo-information and Twitter Use
Geo-information and Twitter UseGeo-information and Twitter Use
Geo-information and Twitter UseHan Woo PARK
 
DP1_160430723010_Divya.pptx
DP1_160430723010_Divya.pptxDP1_160430723010_Divya.pptx
DP1_160430723010_Divya.pptx
DivyaPatel729457
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...
Diana Maynard
 

Similar to What do you do with 280 million tweets from the 2016 U.S. election? (20)

Broker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on TwitterBroker Bots: Analyzing automated activity during High Impact Events on Twitter
Broker Bots: Analyzing automated activity during High Impact Events on Twitter
 
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
Automated Analysis of Journalists' and Politicians' Online Behavior on Social...
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
 
Social Feed Manager: Developing Software and Offering Services to Support Soc...
Social Feed Manager: Developing Software and Offering Services to Support Soc...Social Feed Manager: Developing Software and Offering Services to Support Soc...
Social Feed Manager: Developing Software and Offering Services to Support Soc...
 
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA ProjectA Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
 
Research with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical ConsiderationsResearch with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical Considerations
 
User Behaviour Pattern Recognition On Twitter Social Network
User Behaviour Pattern Recognition On Twitter Social NetworkUser Behaviour Pattern Recognition On Twitter Social Network
User Behaviour Pattern Recognition On Twitter Social Network
 
Data augmented ethnography: 
using big data and ethnography to explore candi...
Data augmented ethnography: 
using big data and ethnography  to explore candi...Data augmented ethnography: 
using big data and ethnography  to explore candi...
Data augmented ethnography: 
using big data and ethnography to explore candi...
 
Analyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-TweetsAnalyzing-Threat-Levels-of-Extremists-using-Tweets
Analyzing-Threat-Levels-of-Extremists-using-Tweets
 
Who are We Studying: Humans or Bots?
Who are We Studying: Humans or Bots? Who are We Studying: Humans or Bots?
Who are We Studying: Humans or Bots?
 
Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?
 
Thesis oral defense 2015 elvis saravia
Thesis oral defense 2015  elvis saraviaThesis oral defense 2015  elvis saravia
Thesis oral defense 2015 elvis saravia
 
Challenges in-archiving-twitter
Challenges in-archiving-twitterChallenges in-archiving-twitter
Challenges in-archiving-twitter
 
IRJET - Political Orientation Prediction using Social Media Activity
IRJET -  	  Political Orientation Prediction using Social Media ActivityIRJET -  	  Political Orientation Prediction using Social Media Activity
IRJET - Political Orientation Prediction using Social Media Activity
 
User Classification of Organization and Organization Affiliated Users during ...
User Classification of Organization and Organization Affiliated Users during ...User Classification of Organization and Organization Affiliated Users during ...
User Classification of Organization and Organization Affiliated Users during ...
 
Social Media Data Analytics
Social Media Data AnalyticsSocial Media Data Analytics
Social Media Data Analytics
 
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITYFRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
FRAMEWORK FOR ANALYZING TWITTER TO DETECT COMMUNITY SUSPICIOUS CRIME ACTIVITY
 
Geo-information and Twitter Use
Geo-information and Twitter UseGeo-information and Twitter Use
Geo-information and Twitter Use
 
DP1_160430723010_Divya.pptx
DP1_160430723010_Divya.pptxDP1_160430723010_Divya.pptx
DP1_160430723010_Divya.pptx
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...
 

Recently uploaded

This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
VivekSinghShekhawat2
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
keoku
 
Latest trends in computer networking.pptx
Latest trends in computer networking.pptxLatest trends in computer networking.pptx
Latest trends in computer networking.pptx
JungkooksNonexistent
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
GTProductions1
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
eutxy
 


What do you do with 280 million tweets from the 2016 U.S. election?

  • 1. What do you do with 280 million tweets from the 2016 U.S. election? Justin Littman April 25, 2018
  • 2. Overview ● Outline of the dataset ● Collecting the dataset ○ Social Feed Manager ● Sharing the dataset ○ TweetSets ● Uses of the dataset ● Plans for 2018 U.S. election
  • 3. Outline of the dataset
  • 4. Datasets Filter stream: ● Candidates and key election hashtags ● Democratic Convention ● GOP Convention ● First presidential debate ● Second presidential debate ● Third presidential debate ● Vice-presidential debate ● Election Day User timelines: ● Democratic candidates ● Democratic Party ● Republican candidates ● Republican Party
  • 5. Candidates and key election hashtags ● Track: election2016, election, clinton, kaine, trump, pence ● Follow: @realDonaldTrump, @HillaryClinton, @timkaine, @mike_pence ● 251,077,140 tweets ● July 13, 2016 - November 10, 2016
  • 6. Democratic Convention ● Track: philly convention, philadelphia convention, democratic convention, dnc convention, #demsinphilly, #dnc, #philly, #demconvention ● Follow: @DemConvention, @TheDemocrats ● 8,340,668 tweets ● July 22, 2016 - July 30, 2016
  • 7. Democratic Candidates ● Accounts: @BernieSanders, @HillaryClinton, @MartinOMalley, @SenSanders, @timkaine ● 22,251 tweets ● Collected every week
  • 10. Top tweeters 561k tweets, 15 followers suspended deleted deleted tweets primarily in Greek 577k tweets, last tweeted Nov 7, 2017 126k tweets, 5 followers deleted still tweeting (915k) at non-human rates
  • 11. Top mentions Where is @timkaine?
  • 15. Social Feed Manager (SFM) ● Open source software by GW Libraries. ● User interface for collecting, managing & exporting social media data. ● Goal: Lower the technical barriers for collecting social media data for academic research and archiving. ● Supports Twitter, Tumblr, Flickr & Sina Weibo. ● Intended for organizations to run for their users. go.gwu.edu/sfm
  • 16. Step 1a: Create a collection
  • 17. Step 1b: Describe the collection
  • 18. Step 1c: Specify what is to be collected
  • 19. Step 2: Turn on collecting
  • 20. Step 3: Monitor collecting
  • 22. Collecting got off to a rough start ...
  • 23. Dataset caveats: Holes Candidates and key election hashtags dataset by week Family road trip to Michigan & Canada. We loved Toronto!
  • 24. Dataset caveats: Rate limits Tweet rate (by minute) from Democratic Convention Rate limit plateau
  • 25. Dataset caveats: Non-U.S. election tweets
  • 27. Sharing the dataset ● Twitter’s developer policies require sharing tweet ids only. ● Complete tweets can be “hydrated” from the Twitter API. ○ Hydrating the complete dataset takes about a month. ● Tweets that are deleted or from accounts that are protected, deleted, or suspended are not available. ● Provides a “right to be forgotten” but also: ○ Complicates reproducible research ○ Makes it difficult to hold politicians accountable or to research bots. ● However, complete tweets can be shared within the university.
  • 28. Sharing the dataset: Harvard’s Dataverse doi.org/10.7910/DVN/PDI7IN
  • 29. Sharing the dataset: Harvard’s Dataverse ● Almost 3,000 downloads (as of mid-2018). ● Each collection has a README. → Interested in collaborating on best practices for sharing datasets.
  • 30. Sharing the dataset: TweetSets ● Open source software by GW Libraries. ● Basic idea: Reuse existing datasets, but allow users to filter / query for only the tweets that are needed. ● Conforms with Twitter policies. ○ Within university: Complete tweets ○ Public: Tweet ids only tweetsets.library.gwu.edu
  • 31. TweetSets step 1: Select source datasets
  • 32. TweetSets step 2a: Query the tweets in datasets
  • 33. TweetSets step 2a: Query the tweets in datasets Also, query by: ● Tweet text ● Hashtags ● Mentions ● Posted by ● In reply to ● Tweet type ● Created at ● URL ● Has image ● Is geotagged
  • 34. TweetSets step 2b: View summary statistics
  • 35. TweetSets step 2c: View sample tweets
  • 36. TweetSets step 3: Create a dataset
  • 38. Uses of the dataset
  • 39. Academic research ● Clare H. Liu, “Applications of Twitter Emotion Detection for Stock Market Prediction.” Master’s thesis at MIT. ● David Anuta, Josh Churchin & Jiebo Luo, “Election Bias: Comparing Polls and Twitter in the 2016 U.S. Election.” ● Sicheng Zhao, Yue Gao, Guiguang Ding & Tat-Seng Chua. “Real-Time Multimedia Social Event Detection in Microblog.” IEEE Transactions on Cybernetics. ● Ahsen J. Uppal & H. Howie Huang, “Event Prediction from Dynamic Communities in Social Networks.”
  • 40. Journalists ● Significant interest in dataset after release of list of IRA accounts by Senate Intelligence Committee. ● We identified 36,210 tweets from these accounts. ● Sharing these deleted tweets violates Twitter policy. ● University weighed public interest vs. risk of losing access to Twitter API for GW researchers. ● See nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
  • 41. Deleted tweets research ● With Catie Bailard (School of Media & Public Affairs, GWU) & Andy Hoagland (data scientist) ● Possible research questions: ○ What is the substantive content of deleted vs. extant tweets about the candidate(s)? ○ What was the relative distribution of deleted / extant tweets in terms of the proportion that were pro- / anti- Hillary / Trump? ○ Were tweets with a certain type of content more likely to be deleted than those with other types of content?
  • 42. Deleted tweets research ● Possible research questions: ○ What portion of tweets deleted by Twitter were likely-bots vs. likely-humans? Were there differences in the substantive content of deleted tweets generated by likely-humans versus likely-bots?
  • 43. Deleted tweets research ● 92 million tweets between October 8th and November 8th, 2016 which contain “Clinton,” “Trump,” “Donald,” “Hillary,” “@realDonaldTrump” or “@HillaryClinton”. ● Split deleted tweets from extant tweets. ○ 22 million tweets (24%) were deleted ● Created 10% sample of deleted tweets & 1.5% sample of extant tweets.
  • 44. Deleted tweets research ● For each tweet in deleted tweets sample, determined reason for deletion. ○ For example: user suspended, original user suspended, tweet deleted ● For each user in each of the samples, ran bot detection. ○ Botometer, using API. ○ Used tweets from full dataset, rather than live Twitter. ○ Not all users had enough tweets.
  • 45. Deleted tweets research ● Performing content analysis of 3000 tweets. ○ Coding for overall “gist” (anti-Trump, anti-Hillary, pro-Trump, and/or pro-Hillary), specific subject matter (e.g., criticizes candidate’s personal qualities or past actions, calls-to-action), identity (e.g., race, gender), more. ○ Three humans code each tweet using DiscoverText. ○ Average Krippendorff’s Alpha score 0.73. ● Will use neural network machine learning to generalize to larger dataset.
  • 47. Botometer scores for deleted tweets
  • 48. Plans for 2018 election
  • 49. Plans for #election2018: Currently collecting ● Neutral: #Nov2018, #Election2018, #Nov18, #Election18, #Midterms2018, #Midterms18, #Midterm2018, #Midterm18, #midtermelection, #election, #vote, 2018 election, election 2018, midterm election ● Partisan Republican: #trump, #maga, #gop, #republican, #trumptrain, #kag ● Partisan Democrat: #bluewave2018, #bluewave18, #bluewave, #democrats, #resist, #resistance
  • 50. Plans for #election2018: Currently collecting ● Top accounts ○ 5,000+ accounts extracted from the neutral collection because each is a top tweeter, retweeted account, or mentioned account. ○ Add new accounts every week from rolling 2 weeks of tweets. ○ Already seeing significant churn as accounts are suspended.
  • 51. Plans for #election2018: ● Individual candidates ● Local parties ● Local hashtags → Currently in discussions with a news organization to collaborate on identifying these accounts / hashtags. → Thinking about how to “cut through noise” to collect tweets from citizens. → Working on contemporaneous web archiving of linked web resources and media.
  • 54. Questions? More info: ● go.gwu.edu/gwsfm ● @SocialFeedMgr ● sfm@gwu.edu Or: ● @justin_littman ● justinlittman@gwu.edu
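
The split-and-sample step described in the deleted tweets research slides can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `split_deleted` and `sample_fraction` are hypothetical, the tweet ids are toy integers, and it assumes a tweet counts as deleted when its id no longer comes back from hydration.

```python
import random


def split_deleted(collected_ids, hydrated_ids):
    """Split collected tweet ids into (deleted, extant).

    Assumption: an id that hydration did not return is treated as deleted.
    """
    hydrated = set(hydrated_ids)
    deleted = [t for t in collected_ids if t not in hydrated]
    extant = [t for t in collected_ids if t in hydrated]
    return deleted, extant


def sample_fraction(ids, fraction, seed=0):
    """Draw a simple random sample containing `fraction` of the ids."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = round(len(ids) * fraction)
    return rng.sample(ids, k)


# Toy data: 1,000 collected ids, of which every 4th fails to hydrate.
collected = list(range(1, 1001))
hydrated = [t for t in collected if t % 4 != 0]

deleted, extant = split_deleted(collected, hydrated)

# Sampling rates taken from the slides: 10% of deleted, 1.5% of extant.
deleted_sample = sample_fraction(deleted, 0.10)
extant_sample = sample_fraction(extant, 0.015)
```

Sampling the smaller deleted group at a higher rate than the extant group keeps both samples at a workable size for bot detection and hand coding.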