Short presentation, given to Galvanize's Data Science Immersive Fellows on February 22, 2019 in San Francisco, on Twitter's publicly available archives of Tweets and media that the company believed resulted from potentially state-backed information operations on its service.
2. 2
Twitter’s Elections Integrity Datasets
My background:
High-performance computing, large-scale graph-based algorithms
Data scientist, machine learning instructor
Instructor, leveling up our engineering talent
3. Twitter’s focus is on a healthy public conversation.
4. 4
Twitter Safety
@TwitterSafety
Working with our industry peers today, we have suspended 284 accounts from Twitter for engaging in coordinated manipulation. Based on our existing analysis, it appears many of these accounts originated from Iran.
21 Aug 2018, 1.8K Retweets, 3.7K Likes
We will continue to strengthen Twitter against attempted manipulation, including malicious
automated accounts and spam, as well as other activities that violate our Terms of Service.
5. 5
In line with our principles of transparency and to improve public understanding of alleged foreign
influence campaigns, Twitter is making publicly available archives of Tweets and media that we
believe resulted from potentially state-backed information operations on our service.
https://about.twitter.com/en_us/values/elections-integrity.html#data
Datasets released so far:
October 2018: Internet Research Agency, Iran
January 2019: Bangladesh, Iran, Russia, Venezuela (2 sets)
6. 6
What’s included?
These datasets include all public, non-deleted Tweets and media (e.g., images and
videos) from accounts we believe are connected to state-backed information
operations. Tweets deleted by these users prior to their suspension (which are not
included in these datasets) comprise less than 1% of their overall activity.
Note that not all of the accounts we identified as connected to these campaigns
actively Tweeted, so the number of accounts represented in the datasets may be
less than the total number of accounts listed here.
8. 8
Accounts and Tweets: <dataset>_users_csv_hashed.zip and <dataset>_tweets_csv_hashed.zip are
compressed CSV files with the following fields:
name                      description
tweetid                   tweet identification number
userid                    user identification number(1)
user_screen_name          Twitter handle of the user(2)
user_reported_location    user’s self-reported location(3)
user_profile_description  user’s profile description(3)
user_display_name         name of the user(2)
user_profile_url          user’s profile URL(3)
follower_count            number of accounts following the user(3)
following_count           number of accounts followed by the user(3)
account_creation_date     date of user account creation
account_language          language of the account, as chosen by the user
tweet_text                text of the tweet(4)
tweet_time                time when the tweet was published (UTC)
tweet_client_name         name of the client app used to publish the tweet
tweet_language            language of the tweet
in_reply_to_tweetid       tweetid of the original tweet that this tweet is in reply to(5)
in_reply_to_userid        userid of the original tweet that this tweet is in reply to(5)
quoted_tweet_tweetid      tweetid of the original tweet that this tweet is quoting(5)
(1) anonymized for users which had fewer than 5,000 followers at the time of suspension
(2) same as userid for anonymized users
(3) at the time of suspension
(4) mentions of anonymized accounts have been replaced with anonymized userid
(5) for replies only
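As a rough illustration of how the fields above might be consumed, the sketch below streams rows out of one of the compressed CSV archives and flags anonymized accounts using footnote (2), where the screen name equals the userid. The helper names `read_zipped_csv` and `is_anonymized` are hypothetical, and the UTF-8 encoding and the presence of one or more `.csv` members inside the zip are assumptions rather than documented facts.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_path):
    """Yield one dict per row from every .csv member of a zip archive.

    Assumes (not confirmed by the slides) that members are UTF-8 encoded
    CSV files whose first line is the header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.endswith(".csv"):
                continue
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def is_anonymized(row):
    """Footnote (2): the screen name is the same as the userid for anonymized users."""
    return row["user_screen_name"] == row["userid"]
```

Reading identifiers as strings, as `csv.DictReader` does, keeps both numeric and hashed userids intact without any type conversion.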
9. 9
Accounts and Tweets: <dataset>_users_csv_hashed.zip and <dataset>_tweets_csv_hashed.zip are
compressed CSV files with the following fields: (cont.)
name             description
is_retweet       True/False, is this tweet a retweet
retweet_userid   userid who authored the original tweet(6)
retweet_tweetid  tweetid of the original tweet(6)
latitude         geo-located latitude, if available
longitude        geo-located longitude, if available
quote_count      number of tweets quoting this tweet
reply_count      number of tweets replying to this tweet
like_count       number of likes that this tweet received(7)
retweet_count    number of retweets that this tweet received(7)
hashtags         list of hashtags used in this tweet(8)
urls             list of urls used in this tweet(8)
user_mentions    list of userids who are mentioned in this tweet(9)
poll_choices     list of the poll choices(10) (11)
(6) for retweets only
(7) these engagement counts exclude engagements from users who are suspended, deleted or otherwise actioned against by Twitter at the time of this data release
(8) space separated
(9) includes anonymized userids
(10) if a tweet included a poll
(11) | separated
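The list-valued fields need a little parsing before use. The sketch below follows the separators given in footnotes (8) and (11) and the True/False convention of is_retweet; `parse_tweet_row` is a hypothetical helper, and the released files may serialize these columns differently from what the slide describes. The user_mentions field is left untouched because the slides do not state its separator.

```python
def parse_tweet_row(row):
    """Return a copy of one tweets-CSV row with list and boolean fields decoded.

    Separators follow the slide footnotes: hashtags and urls are space
    separated (8), poll_choices are separated by "|" (11). This is an
    illustrative sketch, not Twitter's reference parser.
    """
    parsed = dict(row)

    # is_retweet is described as True/False; compare case-insensitively.
    parsed["is_retweet"] = (row.get("is_retweet") or "").strip().lower() == "true"

    # Footnote (8): space-separated lists; an empty string gives an empty list.
    for field in ("hashtags", "urls"):
        parsed[field] = (row.get(field) or "").split()

    # Footnotes (10) and (11): present only for polls, "|" separated.
    choices = row.get("poll_choices") or ""
    parsed["poll_choices"] = [choice for choice in choices.split("|") if choice]

    return parsed
```

Missing or empty columns decode to an empty list (or False for is_retweet), so downstream code does not need to special-case tweets without hashtags, links, or polls.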
10. 10
Media:
<dataset>_profile_banner_hashed.zip    profile photos and profile banners(1)
<dataset>_tweet_media_hashed           folder containing tweet media with numbered .zip files
<dataset>_tweet_media_hashed_README    details which users’ tweet media are in which numbered .zip file
<dataset>_periscope_hashed.zip         periscope broadcasts, where each sub-folder contains the users' broadcasts(2)
(1) users with the default Twitter profile pic and/or banner are not included
(2) users without a Periscope account are not included; users with a Periscope account with no broadcasts have an empty sub-folder
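Footnote (2) suggests a simple way to tell Periscope users with broadcasts from those without: count the files in each sub-folder of the archive. The sketch below does this with the standard `zipfile` module. The function name `broadcasts_per_user` is hypothetical, and the layout it assumes (one top-level sub-folder per user, with empty folders stored as directory entries) is a guess that should be checked against the actual archive.

```python
import zipfile
from collections import Counter


def broadcasts_per_user(zip_path):
    """Map each top-level sub-folder of a zip archive to its file count.

    Assumes each top-level sub-folder corresponds to one user; empty
    sub-folders (users with no broadcasts) are reported with a count of 0.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            folder = info.filename.split("/", 1)[0]
            if info.is_dir():
                counts[folder] += 0  # register the folder even if it is empty
            elif "/" in info.filename:
                counts[folder] += 1  # a file stored inside a sub-folder
    return dict(counts)
```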