Credibility Ranking of Tweets during High Impact Events
Internship
1. User Classification
Approach:
• Classified the existing users into 4 categories
• Treated one category as positive, one as negative, and two as unknown
• Collected the latest 1000 tweets from each of about 2000 users (about 1000 users per class)
• Built a classifier using the bag-of-words technique
2. Categories
User Location and User TimeZone were used to categorize users into 4 types.
Type  Location                                          TimeZone   Percentage
1     Empty                                             Singapore  31
2     Singapore/sg/+65/spore/s'pore/place in Singapore  anything   44
3     City/country other than Singapore                 Singapore  13
4     Random text                                       Singapore  11
3. Considered types 1 and 4 as unknown, type 2 as positive, and type 3 as negative
Collected 1000 tweets from each of 1000 users of types 2 and 3 (data collection took over a day).
Used the sklearn package to build the classifier.
Used sklearn's stop-word removal together with our own tokenizer.
Used 80% of the data as the training set and 20% as the test set.
Used the svm.LinearSVC classifier.
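The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the internship: the tokenizer `tweet_tokenizer`, the toy documents, and the choice of one concatenated string of tweets per user are all assumptions made for the example. Only `CountVectorizer`, `train_test_split`, and `LinearSVC` correspond to the sklearn pieces named above.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def tweet_tokenizer(text):
    # Hypothetical stand-in for the custom tokenizer: keeps words,
    # #hashtags and @mentions as single tokens.
    return re.findall(r"[#@]?\w+", text.lower())


def build_classifier():
    # Bag of words with sklearn's built-in English stop-word list.
    vectorizer = CountVectorizer(
        tokenizer=tweet_tokenizer,
        token_pattern=None,  # not used when a tokenizer is supplied
        stop_words="english",
    )
    return make_pipeline(vectorizer, LinearSVC(random_state=0))


# Toy data (assumption): one string per user, label 1 = type 2 (positive),
# label 0 = type 3 (negative).
docs = [
    "lunch at orchard road then mrt home",
    "hawker centre dinner with friends lah",
    "stuck on the mrt again near bugis",
    "sunset at marina bay tonight",
    "kopi and kaya toast this morning",
    "subway delays in manhattan again",
    "watching the game in chicago tonight",
    "rainy morning on the london tube",
    "road trip through texas this weekend",
    "coffee near the harbour in sydney",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0, stratify=labels
)

clf = build_classifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of test users labelled correctly
```

With real data, `docs` would hold the collected tweets of each type 2 and type 3 user, and the unknown users (types 1 and 4) would be passed to `clf.predict` afterwards.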
9. Out of 2989 users in the selected region, 1713 were scanned.
• This region is expected to contain a high number of bots.
• Users are classified using Bot or Not.
• The region is 1900-2100 friends vs. 0-2000 followers.
• Only users expected to be non-protected and to have more than 100 tweets were scanned (2100, of which 400 failed).
10. The first 677 users in the DB were tested with Bot or Not
11. Number of Protected Users
Count of tweets  Protected (sg)  Not protected (sg)  Protected (all)  Not protected (all)
< 100            18 978          85 328              43 909           131 125
>= 100           71 141          99 484              201 241          248 970
Total            80 219          184 812             245 150          380 095
Combined total (sg): 265 031
Combined total (all): 625 245
12. Bot or Not Test by Truthy
• Out of the 99 484 users who are probably non-protected and have at least 100 tweets, 85 280 were tested with the Truthy score.
• The pie chart shows the distribution of users by their bot-likelihood score.
13. What happens when we follow users? (20K users)
Type                             Percentage  JSON field
User is the sender of the tweet  71          ['tweet']['user']['id_str']
User's tweet has been retweeted  7.67        ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         11.2        ['tweet']['in_reply_to_user_id_str']
User mentions and unknown        10          -
14. What happens when we follow users? 24th June to 9th July (20K users*)
Type                             Percentage   JSON field
User is the sender of the tweet  55.2 (230K)  ['tweet']['user']['id_str']
User's tweet has been retweeted  14.6         ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         8.78         ['tweet']['in_reply_to_user_id_str']
User mention                     0.23         ['tweet']['entities']['user_mentions']
Unknown                          20.9         -
* These 20K users are different from the users on the previous slide.
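A rough sketch of how each received tweet could be assigned to one of the types in the table above. It assumes every stored record wraps the tweet under a `'tweet'` key, as the JSON paths suggest, and that `followed_ids` is the set of followed user IDs as strings. The order of the checks, and treating either `retweeted_status` field as a match, are assumptions of this example rather than the method actually used.

```python
from collections import Counter


def classify_stream_event(record, followed_ids):
    """Return why a tweet was delivered for the set of followed users."""
    tweet = record["tweet"]

    # The followed user wrote the tweet.
    if tweet.get("user", {}).get("id_str") in followed_ids:
        return "sender"

    # A followed user's tweet was retweeted (either field may match; assumption).
    retweeted = tweet.get("retweeted_status")
    if retweeted and (
        retweeted.get("user", {}).get("id_str") in followed_ids
        or retweeted.get("in_reply_to_user_id_str") in followed_ids
    ):
        return "retweeted"

    # The tweet is a reply to a followed user.
    if tweet.get("in_reply_to_user_id_str") in followed_ids:
        return "replied_to"

    # A followed user is mentioned in the tweet.
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m.get("id_str") in followed_ids for m in mentions):
        return "mentioned"

    return "unknown"


def category_shares(records, followed_ids):
    """Percentage of records per category."""
    counts = Counter(classify_stream_event(r, followed_ids) for r in records)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Checking the sender first means a followed user retweeting another followed user is counted once, as "sender".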
15. Some statistics on 221K tweets (known category) (cont.)
20K users were followed, but tweets were received from only 3752 distinct users. Only 221K of the 230K tweets are analyzed.
Count of tweets            Field
40.9K                      in_reply_to_user_id is not null
37.5K                      in_reply_to_status_id is not null
1938 (360 distinct users)  geo is true
3363                       in_reply_to_user_id is set but in_reply_to_status_id is not
90.2K                      Retweets
16. Some statistics on 131K tweets (known category)
The 131K tweets are the 221K tweets from the previous slide with retweets removed.
The 131K tweets come from 3533 distinct users.
Count of tweets             Field
40.9K*                      in_reply_to_user_id is not null
37.5K*                      in_reply_to_status_id is not null
1938 (360 distinct users)*  geo is true
3363*                       in_reply_to_user_id is set but in_reply_to_status_id is not
* Value is the same as on the previous slide.
17. Some statistics on 131K tweets (known category)
Count of tweets  Number of users  Source
43 214           1331             Twitter for iPhone
38 256           1079             Twitter for Android
12 577           974              Twitter Web Client
5 658            1016             Instagram
3 420            256              Facebook
3 066            67               TweetDeck
...              ...              ...
2005             1                AFF Autotweet
...              ...              ...
18. Some statistics on 131K tweets (known category): tweets mentioning a URL
Count of tweets  Users  Tweet domain
96K              2681   Null (no URL mentioned)
5.9K             794    Twitter.com
5.7K             1037   Instagram
URL count             Number of tweets
1 only                34 852
2 only                481
3 only                24
4 only (the maximum)  1
Out of the 34.8K tweets with a URL, 15K have a URL domain that differs from the actual domain.
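The per-domain and per-URL-count tallies above could be produced along these lines. This is only a sketch: it assumes each tweet carries the standard `entities.urls` list with an `expanded_url` field, and it does not attempt the comparison between the URL domain and the actual domain, since the slide does not say how the actual domain was obtained.

```python
from collections import Counter
from urllib.parse import urlparse


def url_domains(tweet):
    """Return the domain of every URL attached to a tweet."""
    domains = []
    for url in tweet.get("entities", {}).get("urls", []):
        # Prefer the expanded URL; fall back to the shortened one.
        host = urlparse(url.get("expanded_url") or url.get("url") or "").netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.append(host)
    return domains


def summarize_urls(tweets):
    """Count tweets by number of URLs and by domain (None = no URL)."""
    tweets_by_url_count = Counter()
    tweets_by_domain = Counter()
    for tweet in tweets:
        domains = url_domains(tweet)
        tweets_by_url_count[len(domains)] += 1
        # A tweet counts once per distinct domain it mentions.
        tweets_by_domain.update(set(domains) or {None})
    return tweets_by_url_count, tweets_by_domain
```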
20. 2M geotagged tweets collected from Oct 30th
Source               Tweet count  Percentage
Twitter for iPhone   817K         40.8%
Twitter for Android  641K         32%
Instagram            265K         13.3%
Foursquare           193K         9.7%
Others               83.5K        4.2%
21. What happens when we follow users? From 3rd July to 9th July (20K users^)
Type                             Percentage  JSON field
User is the sender of the tweet  52 (76.5K)  ['tweet']['user']['id_str']
User's tweet has been retweeted  16.8        ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         8.9         ['tweet']['in_reply_to_user_id_str']
User mentions and unknown        25.5        -
^ These 20K users are the same as on the previous slide.
22. Some statistics on 76.5K tweets (known category)
20K users were followed over 5 days, but tweets were received from only 2950 users.
Count of tweets           Field
13.4K                     in_reply_to_user_id is not null
12.3K                     in_reply_to_status_id is not null
720 (231 distinct users)  geo is true
1099                      in_reply_to_user_id is set but in_reply_to_status_id is not
23. From 3rd July to 9th July
Out of 76.5K tweets, only 720 (0.94%) are geotagged.
Out of 76.5K tweets, 7K showed a positive location (type 1 or 2).
Out of the 720 geotagged tweets, 330 showed a positive location (type 1 or 2).
Out of the 720 geotagged tweets, 175 showed a positive location (type 2 only).
About 60 of the 76.5K tweets are duplicates.
24. Two months
Started collecting user-timeline tweets of unknown-category users on April 28th, 2015.
Ran our location detection on about 1.4M tweets.
6% of the tweets showed a positive location.
Format: name, number of times, number of users.
The displayed statistics cover about 2695 users.
25. Some statistics on 11M tweets (unknown category)
26K users over two months
Count of tweets             Field
2.05M                       in_reply_to_user_id is not null
1.96M                       in_reply_to_status_id is not null
115K (6913 distinct users)  geo is true
94.7K                       in_reply_to_user_id is set but in_reply_to_status_id is not
26. Out of 6913 users (unknown category)
Geo tweets      User count  User percent  User count (min. 30 tweets)
<1%             2293        33%           2293
<2% and >=1%    902         13%           902
<5% and >=2%    1260        18.2%         1210
<10% and >=5%   813         11.7%         743
<25% and >=10%  794         11.4%         636
<50% and >=25%  462         6.6%          296
>=50%           389         5.6%          174
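The bucketing in this table can be reproduced with a small helper such as the one below. The input format (a mapping from user ID to a pair of geotagged-tweet count and total-tweet count) and the reading of "min. 30 tweets" as a threshold on a user's total tweets are assumptions of this sketch.

```python
from bisect import bisect_right
from collections import Counter

# Bucket boundaries in percent, matching the rows of the table.
EDGES = [1, 2, 5, 10, 25, 50]
LABELS = [
    "<1%",
    "<2% and >=1%",
    "<5% and >=2%",
    "<10% and >=5%",
    "<25% and >=10%",
    "<50% and >=25%",
    ">=50%",
]


def geo_bucket(geo_tweets, total_tweets):
    """Label for the share of a user's tweets that are geotagged."""
    percent = 100.0 * geo_tweets / total_tweets
    # bisect_right puts a value equal to an edge into the higher bucket.
    return LABELS[bisect_right(EDGES, percent)]


def bucket_users(user_counts, min_tweets=0):
    """Count users per bucket; users with fewer than min_tweets are skipped."""
    buckets = Counter()
    for geo_tweets, total_tweets in user_counts.values():
        if total_tweets > 0 and total_tweets >= min_tweets:
            buckets[geo_bucket(geo_tweets, total_tweets)] += 1
    return buckets
```

Calling `bucket_users(counts)` gives the "User count" column, and `bucket_users(counts, min_tweets=30)` gives the restricted column.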
28. A few statistics on 5.96M tweets from known Singaporeans
Count of tweets      Field
1.25M                in_reply_to_user_id is not null
1.15M                in_reply_to_status_id is not null
302K (10421 users)   geo is true
101K                 in_reply_to_user_id is set but in_reply_to_status_id is not
From about 31K users, with at most the last 200 tweets per user.
29. Out of 10421 users (known category)
Geo tweets      User count  User percent
<1%             1272        12.2%
<2% and >=1%    1434        13.7%
<5% and >=2%    1928        18.5%
<10% and >=5%   1479        14%
<25% and >=10%  2125        20.5%
<50% and >=25%  1428        13.6%
>=50%           755         7.2%
31. Mainstream crawler and actual data
Made a new stream with FILTER_KEYWORDS = ['changi airport', 'fansofchangi', 'cineleisure orchard', 'vivo city', 'ion orchard', 'causewaypoint', 'woodlands checkpoint', 'gardensbythebay', 'bugisjunction', 'far east plaza', 'itecollegeeast', 'ite college west', 'ite college central'] and a few variations of them.
Got around 4.1K tweets from the new stream.
In the same time frame, 20K tweets were collected by Mainstream.
20% hit rate (20% of the new stream's tweets are in Mainstream).
Recall that Mainstream is the stream of geotagged tweets from Singapore.
1134 (27.5%) of the 4.1K tweets are geotagged, and 834 (20%) are found in Mainstream.
Of the 300 (7.5%) geotagged tweets not found in Mainstream:
• 31 tweets are outside Singapore
• 279 tweets are inside Singapore
Out of the 4.1K tweets, only 2.5K show a positive location in our location detector.
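The hit rate quoted on this slide is the share of the new stream's tweets that also appear in Mainstream. A minimal sketch, assuming both streams are available as collections of tweet ID strings (the function name `hit_rate` is made up for this example):

```python
def hit_rate(new_stream_ids, mainstream_ids):
    """Percentage of new-stream tweets that are also present in Mainstream."""
    new_ids = set(new_stream_ids)
    if not new_ids:
        return 0.0
    overlap = new_ids & set(mainstream_ids)
    return 100.0 * len(overlap) / len(new_ids)
```

Matching on tweet IDs avoids counting the same tweet twice when a stream delivers duplicates.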
32. Emotion identification of tweets
Have a list of 8222 emotion words, each classified as positive, negative, or neutral and as strongly or weakly subjective.
Have a list of 1500 emoji.
Have a data set of tweets covering around 200 days, from Oct 30th 2014 to May 5th 2015.
Around 24% of tweets contain at least one emoji.
Around 54% of tweets contain at least one of the 8222 words.
Around 64% of tweets contain at least one word or emoji (the union of the two cases above).
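Coverage figures of this kind can be computed as sketched below. The tokenization (lowercased runs of letters and apostrophes) and the plain substring test for emoji are assumptions of this example; `emotion_words` and `emojis` stand for the two lists mentioned above.

```python
import re


def coverage(tweets, emotion_words, emojis):
    """Percentage of tweets containing an emoji, an emotion word, or either."""
    with_emoji = with_word = with_either = 0
    for text in tweets:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        has_word = not tokens.isdisjoint(emotion_words)
        has_emoji = any(e in text for e in emojis)
        with_emoji += has_emoji
        with_word += has_word
        with_either += has_word or has_emoji
    total = len(tweets)
    if total == 0:
        return {"emoji": 0.0, "word": 0.0, "either": 0.0}
    return {
        "emoji": 100.0 * with_emoji / total,
        "word": 100.0 * with_word / total,
        "either": 100.0 * with_either / total,
    }
```

The "either" share is at most the sum of the other two, because tweets with both a word and an emoji are counted only once.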
43. A few points
Spikes in the graphs are generally caused by events, festivals, weekends, or holidays.
3rd December has a spike because there was an event by EXO in Singapore (found via word count).
44. Unexplained spikes in the graph
On a few days, the higher number of tweets per day remains unexplained (8-3-15).
Ran a word counter on tweets around 8-3-15, using the stop-word list from mysql.com.
Found a different issue:
• The 2nd most frequent token is the character @.
• @ and # tags are generally important tags.
@[total] 39.2%
@[space] 11.5%
@[nospace] 27.7%
[Figures: day around 8th March; day around 13th March]
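A sketch of the kind of word count that surfaces this issue, plus a tally of "@" followed by a space versus "@" followed directly by text. Whitespace splitting and counting occurrences (rather than tweets) are assumptions here; the slide does not state what the percentages are relative to.

```python
import re
from collections import Counter


def word_counts(tweets, stop_words):
    """Count whitespace-separated tokens, ignoring stop words."""
    counts = Counter()
    for text in tweets:
        counts.update(w for w in text.lower().split() if w not in stop_words)
    return counts


def at_sign_usage(tweets):
    """Count '@' occurrences by whether a space or text follows."""
    usage = Counter(space=0, nospace=0)
    for text in tweets:
        for match in re.finditer(r"@(.?)", text):
            following = match.group(1)
            if following == "" or following.isspace():
                usage["space"] += 1
            else:
                usage["nospace"] += 1
    usage["total"] = usage["space"] + usage["nospace"]
    return usage
```

With whitespace splitting, a check-in style tweet such as "I'm at @ Marina Bay Sands" contributes a bare "@" token, which is how "@" can rank so high in a word count.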
45. Examples
Challenge: how to combine variants of the same place/location, such as Marina Bay Sands and MarinaBaySands?
• @marinabaysands: 998 tweets
• @ Marina Bay Sands: 2529 tweets
• @McDonald: 20 tweets
• @ McDonald: 1099 tweets
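One possible way to merge such variants, shown only as a sketch and not as the approach adopted in the project, is to reduce each mention to a key by lowercasing it and dropping everything except letters and digits. The helper names are made up for this example.

```python
import re
from collections import Counter


def place_key(mention):
    """Collapse '@ Marina Bay Sands' and '@marinabaysands' to one key."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def merge_place_counts(raw_counts):
    """Sum tweet counts of mentions that share the same key."""
    merged = Counter()
    for mention, n in raw_counts.items():
        merged[place_key(mention)] += n
    return merged


raw = {
    "@marinabaysands": 998,
    "@ Marina Bay Sands": 2529,
    "@McDonald": 20,
    "@ McDonald": 1099,
}
merged = merge_place_counts(raw)
# merged["marinabaysands"] == 3527 and merged["mcdonald"] == 1119
```

This only works when the handle spells out the place name. A tweet containing both forms would be counted twice, and different places whose names collapse to the same key would be merged by mistake.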