Internship

User Classification
Approach:
• Classified current users into 4 categories
• Treated one category as positive, one as negative and two as unknown
• Collected the latest 1000 tweets from each of approx. 2000 users (1000 users per class)
• Built a classifier using the bag-of-words technique

Categories
• Used User Location and User TimeZone for categorizing
• 4 types

Type  Location                                          TimeZone   Percentage
1     Empty                                             Singapore  31
2     Singapore/sg/+65/spore/s'pore/place in Singapore  anything   44
3     City/country other than Singapore                 Singapore  13
4     Random text                                       Singapore  11
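The four types above can be sketched as a small rule-based function. This is only an illustration: the slides do not give the exact matching rules, so the marker tuple and the toy list of other places below are hypothetical, and a real run would need a proper gazetteer.

```python
# Hypothetical marker lists; the exact matching rules are not given in the slides.
SINGAPORE_MARKERS = ("singapore", "spore", "s'pore", "+65")
OTHER_PLACES = ("malaysia", "jakarta", "london", "new york")  # toy gazetteer

# Labels as used in the approach: types 1 and 4 unknown, 2 positive, 3 negative.
LABEL = {1: "unknown", 2: "positive", 3: "negative", 4: "unknown"}


def user_type(location, time_zone):
    """Return the type (1-4) of a user, or None if the table does not cover it."""
    loc = (location or "").strip().lower()
    tz = (time_zone or "").strip().lower()
    tokens = loc.replace(",", " ").split()
    # Type 2: location names Singapore, time zone can be anything.
    if any(marker in loc for marker in SINGAPORE_MARKERS) or "sg" in tokens:
        return 2
    # The remaining types all require the Singapore time zone.
    if tz != "singapore":
        return None
    if not loc:
        return 1  # empty location
    if any(place in loc for place in OTHER_PLACES):
        return 3  # a city/country other than Singapore
    return 4      # random text


print(user_type("Jurong, SG", "Pacific Time"), LABEL[2])
```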

• Treated types 1 and 4 as unknown, type 2 as positive and type 3 as negative
• Collected 1000 tweets from each of 1000 users of types 2 and 3 (data collection took over a day)
• Used the sklearn package to build the classifier
• Used sklearn's stop-word removal and our own tokenizer
• 80% of the data as the train set and 20% as the test set
• Used the svm.LinearSVC classifier
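A minimal sketch of such a pipeline with scikit-learn is shown below. It assumes one text document per user (for example, that user's tweets joined together) and a label per user. The tokenizer `simple_tokenizer` is a hypothetical stand-in for the custom tokenizer, and the built-in English stop-word list is an assumption about which stop words were removed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def simple_tokenizer(text):
    # Hypothetical stand-in for the custom tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def train_user_classifier(documents, labels, seed=0):
    """Train a bag-of-words LinearSVC on an 80/20 split; return (model, test accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        documents, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    model = make_pipeline(
        CountVectorizer(tokenizer=simple_tokenizer, stop_words="english"),
        LinearSVC(),
    )
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


# Toy data: 1 = positive (type 2), 0 = negative (type 3).
docs = ["mrt hawker lah orchard"] * 10 + ["subway bagel brooklyn yankees"] * 10
labels = [1] * 10 + [0] * 10
model, accuracy = train_user_classifier(docs, labels)
print(accuracy, model.predict(["hawker lah"]))
```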

For 625K (all) users
The following slides are zoomed in

• Out of 2989 users in the above region, 1713 were scanned.
• The above region is expected to have a high number of bots.
• Users are classified using Bot or Not.
• The region is 1900-2100 friends vs 0-2000 followers.
• Scanned only users expected to be non-protected and to have more than 100 tweets (2100 users, but 400 failed).
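The selection described above can be written as a simple filter. The sketch assumes each stored user is a dict carrying the standard Twitter user-object fields `friends_count`, `followers_count`, `protected` and `statuses_count`; how the users were actually stored is not stated in the slides.

```python
def in_scan_region(user):
    """True if the user falls in the scanned region and is expected to be scannable."""
    return (
        1900 <= user["friends_count"] <= 2100      # region: 1900-2100 friends
        and 0 <= user["followers_count"] <= 2000   # region: 0-2000 followers
        and not user["protected"]                  # expected non-protected
        and user["statuses_count"] > 100           # expected above 100 tweets
    )


candidate = {"friends_count": 2001, "followers_count": 35,
             "protected": False, "statuses_count": 450}
print(in_scan_region(candidate))
```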

The first 677 users in the DB were tested by Bot or Not

Number of Protected Users
Count of Tweets  Protected sg  Not protected sg  Protected all  Not protected all
< 100            18 978        85 328            43 909         131 125
>= 100           71 141        99 484            201 241        248 970
total            80 219        184 812           245 150        380 095
Overall total: 265 031 (sg), 625 245 (all)

Bot or Not Test by Truthy
• Out of the 99 484 users that are probably non-protected and have at least 100 tweets, 85 280 were given a Truthy score.
• The pie chart represents the distribution of users by their bot-likelihood score.

What happens when we follow users? (20K users)
Type                             Percentage  JSON format
User is sender of the tweet      71          ['tweet']['user']['id_str']
User's tweet has been retweeted  7.67        ['tweet']['retweeted_status']: ['in_reply_to_user_id_str'] && ['user']['id_str']
User has been replied to         11.2        ['tweet']['in_reply_to_user_id_str']
User mentions and unknown        10          -

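What happens when we follow users?
24 June - 9 July (20K users*)
Type                             Percentage   JSON format
User is sender of the tweet      55.2 (230K)  ['tweet']['user']['id_str']
User's tweet has been retweeted  14.6         ['tweet']['retweeted_status']: ['in_reply_to_user_id_str'] && ['user']['id_str']
User has been replied to         8.78         ['tweet']['in_reply_to_user_id_str']
User mention                     0.23         ['tweet']['entities']['user_mentions']
Unknown                          20.9         -
* These 20K users are different from the users on the previous slide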
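The categories in the table above can be derived from each received record by checking the listed JSON fields in order. The sketch below is one possible reading, not the original script: it assumes each record wraps the tweet under a `'tweet'` key, that the retweet case is detected through `['retweeted_status']['user']['id_str']`, and that entries in `user_mentions` carry an `id_str` field as in the Twitter API.

```python
from collections import Counter


def delivery_reason(tweet, followed_ids):
    """Explain why a tweet showed up in a follow stream for the given user ids."""
    if tweet["user"]["id_str"] in followed_ids:
        return "sender"
    retweeted = tweet.get("retweeted_status")
    if retweeted and retweeted["user"]["id_str"] in followed_ids:
        return "retweeted"
    if tweet.get("in_reply_to_user_id_str") in followed_ids:
        return "replied_to"
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m["id_str"] in followed_ids for m in mentions):
        return "mentioned"
    return "unknown"


def reason_percentages(records, followed_ids):
    """Percentage of records per category; each record is {'tweet': {...}}."""
    counts = Counter(delivery_reason(r["tweet"], followed_ids) for r in records)
    total = sum(counts.values())
    return {reason: 100.0 * n / total for reason, n in counts.items()}


followed = {"1"}
records = [
    {"tweet": {"user": {"id_str": "1"}}},
    {"tweet": {"user": {"id_str": "9"}, "in_reply_to_user_id_str": "1"}},
]
print(reason_percentages(records, followed))
```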

Some statistics on 221K tweets (known Category) (cont.)
• 20K users were followed, but tweets were received from only 3752 distinct users. Only 221K of the 230K tweets were analyzed.
Count of tweets            Field
40.9K                      In_reply_to_user_id = not null
37.5K                      In_reply_to_status_id = not null
1938 (360 distinct users)  Geo = True
3363                       In_reply_to_user_id not null but In_reply_to_status_id null
90.2K                      Retweets

(known Category, cont.)
• 131K tweets remain from the previous slide's 221K tweets after removing retweets
• 3533 distinct users tweeted in the 131K
* value same as on the previous slide
40.9K*                      In_reply_to_user_id = not null
37.5K*                      In_reply_to_status_id = not null
1938 (360 distinct users)*  Geo = True
3363*                       In_reply_to_user_id not null but In_reply_to_status_id null

(known Category, cont.)
Count of tweets  Number of users  Source
43 214           1331             Twitter for iPhone
38 256           1079             Twitter for Android
12 577           974              Twitter Web Client
5 658            1016             Instagram
3 420            256              Facebook
3 066            67               TweetDeck
...              ...              ...
2005             1                AFF Autotweet
...              ...              ...

(known Category, cont.) Tweets mentioning a URL
Count  Users  Tweet domain
96K    2681   Null (no URL mentioned)
5.9K   794    Twitter.com
5.7K   1037   Instagram

URLs per tweet        Number of tweets
1 only                34 852
2 only                481
3 only                24
4 only (the maximum)  1

Out of 34.8K tweets with a URL, 15K have a URL domain that differs from the actual domain.
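The domain comparison in the last line can be illustrated with the standard library. The slides do not say how the "actual domain" was obtained, so the sketch assumes the final (resolved) URL is already known, for example from following a shortener's redirect, and only compares the two domains.

```python
from urllib.parse import urlparse


def domain(url):
    """Lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domains_differ(stated_url, resolved_url):
    """True if the URL in the tweet and the resolved URL point to different domains."""
    return domain(stated_url) != domain(resolved_url)


# Assumed input: (URL as it appears in the tweet, URL after resolving redirects).
pairs = [
    ("http://bit.ly/1abc", "https://instagram.com/p/xyz"),
    ("https://twitter.com/a", "https://www.twitter.com/a"),
]
print(sum(domains_differ(stated, resolved) for stated, resolved in pairs))
```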

2M geotagged tweets collected from Oct 30th
Source               Tweet Count  Percentage
Twitter for iPhone   817K         40.8%
Twitter for Android  641K         32%
Instagram            265K         13.3%
Foursquare           193K         9.7%
Others               83.5K        4.2%

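What happens when we follow users?
From 3 July - 9 July (20K users^)
Type                             Percentage  JSON format
User is sender of the tweet      52 (76.5K)  ['tweet']['user']['id_str']
User's tweet has been retweeted  16.8        ['tweet']['retweeted_status']: ['in_reply_to_user_id_str'] && ['user']['id_str']
User has been replied to         8.9         ['tweet']['in_reply_to_user_id_str']
User mentions and unknown        25.5        -
^ Same 20K users as on the previous slide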

Some statistics on 76.5K tweets (known Category)
• 20K users were followed over 5 days, but tweets were received from only 2950 users.
13.4K                     In_reply_to_user_id = not null
12.3K                     In_reply_to_status_id = not null
720 (231 distinct users)  Geo = True
1099                      In_reply_to_user_id not null but In_reply_to_status_id null

From 3 July - 9 July
• Out of 76.5K tweets, only 720 (0.94%) are geotagged
• Out of 76.5K tweets, 7K showed a positive location (type 1 or 2)
• Out of the 720 geotagged tweets, 330 showed a positive location (type 1 or 2)
• Out of the 720 geotagged tweets, 175 showed a positive location (type 2 only)
• About 60 of the 76.5K tweets are duplicates

Two months
• Started collecting user-timeline tweets of unknown-category users from April 28th 2015.
• Ran our location detection on about 1.4M tweets
• 6% of the tweets showed a positive location
• Format: name, number of times, number of users
• The displayed statistics cover about 2695 users

Some statistics on 11M tweets (Unknown Category)
• 26K users over two months
2.05M                        In_reply_to_user_id = not null
1.96M                        In_reply_to_status_id = not null
115K (6913 distinct users)   Geo = True
94.7K                        In_reply_to_user_id not null but In_reply_to_status_id null

Out of 6913 users (Unknown Category)
Geo tweets      User Count  User Percent  Min 30 tweets Count
<1%             2293        33%           2293
<2% and >=1%    902         13%           902
<5% and >=2%    1260        18.2%         1210
<10% and >=5%   813         11.7%         743
<25% and >=10%  794         11.4%         636
<50% and >=25%  462         6.6%          296
>=50%           389         5.6%          174
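The bins in the table above can be computed per user from two counts, the number of geotagged tweets and the total number of tweets. The sketch below only reproduces the bin boundaries of the table; the function and variable names are illustrative.

```python
import bisect

# Upper edges (in percent) of all bins except the last one.
BIN_EDGES = [1, 2, 5, 10, 25, 50]
BIN_LABELS = [
    "<1%", "<2% and >=1%", "<5% and >=2%", "<10% and >=5%",
    "<25% and >=10%", "<50% and >=25%", ">=50%",
]


def geo_bin(geo_tweets, total_tweets):
    """Return the table bin for a user's share of geotagged tweets."""
    share = 100.0 * geo_tweets / total_tweets
    # bisect_right puts a value equal to an edge into the higher bin (>= edge).
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, share)]


print(geo_bin(30, 200))
```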

Few statistics on 5.96M tweets of known Singaporeans
From about 31K users, with at most the last 200 tweets per user
Count of tweets      Field
1.25M                In_reply_to_user_id = not null
1.15M                In_reply_to_status_id = not null
302K (10421 users)   Geo = True
101K                 In_reply_to_user_id not null but In_reply_to_status_id null

Out of 10421 users(known Category)
Geo tweets User Count User Percent
<1% 1272 12.2%
<2% and >=1% 1434 13.7%
<5% and >=2% 1928 18.5%
<10% and >=5% 1479 14%
<25% and >=10% 2125 20.5%
<50% and >=25% 1428 13.6%
>=50% 755 7.2%

Mainstream crawler and actual data
• Made a new stream with FILTER_KEYWORDS = ['changi airport', 'fansofchangi', 'cineleisure orchard', 'vivo city', 'ion orchard', 'causewaypoint', 'woodlands checkpoint', 'gardensbythebay', 'bugisjunction', 'far east plaza', 'itecollegeeast', 'ite college west', 'ite college central'] and a few of their variations
• Got around 4.1K tweets from the new stream
• In the same time frame, 20K tweets were collected by Mainstream
• 20% hit rate (20% of the new stream's tweets are in Mainstream)
• Recall that Mainstream collects the geotagged tweets of Singapore
• 1134 (27.5%) of the 4.1K tweets are geotagged, and 834 (20%) are found in Mainstream
• Of the 300 (7.5%) geotagged tweets not found in Mainstream:
  • 31 tweets are outside Singapore
  • 279 tweets are inside Singapore
• Out of the 4.1K tweets, only 2.5K show a positive location in our location detector
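The hit rate above is the share of the new stream's tweets that also appear in Mainstream. A minimal sketch, assuming both collections can be reduced to sets of tweet ids (the function name is illustrative):

```python
def hit_rate(new_stream_ids, mainstream_ids):
    """Fraction of tweets in the new stream that are also present in Mainstream."""
    new_ids = set(new_stream_ids)
    if not new_ids:
        return 0.0
    return len(new_ids & set(mainstream_ids)) / len(new_ids)


# Toy ids: 1 of the 5 new-stream tweets is also in Mainstream.
print(hit_rate(["a", "b", "c", "d", "e"], ["e", "x", "y"]))

# With the figures from the slide: 834 of about 4.1K tweets.
print(round(100 * 834 / 4100, 1))
```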

Emotion Identification of Tweets
• Have a list of 8222 emotion words, each classified as positive, negative or neutral and as strongly or weakly subjective.
• Have a list of 1500 emoji
• Have a data set of tweets covering around 200 days, from Oct 30th 2014 to May 5th 2015
• Around 24% of tweets contain at least one emoji
• Around 54% of tweets contain at least one of the 8222 words
• Around 64% of tweets contain at least one word or emoji (union of the above two cases)
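The three coverage figures above can be computed in one pass over the tweets. The sketch assumes whitespace tokenization and exact word matches, which may differ from the matching actually used; the tiny word and emoji sets are placeholders for the 8222-word and 1500-emoji lists.

```python
def lexicon_coverage(tweets, emotion_words, emojis):
    """Percentage of tweets containing a lexicon word, an emoji, or either."""
    n_word = n_emoji = n_either = 0
    for text in tweets:
        tokens = set(text.lower().split())
        has_word = not tokens.isdisjoint(emotion_words)
        has_emoji = any(e in text for e in emojis)
        n_word += has_word
        n_emoji += has_emoji
        n_either += has_word or has_emoji
    total = len(tweets) or 1  # avoid division by zero on empty input
    return {
        "word": 100.0 * n_word / total,
        "emoji": 100.0 * n_emoji / total,
        "either": 100.0 * n_either / total,
    }


sample = ["so happy today", "rain again 😢", "happy 😀", "just lunch"]
print(lexicon_coverage(sample, {"happy", "sad"}, {"😀", "😢"}))
```

Because a tweet can contain both a word and an emoji, the union is smaller than the sum of the two separate percentages, as in the figures above (64% versus 24% + 54%).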

Saturdays are highlighted by a dot

Daily Basis
Saturdays are highlighted by a dot

Few Points
• Scripted accounts

Few Points
• Spikes in the graphs are generally caused by events, festivals, weekends or holidays
• 3rd of December has a spike because there was an event by EXO in Singapore (found through word count)

Unexplained Spikes in Graph
• On a few days, the higher number of tweets per day remains unexplained (8-3-15).
• Ran a word counter around 8-3-15, using the stop words from mysql.com
• Found a different issue instead:
• The character @ takes 2nd place in the word count
• @ and # tags are generally important tags
• @[total] 39.2%
• @[space] 11.5%
• @[nospace] 27.7%
[Figures: a day around 8th March; a day around 13th March]
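The split between "@" followed by a space and "@" followed directly by text can be counted with two regular expressions. This is a sketch of one way to do it; the slides do not state what the percentages are measured against, so the function returns raw counts only.

```python
import re

AT_SPACE = re.compile(r"@(?=\s)")    # "@ Marina Bay Sands"
AT_NOSPACE = re.compile(r"@(?=\S)")  # "@marinabaysands"


def count_at_usage(tweets):
    """Count '@' followed by whitespace vs. '@' followed by a non-space character.

    An '@' at the very end of a tweet matches neither pattern and is ignored.
    """
    space = sum(len(AT_SPACE.findall(t)) for t in tweets)
    nospace = sum(len(AT_NOSPACE.findall(t)) for t in tweets)
    return {"space": space, "nospace": nospace, "total": space + nospace}


print(count_at_usage(["I'm at @ Marina Bay Sands",
                      "hi @marinabaysands and @ McDonald"]))
```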

Examples
Challenge: how to combine places/locations etc., such as Marina Bay Sands and MarinaBaySands?
• @marinabaysands 998 tweets
• @ Marina Bay Sands 2529 tweets
• @McDonald 20 tweets
• @ McDonald 1099 tweets

Done By:
Gadi Venkata Sai Rahul
May 21 2015 - June 22 2015
