Internship
User Classification
Approach:
• Classified the current users into 4 categories
• Treated one category as positive, one as negative, and two as unknown
• Collected the latest 1000 tweets from each of approximately 2000 users (1000 users per labeled category)
• Built a classifier using the bag-of-words technique
Categories
• Used User Location and User TimeZone for categorizing
• 4 types:
Type | Location | TimeZone | Percentage
1 | Empty | Singapore | 31
2 | Singapore/sg/+65/spore/s'pore/place in Singapore | anything | 44
3 | City/country other than Singapore | Singapore | 13
4 | Random text | Singapore | 11
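The four types in the table could be assigned with a rule like the one sketched below. This is only an illustration: the regular expression `SG_PATTERN` and the tiny gazetteer `KNOWN_PLACES` are hypothetical stand-ins, since the slides do not say how "place in Singapore" or "city/country other than Singapore" were actually recognized.

```python
import re

# Hypothetical pattern for Singapore-like location strings (type 2).
SG_PATTERN = re.compile(r"singapore|\bsg\b|\+65|s'?pore", re.IGNORECASE)

# Hypothetical gazetteer of non-Singapore places (type 3); a real one would be far larger.
KNOWN_PLACES = {"london", "jakarta", "kuala lumpur", "india", "new york"}


def categorize(location, time_zone):
    """Return the type (1-4) from the table, or None if no row applies."""
    location = (location or "").strip()
    if SG_PATTERN.search(location):
        return 2  # Singapore-like location, any time zone
    if time_zone != "Singapore":
        return None  # the remaining rows all require a Singapore time zone
    if not location:
        return 1  # empty location
    if location.lower() in KNOWN_PLACES:
        return 3  # a city/country other than Singapore
    return 4  # random text
```

For example, `categorize("S'pore", None)` gives type 2, while `categorize("London", "Singapore")` gives type 3 under the toy gazetteer.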
• Considered types 1 and 4 as unknown, type 2 as positive, and type 3 as negative
• Collected 1000 tweets from each of 1000 users of types 2 and 3 (data collection took over a day)
• Used the sklearn package to build the classifier
• Used sklearn's stop-word removal and our own tokenizer
• Used 80% of the data as the training set and 20% as the test set
• Used the svm.LinearSVC classifier
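A minimal sketch of such a pipeline follows, assuming each user is represented by one document made of their concatenated tweets. The original tokenizer is not shown in the slides, so scikit-learn's default tokenizer stands in for it here, and the English stop-word list is also an assumption. The function names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_classifier():
    """Bag-of-words features followed by a linear SVM."""
    # Assumption: built-in English stop words and the default tokenizer.
    vectorizer = CountVectorizer(stop_words="english")
    return make_pipeline(vectorizer, LinearSVC())


def train_and_score(docs, labels, seed=0):
    """Fit on 80% of the users and return (model, accuracy on the other 20%)."""
    train_docs, test_docs, train_y, test_y = train_test_split(
        docs, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    model = build_classifier()
    model.fit(train_docs, train_y)
    return model, model.score(test_docs, test_y)
```

Here `docs` would hold one string per user and `labels` would be 1 for type 2 (positive) and 0 for type 3 (negative).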
[Figures: plots for all 625K users; the slides after the first are zoomed-in views.]
Out of 2989 users in the above region, 1713 were scanned.
• The above region is expected to contain a high number of bots.
• Users were classified using Bot or Not.
• The region is 1900-2100 friends vs. 0-2000 followers.
• Scanned only users expected to be non-protected and to have more than 100 tweets (2100, of which 400 failed).
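The selection described above can be written as a simple predicate over a Twitter user object. This is a sketch, assuming the standard user fields `friends_count`, `followers_count`, `protected` and `statuses_count`, and assuming the region bounds are inclusive (the slides do not say).

```python
def in_scan_region(user):
    """True if the user falls in the suspected-bot region and is scannable."""
    return (
        1900 <= user["friends_count"] <= 2100      # region: 1900-2100 friends
        and 0 <= user["followers_count"] <= 2000   # region: 0-2000 followers
        and not user["protected"]                  # protected timelines cannot be read
        and user["statuses_count"] > 100           # more than 100 tweets
    )


def select_for_scan(users):
    """Keep only the users worth sending to the bot detector."""
    return [u for u in users if in_scan_region(u)]
```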
The first 677 users in the DB were tested with Bot or Not.
Number of Protected Users
Count of tweets | Protected sg | Not protected sg | Protected all | Not protected all
< 100 | 18 978 | 85 328 | 43 909 | 131 125
>= 100 | 71 141 | 99 484 | 201 241 | 248 970
total | 80 219 | 184 812 | 245 150 | 380 095
Overall: 265 031 sg users; 625 245 users in all
Bot or Not Test by Truthy
• Out of the 99 484 users that are probably non-protected and have at least 100 tweets, 85 280 were scored with the Truthy score
• The pie chart shows the distribution of users by their bot-likelihood score
What happens when we follow users? (20K users)
Type | JSON field | Percentage
User is the sender of the tweet | ['tweet']['user']['id_str'] | 71
User's tweet has been retweeted | ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str'] | 7.67
User has been replied to | ['tweet']['in_reply_to_user_id_str'] | 11.2
User mentions and unknown | - | 10
What happens when we follow users?
24 June - 9 July (20K users*)
Type | JSON field | Percentage
User is the sender of the tweet | ['tweet']['user']['id_str'] | 55.2 (230K)
User's tweet has been retweeted | ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str'] | 14.6
User has been replied to | ['tweet']['in_reply_to_user_id_str'] | 8.78
User mention | ['tweet']['entities']['user_mentions'] | 0.23
Unknown | - | 20.9
* These 20K users are different from the users on the previous slide
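The breakdown in the table can be reproduced by checking the listed JSON fields against the set of followed user IDs. The sketch below assumes each stream message is a dict with a top-level `'tweet'` key, as the field paths in the table suggest. The order of the checks and the function name `match_type` are assumptions, not taken from the slides.

```python
def match_type(msg, followed_ids):
    """Explain why a streamed message matched one of the followed users."""
    tweet = msg["tweet"]
    if tweet["user"]["id_str"] in followed_ids:
        return "sender"
    retweet = tweet.get("retweeted_status")
    if retweet and (
        retweet["user"]["id_str"] in followed_ids
        or retweet.get("in_reply_to_user_id_str") in followed_ids
    ):
        return "retweeted"
    if tweet.get("in_reply_to_user_id_str") in followed_ids:
        return "replied_to"
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m.get("id_str") in followed_ids for m in mentions):
        return "mentioned"
    return "unknown"
```

Counting the returned labels with `collections.Counter` over all received messages would give percentages comparable to the ones in the table.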
Some statistics on 221K tweets (known category) (cont.)
• 20K users were followed, but tweets were received from only 3752 distinct users. Only 221K of the 230K tweets were analyzed.
Count of tweets | Field
40.9K | In_reply_to_user_id = not null
37.5K | In_reply_to_status_id = not null
1938 (360 distinct users) | Geo = True
3363 | In_reply_to_user_id set but In_reply_to_status_id not set
90.2K | Retweets
Some statistics on 131K tweets (known category)
• The 131K tweets are the 221K tweets from the previous slide with retweets removed
• The 131K tweets come from 3533 distinct users
* Value is the same as on the previous slide
Count of tweets | Field
40.9K* | In_reply_to_user_id = not null
37.5K* | In_reply_to_status_id = not null
1938 (360 distinct users)* | Geo = True
3363* | In_reply_to_user_id set but In_reply_to_status_id not set
Some statistics on 131K tweets (known category)
Count of tweets | Number of users | Source
43 214 | 1331 | Twitter for iPhone
38 256 | 1079 | Twitter for Android
12 577 | 974 | Twitter Web Client
5 658 | 1016 | Instagram
3 420 | 256 | Facebook
3 066 | 67 | TweetDeck
... | ... | ...
2005 | 1 | AFF Autotweet
... | ... | ...
Some statistics on 131K tweets (known category): tweets mentioning a URL
Count | Users | Tweet domain
96K | 2681 | Null (no URL mentioned)
5.9K | 794 | Twitter.com
5.7K | 1037 | Instagram
URLs per tweet | Number of tweets
1 only | 34 852
2 only | 481
3 only | 24
4 only (the maximum) | 1
Out of the 34.8K tweets with a URL, 15K have a URL domain that differs from the actual domain.
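A domain comparison of this kind can be sketched with the standard library. How the "actual" domain was obtained is not stated in the slides, so the sketch simply assumes two URL strings are already available for each link (for example the URL as it appears in the tweet and the URL after following redirects). Both function names are hypothetical.

```python
from urllib.parse import urlparse


def domain(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domains_differ(tweet_url, actual_url):
    """True if the link in the tweet points at a different domain than its target."""
    return domain(tweet_url) != domain(actual_url)
```

Stripping `www.` is a design choice so that `www.instagram.com` and `instagram.com` count as the same domain.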
Geotagged tweets: 1938
2M geotagged tweets collected from Oct 30th
Source | Tweet count | Percentage
Twitter for iPhone | 817K | 40.8%
Twitter for Android | 641K | 32%
Instagram | 265K | 13.3%
Foursquare | 193K | 9.7%
Others | 83.5K | 4.2%
What happens when we follow users?
From 3 July - 9 July (20K users^)
Type | JSON field | Percentage
User is the sender of the tweet | ['tweet']['user']['id_str'] | 52 (76.5K)
User's tweet has been retweeted | ['tweet']['retweeted_status']['in_reply_to_user_id_str'] && ['tweet']['retweeted_status']['user']['id_str'] | 16.8
User has been replied to | ['tweet']['in_reply_to_user_id_str'] | 8.9
User mentions and unknown | - | 25.5
^ Same 20K users as on the previous slide
Some statistics on 76.5K tweets (known category)
• 20K users were followed over 5 days, but tweets were received from only 2950 users.
Count of tweets | Field
13.4K | In_reply_to_user_id = not null
12.3K | In_reply_to_status_id = not null
720 (231 distinct users) | Geo = True
1099 | In_reply_to_user_id set but In_reply_to_status_id not set
From 3 July - 9 July
• Out of the 76.5K tweets, only 720 (0.94%) are geotagged
• Out of the 76.5K tweets, 7K showed a positive location (type 1 or 2)
• Out of the 720 geotagged tweets, 330 showed a positive location (type 1 or 2)
• Out of the 720 geotagged tweets, 175 showed a positive location (type 2 only)
• About 60 of the 76.5K tweets are duplicates
Two months
• Started collecting tweets (user timelines) of unknown-category users from April 28th 2015
• Used about 1.4M tweets for our location detection
• 6% of the tweets showed a positive location
• Format: name, number of times, number of users
• The displayed statistics cover about 2695 users
Some statistics on 11M tweets (unknown category)
• 26K users over two months
Count of tweets | Field
2.05M | In_reply_to_user_id = not null
1.96M | In_reply_to_status_id = not null
115K (6913 distinct users) | Geo = True
94.7K | In_reply_to_user_id set but In_reply_to_status_id not set
Out of 6913 users (unknown category)
Geo tweets | User count | User percent | User count (min. 30 tweets)
<1% | 2293 | 33% | 2293
<2% and >=1% | 902 | 13% | 902
<5% and >=2% | 1260 | 18.2% | 1210
<10% and >=5% | 813 | 11.7% | 743
<25% and >=10% | 794 | 11.4% | 636
<50% and >=25% | 462 | 6.6% | 296
>=50% | 389 | 5.6% | 174
A few statistics on 5.96M tweets from known Singaporeans
Count of tweets | Field
1.25M | In_reply_to_user_id = not null
1.15M | In_reply_to_status_id = not null
302K (10421 users) | Geo = True
101K | In_reply_to_user_id set but In_reply_to_status_id not set
From about 31K users, with at most the last 200 tweets per user
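The per-user tables in this part of the deck group users by the share of their tweets that are geotagged. A sketch of that bucketing is shown below, assuming each user is given as a `(geo_tweets, total_tweets)` pair. The `min_tweets` parameter is one reading of the "min 30 tweets" column and, like the function names, is an assumption.

```python
from collections import Counter

# Upper bounds (exclusive) and labels, matching the rows of the tables.
BUCKETS = [
    (0.01, "<1%"),
    (0.02, "<2% and >=1%"),
    (0.05, "<5% and >=2%"),
    (0.10, "<10% and >=5%"),
    (0.25, "<25% and >=10%"),
    (0.50, "<50% and >=25%"),
]


def bucket(geo_tweets, total_tweets):
    """Label for the share of a user's tweets that are geotagged."""
    share = geo_tweets / total_tweets
    for upper, label in BUCKETS:
        if share < upper:
            return label
    return ">=50%"


def geo_histogram(users, min_tweets=1):
    """Count users per bucket, skipping users with fewer than min_tweets tweets."""
    counts = Counter()
    for geo_tweets, total_tweets in users:
        if total_tweets >= min_tweets:
            counts[bucket(geo_tweets, total_tweets)] += 1
    return counts
```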
Out of 10421 users (known category)
Geo tweets | User count | User percent
<1% | 1272 | 12.2%
<2% and >=1% | 1434 | 13.7%
<5% and >=2% | 1928 | 18.5%
<10% and >=5% | 1479 | 14%
<25% and >=10% | 2125 | 20.5%
<50% and >=25% | 1428 | 13.6%
>=50% | 755 | 7.2%
Mainstream crawler and actual data
• Made a new stream with FILTER_KEYWORDS = ['changi airport', 'fansofchangi', 'cineleisure orchard', 'vivo city', 'ion orchard', 'causewaypoint', 'woodlands checkpoint', 'gardensbythebay', 'bugisjunction', 'far east plaza', 'itecollegeeast', 'ite college west', 'ite college central'] and a few variations of them
• Got around 4.1K tweets from the new stream
• In the same time frame, 20K tweets were collected by Mainstream
• 20% hit rate (20% of the new stream's tweets are in Mainstream)
• Recall that Mainstream is the stream of geotagged tweets from Singapore
• 1134 (27.5%) of the 4.1K tweets are geotagged, and 834 (20%) are found in Mainstream
• Out of the 300 (7.5%) geotagged tweets not found in Mainstream:
  • 31 tweets are outside Singapore
  • 279 tweets are inside Singapore
• Out of the 4.1K tweets, only 2.5K show a positive location in our location detector
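The hit rate quoted above is the fraction of the new stream's tweets that also appear in Mainstream. A minimal sketch follows, assuming the two streams are compared by tweet ID (the slides do not say how tweets were matched). The function name is hypothetical.

```python
def hit_rate(new_stream_ids, mainstream_ids):
    """Fraction of tweets in the new stream that are also in Mainstream."""
    new_ids = set(new_stream_ids)
    if not new_ids:
        return 0.0
    return len(new_ids & set(mainstream_ids)) / len(new_ids)
```

With the figures from the slide, 834 shared tweets out of roughly 4.1K gives a hit rate of about 0.20.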
Emotion Identification of Tweets
• Have a list of 8222 emotion words, each classified as positive, negative or neutral and as strongly or weakly subjective
• Have a list of 1500 emoji
• Have a data set of tweets covering around 200 days, from Oct 30th 2014 to May 5th 2015
• Around 24% of the tweets contain at least one emoji
• Around 54% of the tweets contain at least one of the 8222 words
• Around 64% of the tweets contain a word or an emoji (union of the two cases above)
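The three coverage figures above can be computed in one pass over the tweets. The sketch below assumes the word list and the emoji list are available as Python sets and uses a simple regular-expression tokenizer, which is a stand-in rather than the tokenizer used in the project. The function name is hypothetical.

```python
import re


def coverage(tweets, emotion_words, emoji):
    """Share of tweets containing an emoji, a lexicon word, or either."""
    with_emoji = with_word = with_either = 0
    for text in tweets:
        has_emoji = any(symbol in text for symbol in emoji)
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        has_word = not tokens.isdisjoint(emotion_words)
        with_emoji += has_emoji
        with_word += has_word
        with_either += has_emoji or has_word
    total = len(tweets) or 1  # avoid dividing by zero on an empty data set
    return {
        "emoji": with_emoji / total,
        "word": with_word / total,
        "either": with_either / total,
    }
```

Because some tweets contain both a word and an emoji, the "either" share is smaller than the sum of the other two, which is consistent with 24% and 54% giving a union of 64%.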
Daily Basis
[Figures: daily plots; Saturdays are highlighted with a dot.]
Few Points
• Scripted accounts
• Spikes in the graphs are generally due to events, festivals, weekends or holidays
• 3 December has a spike because there was an event by EXO in Singapore (found via word count)
Unexplained Spikes in Graph
• There are a few days on which an unusually high number of tweets goes unexplained (e.g. 8-3-15)
• Tried a word counter on the tweets around 8-3-15, using the stop-word list from mysql.com
• Found a different issue instead:
• 2nd place is taken by the character @
• @ and # tags are generally important tags
• @[total] 39.2%
• @[space] 11.5%
• @[nospace] 27.7%
[Figures: word counts for the day around 8th March and the day around 13th March]
Examples
Challenge: how to combine places/locations/etc. that appear in different forms, such as Marina Bay Sands and MarinaBaySands?
• @marinabaysands: 998 tweets
• @ Marina Bay Sands: 2529 tweets
• @McDonald: 20 tweets
• @ McDonald: 1099 tweets
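One possible way to approach this challenge, offered only as a sketch and not as the method used in the project, is to reduce every variant to a canonical key by lower-casing it and dropping everything except letters and digits. The function names are hypothetical.

```python
import re
from collections import Counter


def normalize_place(name):
    """Canonical key: lower-case, with '@', spaces and punctuation removed."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def merge_counts(counts):
    """Sum the tweet counts of all variants that share a canonical key."""
    merged = Counter()
    for name, n in counts.items():
        merged[normalize_place(name)] += n
    return merged
```

Applied to the counts above, the two Marina Bay Sands variants merge into one key with 3527 tweets and the two McDonald variants into one with 1119. A caveat of this approach is that it also merges an account handle with a place tag even when they do not refer to the same thing.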
[Figures: further plots; Saturdays are highlighted with a dot.]
Done By:
Gadi Venkata Sai Rahul
May 21 2015 - June 22 2015
