The document discusses the classification of Twitter users into 4 categories based on the location and timezone information extracted from their tweets and profiles. Over 2000 users were classified, with their most recent 1000 tweets collected and analyzed. A classifier was built using a bag-of-words technique to categorize the users. The categories were then used to collect tweets from positive and negative users to build a training dataset for a classifier to identify bots.
A novel way of verifiable redistribution of the secret in a multiuser environ... (eSAT Publishing House)
This document proposes a novel method for verifiably redistributing secrets in a multi-user environment using threshold secret sharing and group keys. It involves a dealer distributing shares of a secret to authorized users, and a group manager who verifies members and notifies the dealer of any changes. If the group changes, the dealer generates new shares without involving old members, encrypts them using group members' public keys, and sends them to the group manager. The manager distributes the shares to the group using a group key. Members can verify their shares against hash values from the dealer. This allows secret redistribution without private channels or involvement of old members in generating new shares.
letter of recommendation Mann. from sima gordon, kav l'noar (Melech (Mel) Mann)
This document provides information about the staff and consulting board of the Community Mentoring Program. It lists the CEO, founding director, clinical supervisor, supervisors of the community and school mentoring programs, administrator, and consulting board which includes professionals in psychology, medicine, neuropsychology, halacha, addictions and law. The second part is a letter of recommendation for Mel Mann from their mentoring supervisor Sima Gordon, praising Mel's teamwork, initiative, creative approach, and ability to maintain a joyful atmosphere while focusing on clinical realities and progress requirements.
WebOffice is an offshore software solutions company based in Austria with a development office in Ahmedabad, India. They have been operating in India for two years and Austria for 13 years. The company provides services such as website development, graphics design, mobile app development, and custom software solutions primarily to German clients. WebOffice has 40 employees and hires professionals with skills in areas like graphic design, responsive web design, content management systems, e-commerce platforms, and mobile app development.
This document contains contact information for Arjan van der Meij, identifying him as a physics teacher, science leader, and chairman of several boards and committees related to education. It lists his email, Twitter handle, and two blogs. The rest of the document consists of bullet point lists of various maker education projects, tools, spaces, historical figures, and anecdotes from his experience in maker education.
The Energia SOI program is changing the future of learning and building stronger academic foundations. We offer the fastest-growing cognitive skills-building program for children's enrichment.
For the Program for Online Teaching Certificate class, a review of the three online pedagogical models. Creative Commons licensed Lisa M Lane Attribution-NonCommercial-ShareAlike 2012.
The document discusses the concept of irreversible processes and the arrow of time. It explains that many natural phenomena, like glass shattering or organisms aging, cannot go backwards due to increasing entropy. The second law of thermodynamics states that entropy in an isolated system is constantly increasing, which requires the distinction of past and future and defines the direction of time as irreversible processes move towards more disorder. Playing videos of these processes in reverse violates our intuition about entropy and the arrow of time.
Privacy and Security in Online Social Media: Trust and Credibility on OSM (IIIT Hyderabad)
This document summarizes a lecture on privacy and security in online social media. It discusses analyzing misinformation spread on social media during real-world events like hurricanes and bombings. Features of tweets and user profiles are used to classify tweets as real or fake. A Chrome extension called TweetCred is demonstrated that analyzes tweets in real-time to assess credibility using machine learning models trained on these features. The lecture covers collecting, filtering, and annotating social media data from events. Network and linguistic analysis are used to understand information flow and credibility.
TweetCred: Real-Time Credibility Assessment of Content on Twitter @ Socinfo... (IIIT Hyderabad)
This document describes research on real-time credibility assessment of tweets. The researchers created a system called TweetCred that scores tweets for credibility in real-time based on a semi-supervised ranking model. TweetCred was deployed live and scored over 7 million tweets from over 1,400 Twitter users. The researchers evaluated TweetCred on response time, effectiveness, and usability based on surveys of 67 users, finding an average usability score of 70. Future work could focus on personalizing credibility scores based on a user's social network and exploring psychological factors influencing information credibility on Twitter.
Detecting Good Abandonment in Mobile Search (Julia Kiseleva)
Web search queries for which there are no clicks are referred to as abandoned queries and are usually considered as leading to user dissatisfaction. However, there are many cases where a user may not click on any search result page (SERP) but still be satisfied. This scenario is referred to as good abandonment and presents a challenge for most approaches measuring search satisfaction, which are usually based on clicks and dwell time. The problem is exacerbated further on mobile devices where search providers try to increase the likelihood of users being satisfied directly by the SERP. This paper proposes a solution to this problem using gesture interactions, such as reading times and touch actions, as signals for differentiating between good and bad abandonment. These signals go beyond clicks and characterize user behavior in cases where clicks are not needed to achieve satisfaction. We study different good abandonment scenarios and investigate the different elements on a SERP that may lead to good abandonment. We also present an analysis of the correlation between user gesture features and satisfaction. Finally, we use this analysis to build models to automatically identify good abandonment in mobile search achieving an accuracy of 75%, which is significantly better than considering query and session signals alone. Our findings have implications for the study and application of user satisfaction in search systems.
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte... (Jeffrey Nichols)
There has been much effort on studying how social media sites, such as Twitter, help propagate information in different situations, including spreading alerts and SOS messages in an emergency. However, existing work has not addressed how to actively identify and engage the right strangers at the right time on social media to help effectively propagate intended information within a desired time frame. To address this problem, we have developed two models: (i) a feature-based model that leverages people's exhibited social behavior, including the content of their tweets and social interactions, to characterize their willingness and readiness to propagate information on Twitter via the act of retweeting; and (ii) a wait-time model based on a user's previous retweeting wait times to predict her next retweeting time when asked. Based on these two models, we build a recommender system that predicts the likelihood of a stranger to retweet information when asked, within a specific time window, and recommends the top-N qualified strangers to engage with. Our experiments, including live studies in the real world, demonstrate the effectiveness of our work.
Presented at Intelligent User Interfaces 2014, Haifa, Israel. February 27, 2014.
The document discusses data visualization and social media analysis. It describes a faculty member who is interested in data visualization, big data, and social media. Examples are provided of analyzing crowdfunding data from multiple online sources and visualizing Twitter networks related to specific topics. Metrics for optimizing crowdfunding campaigns through social media are suggested. Graphs are presented analyzing tweet volumes related to a comet and the growth of Twitter followers over time.
This paper analyzes social media conversations around the TomorrowWorld music festival through two Twitter data sets collected a month apart. It finds that the first data set focused mainly on performances from this year's festival, while the second shifted to next year's event. The paper also examines the Twitter account @belugaPOD and recommends increasing interactions with important users and involvement in smaller conversations to improve their presence. Google Analytics showed most important website visitors came from SoundCloud. Overall, the paper aims to understand social media discussions of TomorrowWorld and how to enhance @belugaPOD's online and social media presence.
Management and analysis of social media data (Weining Qian)
This document discusses social media data analysis based on a case study of Sina Weibo data. It outlines collecting data through a distributed crawler, modeling the spread of tweets using sigmoid functions, and developing a schema to manage the user, tweet, retweet and followship network data. Queries are proposed to analyze trends, influential users and communities. Ongoing work includes developing a social media data generator for benchmarking, analyzing collective behavior and mood over time, and creating a shared dataset of trending topics on Sina Weibo.
PeopleBrowsr SXSW 2009 and 2010 Analytics and Sentiment Comparison (PeopleBrowsr)
The document analyzes tweets from SXSW 2009 and 2010 conferences. It provides statistics on the total mentions of keywords like #sxsw and influential twitter users each year. Sentiment analysis of sampled tweets from each year found most were neutral, with a higher percentage being positive in 2010. It also describes PeopleBrowsr's tools and strategies for analyzing large-scale social media conversations in real-time.
2017 05-26 NodeXL Twitter search #shakeupshow (Marc Smith)
The document is a report generated by NodeXL analyzing a Twitter network related to the hashtag #ShakeUpShow. It includes metrics on the network such as the number of nodes and edges, top influencers by number of followers, most shared URLs, domains, hashtags, words and word pairs used. It also lists the top accounts replied to, mentioned and most active tweeters in the network.
1) The document describes using Twitter data to detect real-world events in real-time, specifically focusing on detecting earthquakes using tweets from Japan.
2) An algorithm is proposed that performs semantic analysis on tweets to identify those related to earthquakes, and uses Twitter users' locations and posting times to estimate where and when earthquakes occurred.
3) The system was evaluated on past earthquakes in Japan, finding it could detect 96% of magnitude 3 or greater quakes and send alert emails before official announcements, with the fastest alerts being sent 19 seconds before.
Brand Digital Asset Analysis (Facebook FanPage & Twitter) (MediaWave)
It's not only about the number of fans and followers. Interaction and engagement rates are important keys for a brand's Facebook and Twitter accounts. Understanding your fans and followers can help you maximize your account. You can monitor your competitors too!
First study on a complete dataset of Tweet
Speech presented during the 4th edition of the Transforming Audiences Conference, University of Westminster, 3 September 2013
The slides present a model and an application that can be used to assess chat conversations according to their content, which is related to a number of imposed topics, and to the personal involvement of the participants. The main theoretical ideas that stand behind this application are Bakhtin’s polyphony theory and Tannen’s ideas related to the use of repetitions. The results of the application are validated against the gold standard provided by two teachers from the Human-Computer Interaction domain evaluating the same chats, and after that the verification is done using another teacher from the same domain. During the verification we also show that the model used for chat evaluation is dependent on the number of participants in that chat.
Twitter Timeline and Search Distributed System.pptx (Md. Rakib Trofder)
Design the Twitter timeline and search
Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
Design the Facebook feed and Design Facebook search are similar questions.
Step 1: Outline use cases and constraints
Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
Use cases
We'll scope the problem to handle only the following use cases:
• User posts a tweet
  • Service pushes tweets to followers, sending push notifications and emails
• User views the user timeline (activity from the user)
• User views the home timeline (activity from people the user is following)
• User searches keywords
• Service has high availability
Out of scope
• Service pushes tweets to the Twitter Firehose and other streams
• Service strips out tweets based on users' visibility settings
  • Hide @reply if the user is not also following the person being replied to
  • Respect 'hide retweets' setting
• Analytics
Constraints and assumptions
State assumptions
General
• Traffic is not evenly distributed
• Posting a tweet should be fast
• Fanning out a tweet to all of your followers should be fast, unless you have millions of followers
• 100 million active users
• 500 million tweets per day or 15 billion tweets per month
  • Each tweet averages a fanout of 10 deliveries
  • 5 billion total tweets delivered on fanout per day
  • 150 billion tweets delivered on fanout per month
• 250 billion read requests per month
• 10 billion searches per month
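The monthly figures above follow from the daily ones by simple arithmetic, and converting them to per-second rates is a common next step when sizing such a system. The sketch below reproduces that arithmetic; the 30-day month and the helper name `per_second` are assumptions made for illustration, not part of the original outline.

```python
# Back-of-envelope check of the traffic assumptions (assumes a 30-day month).
DAYS_PER_MONTH = 30
SECONDS_PER_MONTH = DAYS_PER_MONTH * 24 * 3600  # 2,592,000 seconds

tweets_per_day = 500_000_000
fanout_per_tweet = 10

tweets_per_month = tweets_per_day * DAYS_PER_MONTH        # 15 billion
fanout_per_day = tweets_per_day * fanout_per_tweet        # 5 billion
fanout_per_month = fanout_per_day * DAYS_PER_MONTH        # 150 billion
reads_per_month = 250_000_000_000
searches_per_month = 10_000_000_000


def per_second(per_month):
    """Convert a monthly volume into an average per-second rate."""
    return per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    for name, volume in [
        ("tweets", tweets_per_month),
        ("fanout deliveries", fanout_per_month),
        ("read requests", reads_per_month),
        ("searches", searches_per_month),
    ]:
        print(f"{name}: about {per_second(volume):,.0f} per second on average")
```

These are averages only; since traffic is not evenly distributed, peak rates would be higher.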
The document discusses detecting trends through Twitter streams. It describes how Twitter tracks terms that appear with high frequency over time to identify trending topics. Specifically, it presents a method to extract trending topics from Twitter's API using the Z-score algorithm and Lossy-Counting streaming algorithm. The author conducted an experiment running this approach over 400-600 minutes of Twitter data each day, which identified the most frequent terms occurring in around 90% of minutes.
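As a rough illustration of the Z-score idea mentioned in this summary, the sketch below scores how unusual a term's count in the current minute is compared with its counts in earlier minutes. It is a generic formulation under assumed inputs (per-minute counts for one term); the function name and threshold are hypothetical and are not taken from the summarized document, and the Lossy-Counting step is not shown.

```python
from statistics import mean, pstdev


def z_score(history, current):
    """Standard score of the current per-minute count against past counts."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No variation in the history, so the score is undefined; treat as 0.
        return 0.0
    return (current - mu) / sigma


def is_trending(history, current, threshold=3.0):
    """Flag a term as trending when its score exceeds an assumed threshold."""
    return z_score(history, current) > threshold


if __name__ == "__main__":
    past_counts = [10, 12, 8, 10]   # illustrative counts for one term
    print(is_trending(past_counts, 24))
    print(is_trending(past_counts, 11))
```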
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights (Carla Marini)
Lab project carried out for the Master in Business Intelligence & Big Data Analytics, in the Web Data Analytics module.
Analysis of fashion-related topics on Reddit. Data Scraping, Data Cleaning, Data Clustering, Text Mining and Sentiment Analysis.
Credibility Ranking of Tweets during High Impact Events (IIIT Hyderabad)
Twitter has evolved from being a conversation or opinion sharing medium among friends into a platform to share and disseminate information about current events. Events in the real world create a corresponding spur of posts (tweets) on Twitter. Not all content posted on Twitter is trustworthy or useful in providing information about the event. In this paper, we analyzed the credibility of information in tweets corresponding to fourteen high impact news events of 2011 around the globe. From the data we analyzed, on average 30% of total tweets posted about an event contained situational information about the event while 14% was spam. Only 17% of the total tweets posted about the event contained situational awareness information that was credible. Using regression analysis, we identified the important content and source based features, which can predict the credibility of information in a tweet. Prominent content based features were number of unique characters, swear words, pronouns, and emoticons in a tweet, and user based features like the number of followers and length of username. We adopted a supervised machine learning and relevance feedback approach using the above features, to rank tweets according to their credibility score. The performance of our ranking algorithm was significantly enhanced when we applied a re-ranking strategy. Results show that extraction of credible information from Twitter can be automated with high confidence.
Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hu... (IIIT Hyderabad)
In today's world, online social media plays a vital role during real world events, especially crisis events. There are both positive and negative effects of social media coverage of events: it can be used by authorities for effective disaster management or by malicious entities to spread rumors and fake news. The aim of this paper is to highlight the role of Twitter during Hurricane Sandy (2012) in spreading fake images about the disaster. We identified 10,350 unique tweets containing fake images that were circulated on Twitter during Hurricane Sandy. We performed a characterization analysis to understand the temporal, social reputation and influence patterns for the spread of fake images. Eighty six percent of tweets spreading the fake images were retweets, hence very few were original tweets. Our results showed that the top thirty users out of 10,215 users (0.3%) resulted in 90% of the retweets of fake images; also, network links such as follower relationships of Twitter contributed very little (only 11%) to the spread of these fake photo URLs. Next, we used classification models to distinguish fake images from real images of Hurricane Sandy. The best results were obtained from the Decision Tree classifier, with 97% accuracy in predicting fake images from real. Also, tweet based features were very effective in distinguishing fake image tweets from real, while the performance of user based features was very poor. Our results showed that automated techniques can be used in identifying real images from fake images posted on Twitter.
1. User Classification
Approach:
• Classified current users into 4 categories
• Treated one category as positive, one as negative, and two as unknown
• Collected the latest 1000 tweets from approximately 2000 users (1000 users per class)
• Built a classifier using the bag-of-words technique
2. Categories
User Location and User TimeZone were used to categorize users into 4 types:
Type  Location                                          TimeZone   Percentage
1     Empty                                             Singapore  31
2     Singapore/sg/+65/spore/s'pore/place in Singapore  anything   44
3     City/country other than Singapore                 Singapore  13
4     Random text                                       Singapore  11
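The four-way split by location and time zone can be sketched as follows. This is a minimal illustration, not the authors' implementation: the place lists `SG_PLACES` and `OTHER_PLACES` are hypothetical stand-ins for whatever gazetteer was actually used, and users whose time zone is not Singapore and whose location does not point to Singapore are assumed to fall outside the four types.

```python
import re

SG_TOKENS = {"singapore", "sg", "spore", "s'pore"}
SG_PLACES = {"jurong", "tampines", "bedok"}                 # hypothetical examples
OTHER_PLACES = {"jakarta", "indonesia", "london", "manila"}  # hypothetical examples


def categorize(location, time_zone):
    """Return the user type (1-4) from the slide's table, or None if no type applies."""
    loc = (location or "").strip().lower()
    tokens = set(re.findall(r"[a-z']+", loc))
    # Type 2: the location itself points to Singapore; the time zone can be anything.
    if "+65" in loc or tokens & (SG_TOKENS | SG_PLACES):
        return 2
    # The remaining types all require a Singapore time zone.
    if (time_zone or "").strip().lower() != "singapore":
        return None
    if not loc:
        return 1  # empty location
    if tokens & OTHER_PLACES:
        return 3  # a known city/country other than Singapore
    return 4      # random text
```

Telling type 3 apart from type 4 depends entirely on the quality of the place list, so a real run would need a much larger gazetteer than the sets shown here.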
3. Considered types 1 and 4 as unknown, type 2 as positive, and type 3 as negative.
Collected 1000 tweets for each of 1000 users of types 2 and 3 (data collection took over a day).
Used the sklearn package to build the classifier.
Used sklearn's stop-word removal together with our own tokenizer.
Used 80% of the data as the training set and 20% as the test set.
Used the svm.LinearSVC classifier.
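The setup on this slide could look roughly like the sketch below. It assumes each user is represented by one string holding all of their tweets, and `simple_tokenizer` is a hypothetical placeholder for the custom tokenizer, which the slides do not describe. The English stop-word list is also an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def simple_tokenizer(text):
    # Hypothetical stand-in for the authors' tokenizer: split on whitespace.
    return text.split()


def train_user_classifier(user_texts, labels, seed=0):
    """Train a bag-of-words LinearSVC on an 80/20 split; return (model, test accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        user_texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    model = make_pipeline(
        # token_pattern=None because a custom tokenizer replaces the default pattern.
        CountVectorizer(tokenizer=simple_tokenizer, token_pattern=None,
                        stop_words="english"),
        LinearSVC(random_state=seed),
    )
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)
```

The returned model can then label users of unknown type with `model.predict([...])`.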
9. Out of 2989 users in the region above, 1713 were scanned.
• The region above is expected to have a high number of bots.
• Users are classified using Bot or Not.
• The region is 1900-2100 friends vs. 0-2000 followers.
• Scanned only users expected to be non-protected and expected to have more than 100 tweets (2100, but 400 failed).
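The selection rule described in these bullets can be written as a small filter. This is a sketch that assumes user records shaped like Twitter user objects (`friends_count`, `followers_count`, `protected`, `statuses_count`); the function names are illustrative.

```python
def in_bot_prone_region(user):
    """True if the user lies in the 1900-2100 friends vs. 0-2000 followers region."""
    return (1900 <= user["friends_count"] <= 2100
            and 0 <= user["followers_count"] <= 2000)


def should_scan(user):
    """Scan only non-protected users in the region with more than 100 tweets."""
    return (in_bot_prone_region(user)
            and not user["protected"]
            and user["statuses_count"] > 100)
```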
10. The first 677 users in the DB were tested by Bot or Not
11. Number of Protected Users
Count of tweets  Protected sg  Not protected sg  Protected all  Not protected all
< 100            18 978        85 328            43 909         131 125
>= 100           71 141        99 484            201 241        248 970
total            80 219        184 812           245 150        380 095
combined total   265 031 (sg)                    625 245 (all)
12. Bot or Not Test by Truthy
• Out of the 99 484 users that are probably non-protected and have more than 100 tweets, 85 280 were tested with the Truthy score.
• The pie chart shows the distribution of users by their bot-likelihood score.
13. What happens when we follow users? (20K users)
Type                             JSON format                                                  Percentage
User is sender of the tweet      ['tweet']['user']['id_str']                                  71
User's tweet has been retweeted  ['tweet']['retweeted_status']['in_reply_to_user_id_str'] &&  7.67
                                 ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         ['tweet']['in_reply_to_user_id_str']                         11.2
User mentions and unknown        -                                                            10
14. What happens when we follow users? 24 June to 9 July (20K users*)
Type                             JSON format                                                  Percentage
User is sender of the tweet      ['tweet']['user']['id_str']                                  55.2 (230K)
User's tweet has been retweeted  ['tweet']['retweeted_status']['in_reply_to_user_id_str'] &&  14.6
                                 ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         ['tweet']['in_reply_to_user_id_str']                         8.78
User mention                     ['tweet']['entities']['user_mentions']                       0.23
Unknown                          -                                                            20.9
* These 20K users are different from the users on the previous slide
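The role a followed user plays in a streamed tweet can be derived from the JSON paths listed in the tables. The sketch below assumes each stream record wraps the tweet under a `'tweet'` key, as the paths suggest. Treating the `&&` row as "either field matches" and the order of the checks are assumptions.

```python
from collections import Counter


def followed_user_role(event, followed_id):
    """Classify how followed_id relates to one streamed tweet record."""
    tweet = event["tweet"]
    if tweet.get("user", {}).get("id_str") == followed_id:
        return "sender"
    retweeted = tweet.get("retweeted_status") or {}
    if followed_id in (retweeted.get("user", {}).get("id_str"),
                       retweeted.get("in_reply_to_user_id_str")):
        return "retweeted"
    if tweet.get("in_reply_to_user_id_str") == followed_id:
        return "replied_to"
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if any(m.get("id_str") == followed_id for m in mentions):
        return "mentioned"
    return "unknown"


def role_percentages(pairs):
    """pairs: iterable of (event, followed_id). Returns {role: percentage}."""
    counts = Counter(followed_user_role(event, uid) for event, uid in pairs)
    total = sum(counts.values())
    return {role: 100.0 * n / total for role, n in counts.items()}
```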
15. Some statistics on 221K tweets (known category) (cont.)
20K users were followed, but tweets were received from only 3752 distinct users. Only 221K of the 230K tweets were analyzed.
Count of tweets            Field
40.9K                      in_reply_to_user_id = not null
37.5K                      in_reply_to_status_id = not null
1938 (360 distinct users)  geo = true
3363                       in_reply_to_user_id set but in_reply_to_status_id not set
90.2K                      Retweets
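Counts like the ones in this table can be computed in a single pass over the tweets. The sketch below assumes plain tweet dictionaries with the standard `in_reply_to_user_id`, `in_reply_to_status_id`, `geo`, `retweeted_status` and `user` fields; the output key names are illustrative.

```python
def field_stats(tweets):
    """Count reply, geo and retweet fields over an iterable of tweet dicts."""
    stats = {"reply_user": 0, "reply_status": 0, "geo": 0,
             "reply_user_without_status": 0, "retweets": 0}
    geo_users = set()
    for tweet in tweets:
        has_user = tweet.get("in_reply_to_user_id") is not None
        has_status = tweet.get("in_reply_to_status_id") is not None
        stats["reply_user"] += has_user
        stats["reply_status"] += has_status
        stats["reply_user_without_status"] += has_user and not has_status
        if tweet.get("geo"):
            stats["geo"] += 1
            geo_users.add(tweet["user"]["id_str"])
        if tweet.get("retweeted_status"):
            stats["retweets"] += 1
    stats["geo_distinct_users"] = len(geo_users)
    return stats
```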
16. Some statistics on 131K tweets (known category)
The 131K tweets are the 221K tweets from the previous slide with retweets removed.
The 131K tweets come from 3533 distinct users.
* Value is the same as on the previous slide
Count of tweets             Field
40.9K*                      in_reply_to_user_id = not null
37.5K*                      in_reply_to_status_id = not null
1938 (360 distinct users)*  geo = true
3363*                       in_reply_to_user_id set but in_reply_to_status_id not set
17. Some statistics on 131K tweets (known category)
Count of tweets  Number of users  Source
43 214           1331             Twitter for iPhone
38 256           1079             Twitter for Android
12 577           974              Twitter Web Client
5 658            1016             Instagram
3 420            256              Facebook
3 066            67               TweetDeck
...              ...              ...
2005             1                AFF Autotweet
...              ...              ...
18. Some statistics on 131K tweets (known category): tweets mentioning a URL
Count  Users  Tweet domain
96K    2681   Null (no URL mentioned)
5.9K   794    Twitter.com
5.7K   1037   Instagram
URL count             Number of tweets
1 only                34 852
2 only                481
3 only                24
4 only (the maximum)  1
Out of the 34.8K tweets with a URL, 15K have a URL domain that differs from the actual domain.
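One way to check whether the domain shown in a tweet differs from the domain it actually points to is to compare the host names of the two URLs. The slides do not say which URL fields were compared, so the pairing below (for example a shortened or displayed URL against an expanded one) is an assumption, as are the function names.

```python
from urllib.parse import urlparse


def domain(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    # URLs without a scheme (e.g. 'instagram.com/p/abc') need '//' to parse a host.
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domains_differ(shown_url, actual_url):
    """True if the two URLs point to different hosts."""
    return domain(shown_url) != domain(actual_url)
```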
20. 2M geotagged tweets collected from Oct 30th
Source               Tweet count  Percentage
Twitter for iPhone   817K         40.8%
Twitter for Android  641K         32%
Instagram            265K         13.3%
Foursquare           193K         9.7%
Others               83.5K        4.2%
21. What happens when we follow users? From 3 July to 9 July (20K^)
Type                             JSON format                                                  Percentage
User is sender of the tweet      ['tweet']['user']['id_str']                                  52 (76.5K)
User's tweet has been retweeted  ['tweet']['retweeted_status']['in_reply_to_user_id_str'] &&  16.8
                                 ['tweet']['retweeted_status']['user']['id_str']
User has been replied to         ['tweet']['in_reply_to_user_id_str']                         8.9
User mentions and unknown        -                                                            25.5
^ The same 20K users as on the previous slide
22. Some statistics on 76.5K tweets (known category)
20K users were followed over 5 days, but tweets were received from only 2950 users.
Count of tweets           Field
13.4K                     in_reply_to_user_id = not null
12.3K                     in_reply_to_status_id = not null
720 (231 distinct users)  geo = true
1099                      in_reply_to_user_id set but in_reply_to_status_id not set
23. From 3 July to 9 July
Out of 76.5K tweets, only 720 (0.94%) are geotagged.
Out of 76.5K tweets, 7K showed a positive location (type 1 or 2).
Out of the 720 geotagged tweets, 330 showed a positive location (type 1 or 2).
Out of the 720 geotagged tweets, 175 showed a positive location (type 2 only).
About 60 of the 76.5K tweets are duplicates.
24. Two months
Started collecting tweets (user timelines) of unknown-sector users from April 28th, 2015.
Used about 1.4M tweets for our location detection.
6% of the tweets showed a positive location.
Format: name, number of times, number of users.
The displayed statistics cover about 2695 users.
25. Some statistics on 11M tweets (unknown category)
26K users over two months
Count of tweets             Field
2.05M                       in_reply_to_user_id = not null
1.96M                       in_reply_to_status_id = not null
115K (6913 distinct users)  geo = true
94.7K                       in_reply_to_user_id set but in_reply_to_status_id not set
26. Out of 6913 users (unknown category)
Geo tweets        User count  User percent  Min 30 tweets count
<1%               2293        33%           2293
<2% and >=1%      902         13%           902
<5% and >=2%      1260        18.2%         1210
<10% and >=5%     813         11.7%         743
<25% and >=10%    794         11.4%         636
<50% and >=25%    462         6.6%          296
>=50%             389         5.6%          174
27.
28. A few statistics on 5.96M tweets of known Singaporeans
Count of tweets     Field
1.25M               in_reply_to_user_id = not null
1.15M               in_reply_to_status_id = not null
302K (10421 users)  geo = true
101K                in_reply_to_user_id set but in_reply_to_status_id not set
From about 31K users, with at most the last 200 tweets per user.
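Tables that group users by their share of geotagged tweets, such as the one on slide 26, can be built as sketched below. The input is assumed to map each user to a `(geo_tweets, total_tweets)` pair, and `min_tweets` reflects one possible reading of the "Min 30 tweets" column (users with at least 30 tweets); both are assumptions.

```python
from collections import Counter

# Upper bounds (exclusive) of the geo-tweet share buckets used in the slides.
BUCKETS = [(0.01, "<1%"), (0.02, "<2% and >=1%"), (0.05, "<5% and >=2%"),
           (0.10, "<10% and >=5%"), (0.25, "<25% and >=10%"),
           (0.50, "<50% and >=25%")]


def geo_bucket(geo_tweets, total_tweets):
    """Return the bucket label for one user's share of geotagged tweets."""
    share = geo_tweets / total_tweets
    for upper, label in BUCKETS:
        if share < upper:
            return label
    return ">=50%"


def bucket_users(per_user_counts, min_tweets=1):
    """per_user_counts: {user_id: (geo_tweets, total_tweets)} -> Counter of labels."""
    counts = Counter()
    for geo, total in per_user_counts.values():
        if total >= max(min_tweets, 1):
            counts[geo_bucket(geo, total)] += 1
    return counts
```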
29. Out of 10421 users (known category)
Geo tweets        User count  User percent
<1%               1272        12.2%
<2% and >=1%      1434        13.7%
<5% and >=2%      1928        18.5%
<10% and >=5%     1479        14%
<25% and >=10%    2125        20.5%
<50% and >=25%    1428        13.6%
>=50%             755         7.2%
30.
31. Mainstream crawler and actual data
Made a new stream with
FILTER_KEYWORDS = ['changi airport', 'fansofchangi', 'cineleisure orchard', 'vivo city', 'ion orchard', 'causewaypoint', 'woodlands checkpoint', 'gardensbythebay', 'bugisjunction', 'far east plaza', 'itecollegeeast', 'ite college west', 'ite college central']
and a few variations of these keywords.
Got around 4.1K tweets from the new stream.
In the same time frame, 20K tweets were collected by Mainstream.
20% hit rate (20% of the new stream's tweets are in Mainstream).
Recall that Mainstream is the stream of geotagged tweets from Singapore.
1134 (27.5%) of the 4.1K tweets are geotagged, and 834 (20%) are found in Mainstream.
Of the 300 (7.5%) geotagged tweets that are not found in Mainstream:
31 tweets are outside Singapore
279 tweets are inside Singapore
Out of the 4.1K tweets, only 2.5K show a positive location in our location detector.
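The hit rate quoted on this slide is the share of the keyword stream's tweets that also appear in Mainstream. A minimal sketch, assuming both streams are available as collections of tweet IDs (the function name is illustrative):

```python
def hit_rate(new_stream_ids, mainstream_ids):
    """Fraction of tweets from the new stream that are also in Mainstream."""
    new_ids = set(new_stream_ids)
    if not new_ids:
        return 0.0
    return len(new_ids & set(mainstream_ids)) / len(new_ids)
```

With the slide's figures, 834 of about 4.1K tweets gives 834 / 4100, or roughly 20%.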
32. Emotion Identification of Tweets
Have a list of 8222 emotion words, each classified as positive, negative, or neutral and as strongly or weakly subjective.
Have a list of 1500 emoji.
Have a data set of tweets covering around 200 days, from Oct 30th, 2014 to May 5th, 2015.
Around 24% of tweets contain at least one emoji.
Around 54% of tweets contain at least one of the 8222 words.
Around 64% of tweets contain at least one word or emoji (the union of the two cases above).
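Coverage figures like these can be computed by checking each tweet against the word list and the emoji list. The sketch below assumes the lexicon is a set of lower-case words and the emoji list is a set of emoji strings. The simple regular-expression tokenization is also an assumption.

```python
import re


def lexicon_coverage(tweets, emotion_words, emojis):
    """Return the fraction of tweets with an emoji, an emotion word, or either."""
    with_emoji = with_word = with_either = 0
    for text in tweets:
        has_emoji = any(e in text for e in emojis)
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        has_word = bool(tokens & emotion_words)
        with_emoji += has_emoji
        with_word += has_word
        with_either += has_emoji or has_word
    n = len(tweets) or 1  # avoid division by zero on an empty data set
    return {"emoji": with_emoji / n, "word": with_word / n, "either": with_either / n}
```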
43. A Few Points
Spikes in the graphs are generally caused by events, festivals, weekends, or holidays.
3rd December has a spike because there was an event by EXO in Singapore (found via word count).
44. Unexplained Spikes in Graph
On a few days, a higher number of tweets per day remains unexplained (e.g. 8-3-15).
Tried a word counter around 8-3-15, using the stop words from mysql.com.
Found some other issue.
The 2nd most frequent token is the character @.
@ and # tags are generally important tags.
@[total] 39.2%
@[space] 11.5%
@[nospace] 27.7%
[Figure panels: day around 8th March; day around 13th March]
45. Examples
Challenge: how to combine variants of places/locations, such as Marina Bay Sands and MarinaBaySands?
• @marinabaysands 998 tweets
• @ Marina Bay Sands 2529 tweets
• @McDonald 20 tweets
• @ McDonald 1099 tweets
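One possible way to merge such variants is to strip spaces and punctuation from both the place name and the tweet text before matching, so that "@ Marina Bay Sands" and "@marinabaysands" reduce to the same key. This is a sketch of that idea, not the approach chosen in the slides, which leave the question open. The function names and the list of place names are illustrative.

```python
import re


def squash(text):
    """Lower-case the text and keep only letters, digits and '@'."""
    return re.sub(r"[^a-z0-9@]", "", text.lower())


def count_place_mentions(tweets, place_names):
    """Count tweets that mention each place after an '@', with or without spaces."""
    keys = {name: "@" + re.sub(r"[^a-z0-9]", "", name.lower()) for name in place_names}
    counts = {name: 0 for name in place_names}
    for text in tweets:
        squashed = squash(text)
        for name, key in keys.items():
            if key in squashed:
                counts[name] += 1
    return counts
```

Because matching is by prefix on the squashed text, a short name such as "Marina Bay" would also match "@marinabaysands", so longer names should be checked first if both are in the list.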