Data Mining on Twitter

6,213 views

With the tremendous growth of social networks, the amount of new data created every minute on these sites has grown as well. The notion of community in social networking has attracted considerable attention. Studying Twitter is useful for understanding how people use new communication technologies to form social connections and maintain existing ones. We analysed how geo-tagged tweets can be used to identify useful user features and behavior, as well as landmarks and places of interest. We also analysed several clustering algorithms and proposed different similarity measures to detect communities.

Published in: Education, Technology

  1. Data Mining and Analysis on Twitter
     Company Proprietary and Confidential. Copyright Info Goes Here Just Like This
  2. Professor
     • Prof. Pascal Frossard
     Project Supervisor
     • Xiaowen Dong
     Students
     • Pulkit Goyal (twitter.com/pulkit110)
     • Sapan Diwakar (twitter.com/diwakarsapan)
  3. Contents: Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  4. Objective
     • Large amount of new data created every minute on social networking sites
       – Difficult to obtain and interpret
       – Collect data to allow for further analysis
     • Identify online communities of users on Twitter
     • Explore the reasons for user interactions as a step towards predicting future interactions
  5. Contents: Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  6. Twitter at a glance
     • Micro-blogging platform
     • Since March 2006
     • Status update
     • 300 million users (June 2011)
     • Giant chat room
     • Instant messaging
  7. Lingo
     • Tweet - A message of 140 characters or less
     • Retweet - A repeat of somebody else's tweet
     • Hashtag - A #term included in a tweet (used for tracking)
     • Reply/Mention - A reference to another user in a tweet
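
The terms above can be recognised in raw tweet text with a few regular expressions. The following is a minimal sketch, not part of the original slides: the function name `parse_tweet` is hypothetical, and treating a leading `RT @user` as a retweet is an assumption based on the old manual-retweet convention.

```python
import re

# Patterns for the lingo above; \w+ is a simplification of Twitter's real rules.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^RT @(\w+)")  # assumed manual-retweet convention


def parse_tweet(text):
    """Return hashtags, mentioned users and simple flags for one tweet."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": [name.lower() for name in MENTION_RE.findall(text)],
        "is_retweet": RETWEET_RE.match(text) is not None,
        "within_limit": len(text) <= 140,  # tweet length limit from the slide
    }
```

Lower-casing the extracted terms is a design choice so that `#MUFC` and `#mufc` count as the same hashtag in later similarity computations.
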
  8. Contents: Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  9. Modules
     • Data Collection
       – Set up a system to collect data based on some constraints
     • Visualization
       – Build some visualizations based on the collected data
       – Analyze the results
     • Community Detection
       – Identify communities of users on Twitter based on several different similarity measures
     • Analysis of Future Mentions
       – Identify factors for future mentions between users on Twitter
  10. Contents: Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  11. Data Collection | Data based on location
      • Collect data based on locations:
        – London
        – New York
        – Paris
        – San Francisco
        – Mumbai
      Objectives:
      • Model the spread of interests
        – Time
        – Location
        – Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
        – Friendship/Social Connections
        – Common Interests
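
Location-based collection can be illustrated as a bounding-box test on geo-tagged tweets. This is only a sketch under stated assumptions: the slides do not say how locations were matched, the function name `city_of` is hypothetical, and the London coordinates are rough illustrative values rather than the boundaries the authors used.

```python
# Approximate bounding boxes as (south, west, north, east) in degrees.
# Illustrative values only; not taken from the slides.
BOXES = {
    "London": (51.28, -0.51, 51.69, 0.33),
}


def city_of(lat, lon, boxes=BOXES):
    """Return the first city whose bounding box contains the point, else None."""
    for city, (south, west, north, east) in boxes.items():
        if south <= lat <= north and west <= lon <= east:
            return city
    return None
```

A tweet geo-tagged near Big Ben (about 51.5007, -0.1246) would be assigned to London, while a point in central Paris would return `None` until a Paris box is added.
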
  12. Data Collection | Data based on topics
      • Collect data based on keywords
        – Apple (Tech)
        – Manchester United (Soccer)
      Objectives:
      • Model the spread of interests
        – Time
        – Location
        – Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
        – Friendship/Social Connections
        – Common Interests
  13. Data Collection | Data from a group of users
      • Collect tweets from a "group of users"
        – Group of around 25k users
          • Created by a specified user
          • Explicitly in-reply-to a status created by a specified user (pressed reply button)
      Objectives:
      • Model the spread of interests
        – Time
        – Location
        – Rate of information flow
      • Identify future events
      • Identify landmarks
      • Model relationships among users
        – Friendship/Social Connections
        – Common Interests
      Figure: Overview of links we use to collect users
  14. Contents: Objective • Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  15. Visualization Results | Streets of London
      • Setup
        – Geo-tagged tweets for one week (16 to 22 August 2011)
          • 111,206 tweets
  16. Visualization Results | Streets of London | 1 week
      • Analysis
        – High density of tweets from famous places/tourist attractions
        – Clustering of tweets
        – Content of tweets can be used to predict the place
        – More tweets along the roads/streets
      Landmarks labelled on the map: National Gallery, London Waterloo Rail, The Big Ben, London Victoria Rail, Oval Cricket Ground, Greenwich
  18. Tweets in London | Aggregated by wards
      Map legend: No. of tweets in increasing order
  19. Tweets about a topic | Manchester United
      • Setup
        – Data for two weeks (27 Oct to 8 Nov 2011)
      • Keywords
        – "manchesterunited", "manchester united", "manchester utd", "man united", "manutd", "man utd", "manu", "mufc"
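
Keyword-based collection can be approximated offline with a case-insensitive matcher over the keyword list above. This is a hedged sketch: the slides do not describe the matching rules actually used, and the names `build_matcher` and `mentions_mufc` are hypothetical. Whole-term matching is chosen here so that a short keyword such as "manu" does not fire on words like "manual".

```python
import re

# Keyword list as given on the slide.
MUFC_KEYWORDS = [
    "manchesterunited", "manchester united", "manchester utd",
    "man united", "manutd", "man utd", "manu", "mufc",
]


def build_matcher(keywords):
    """Return a predicate that is True when any keyword occurs as a whole term."""
    # Longest first so multi-word phrases are tried before shorter alternatives.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


mentions_mufc = build_matcher(MUFC_KEYWORDS)
```

The same `build_matcher` helper could be reused with any other keyword list, for example the Apple product terms used for the second topic.
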
  20. Visualization Results | Tweets About Manchester United
      Analysis
      • More tweets in and around Europe
        – Manchester United plays in the English Premier League and has its home ground in Manchester
      • High volume of tweets from countries whose players play for Manchester United
      • High popularity of Manchester United in Indonesia and Malaysia
  21. Tweets about a topic | Apple
      • Setup
        – Data for two weeks (27 Oct to 8 Nov 2011)
      • Keywords
        – "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx", "osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch", "itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s", "iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3"
  22. Visualization Results | Tweets About Apple
      Analysis
      • High volume of tweets in the USA and Europe
        – Popularity of Apple products in Europe and the USA
      • Volume of data as compared to Manchester United
        – 32k tweets (with geo-location) about Apple as opposed to 1.4k for Manchester United
        – Interest in Apple is spread over the world, whereas for Manchester United it is limited to a few countries
  23. Contents: Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  24. Community Detection | Background
      • Community
        – A set of users having strong connections
        – Held together by some common interests of a large group of users
      • Similarity Measures
        – Users' Social Connection
        – User Mentions
        – Description Content Similarity
        – Tweet Content Similarity
        – Hash-Tag Similarity
      • Algorithms for community detection
        – Modularity Maximization Clustering
          • Spectrum Based
          • Greedy Bottom-up Fast Modularity Clustering
        – Spectral Clustering
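
Both modularity-based algorithms listed above search for a partition that maximises a modularity score. As a point of reference, here is a minimal sketch of the standard Newman modularity for an undirected, unweighted graph, Q = Σ_c [ e_c / m − (d_c / 2m)² ], where e_c is the number of edges inside community c, d_c the sum of degrees in c, and m the total number of edges. The function name and the edge-list input format are assumptions for illustration; the slides do not show the authors' implementation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph partition.

    edges     -- list of (u, v) pairs, each undirected edge listed once
    community -- dict mapping every node to a community label
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, k in degree.items():
        degree_sum[community[node]] += k
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)
```

For two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14, while putting every node in one community gives Q = 0.
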
  25. Community Detection | Analysis on small dataset
      • Experimental setup
        – 501 users from three different lists on Twitter
          • List id 4293757, 12932674 and 33222959
        – Tweets collected for 2 weeks
          • 26th October 2011 to 7th November 2011
      • Goal
        – Recover ground truth clusters
        – Evaluation based on NMI (normalized mutual information) and RI (Rand index)
      • Similarity Measures used
        – Users' social connections
        – User mentions
        – Users' Description content similarity
        – Users' Tweet content similarity
      • Algorithms used
        – Spectrum based Modularity Maximization
        – Spectral Algorithm
        – Normalized Laplacian Matrix
      Figure: Spy plot for Social connections with users ordered by the list to which they belong
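
The two evaluation scores compare a predicted clustering with the ground-truth lists. The sketch below shows one common way to compute them; it is not the authors' code. The Rand index is the fraction of user pairs on which both labelings agree (same cluster in both, or different in both). For NMI, normalising the mutual information by the geometric mean of the two entropies is an assumption, since the slides do not say which normalisation was used.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def rand_index(truth, pred):
    """Fraction of item pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def nmi(truth, pred):
    """Mutual information normalised by sqrt(H(truth) * H(pred)) (assumed variant)."""
    n = len(truth)
    pt, pp, joint = Counter(truth), Counter(pred), Counter(zip(truth, pred))
    mutual = sum(c / n * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    h_t = -sum(c / n * log(c / n) for c in pt.values())
    h_p = -sum(c / n * log(c / n) for c in pp.values())
    return mutual / sqrt(h_t * h_p) if h_t > 0 and h_p > 0 else 0.0
```

Both scores are invariant to relabeling the clusters: a prediction that swaps the names of two otherwise correct clusters still scores 1.0 on each.
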
  26. Analysis on small dataset | Modularity Based Clustering
      Figures: Ground truth clusters; Clusters for spectrum based modularity maximization clustering on User Connections; Clusters for spectrum based modularity maximization clustering on combined similarity measure; Similarity Matrix; Modularity Matrix

      Similarity measure     NMI      RI
      User Connections       0.3868   0.7174
      Mention                0.0130   0.3398
      Tweet content          0.0074   0.3371
      Description content    0.0780   0.5254
      All combined           0.2500   0.6175

      Analysis
      • Social connections are the most dominant measure for clustering this group of users.
      • Individual similarity measures other than social connections perform poorly.
      • Combined similarity measures are not as good as user connections alone.
        – Addition of low information content to user connections decreases accuracy.
      • User behavior is not consistent with the ground truth.
        – Post similar content
  27. Analysis on small dataset | Laplacian Based Clustering
      Figures: Ground truth clusters; Clusters for Normalized Laplacian based spectral clustering on combined similarity measure; Similarity Matrix; Symmetric Normalized Laplacian Matrix

      Similarity measure     NMI      RI
      User Connections       0.0077   0.3374
      Mention                0.0077   0.3374
      Tweet content          0.0077   0.3374
      Description content    0.0088   0.3381
      All combined           0.2931   0.6472

      Analysis
      • Clustering on social connections fails.
        – Laplacian based methods are sensitive to the presence of disconnected nodes.
      • Individual similarity measures (including social connections) fail to reconstruct any cluster information.
      • Combined similarity measures give results consistent with the modularity based approach.
        – Adding different information to the social connections makes the graph connected.
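
One way to see why disconnected nodes hurt the Laplacian-based method is to write out the symmetric normalized Laplacian, L_sym = I − D^(−1/2) W D^(−1/2): a user with no connections has degree zero, so D^(−1/2) is undefined for that row. The sketch below is illustrative only, with hypothetical function names and toy matrices; it treats "disconnected" as "zero degree", which is an assumption about what the slide means.

```python
from math import sqrt


def isolated_nodes(W):
    """Indices of nodes whose similarity to every other node is zero."""
    return [i for i, row in enumerate(W) if sum(row) == 0]


def normalized_laplacian(W):
    """L_sym = I - D^{-1/2} W D^{-1/2} for a symmetric similarity matrix W."""
    degrees = [sum(row) for row in W]
    if any(d == 0 for d in degrees):
        raise ValueError("zero-degree node: D^{-1/2} is undefined")
    n = len(W)
    return [
        [(1.0 if i == j else 0.0) - W[i][j] / sqrt(degrees[i] * degrees[j])
         for j in range(n)]
        for i in range(n)
    ]


def add_matrices(A, B):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Toy data: user 2 has no social connection but is linked to user 0 by a mention.
social = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mention = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
combined = add_matrices(social, mention)
```

In this toy example `isolated_nodes(social)` reports user 2, while the combined matrix has no isolated users, so `normalized_laplacian(combined)` is well defined.
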
  28. Community Detection | Analysis on large dataset
      • Experimental setup
        – 11273 users from the set of all users collected during data collection
        – Tweets collected for 4 weeks
          • 26th October 2011 to 22nd November 2011
      • Similarity Measures used
        – Users' social connections
        – User mentions
        – Users' Hash tag similarity
        – Users' Tweet content similarity
      • Algorithm used
        – Bottom up Fast Modularity Clustering
      Figure: Spy plot for Social connections
  29. Analysis on large dataset | Clustering on Social Connections
      Figures: Spy plot for social connections with users ordered by the clusters that they are present in; Visualization of clustering results
  30. Analysis on large dataset | Clustering on Social Connections
      Figures: Visualization of clustering results; Tag cloud 1: Frequent keywords in tweets from cluster 2; Tag cloud 2: Frequent keywords in tweets from cluster 6
      Analysis
      • The largest cluster (cluster 0) contains most of the users from the UK; they are mostly web developers/software developers and talk consistently about these terms.
      • Users in cluster 2 talk mostly about technologies like 'Google', 'server', 'SQL' etc., as shown in tag cloud 1.
      • Users in cluster 4 are from the same university in India, 'IIIT Hyderabad'.
      • Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club Juventus.
  31. Analysis on large dataset | Clustering on Combined matrices
      Figures: Results for only social connections; Results for data from week 1; Results for week 2; Results for week 3; Results for week 4
      Combined = Connection + Mention + Hashtag + Tweet
      Analysis
      • Using combined data leads to much finer clustering results as compared to clustering on social connections.
        – Additional information allowed making divisions between users who weren't tightly connected.
      • The division into smaller clusters is consistent across the different weeks.
        – Not due to some shifts of interests for a small period of time.
  32. Contents: Twitter at a glance • Modules • Data Collection • Visualization Results • Community Detection • Future Mentions on Twitter
  33. Future Mentions | Reasons for mentions on Twitter
      • Social Connections
        – Users can see the tweets of their friends on their wall and are therefore more likely to mention them in their future tweets.
        – Mentions should occur only if two users share a 'following' or 'being followed' relationship.
      • Past mentions
        – Users who have mentioned each other often in the past are more likely to mention each other in the future.
        – Past mentions mean that the users might have had a conversation on Twitter, which suggests that they share a good relationship.
      • Hash Tag Similarity
        – Hash tags are used to highlight important keywords in tweets and make it easy to find tweets or set trending topics on Twitter.
        – If two users discuss the same topic/keyword (hashtag), they are more likely to mention each other in the future.
      • Tweet Content Similarity
        – Users can mention others if they find their tweets interesting.
        – Highly similar tweet content means a higher probability of a mention event between two users.
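
Hash tag similarity between two users can be quantified in several ways; the slides do not state which one was used. As one plausible choice, the sketch below computes the cosine similarity between the users' hashtag count vectors. The function name and the list-of-hashtags input format are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two users' hashtag usage counts.

    tags_a, tags_b -- lists of hashtags used by each user (repeats allowed)
    """
    a, b = Counter(tags_a), Counter(tags_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    # Users with no hashtags at all are treated as having zero similarity.
    return dot / norm if norm else 0.0
```

The same function applies to tweet content similarity if the inputs are lists of words instead of hashtags, although a real pipeline would usually add term weighting first.
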
  34. Future Mentions | Correlation between features and future mentions
      Weighted combination = 2*Mention + 5*Hashtag + Tweet Similarity

      Correlation between features of week 1 as compared to mentions in week 2
      W1/W2       Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.0528     0.003     0.919      0.1656
      Hash Tag    0.0528    1          0.0031    0.4422     0.0565
      Tweet       0.003     0.0031     1         0.0134     0.0272
      Combined    0.919     0.4422     0.0134    1          0.1713
      Class       0.1656    0.0565     0.0272    0.1713     1

      Correlation between features of week 1, 2 and 3 as compared to mentions in week 4
      W123/W4     Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.1428     0.0219    0.8912     0.1906
      Hash Tag    0.1428    1          0.0193    0.5761     0.0861
      Tweet       0.0219    0.0193     1         0.0343     -0.006
      Combined    0.8912    0.5761     0.0343    1          0.1968
      Class       0.1906    0.0861     -0.006    0.1968     1

      Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1
      W1/W2       Mention   Hash Tag   Tweet     Combined   Class
      Mention     1         0.0343     -0.0062   0.7492     0.1616
      Hash Tag    0.0343    1          -0.0049   0.6876     0.2192
      Tweet       -0.0062   -0.0049    1         -0.0001    -0.0116
      Combined    0.7492    0.6876     -0.0001   1          0.2625
      Class       0.1616    0.2192     -0.0116   0.2625     1

      Analysis
      • Of the individual features, past user mentions correlate most strongly with mentions in the next week for the complete set of users.
      • The combined similarity measure provides some increase in the correlation as compared to past mentions.
      • We can improve accuracy by increasing the learning data.
      • Correlation for only one cluster is very good.
        – 1-week learning data for that cluster alone outperforms 3-week learning data for the complete set of users.
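
The "Combined" column above is the weighted sum stated on the slide, and each table entry is a correlation between two per-user-pair quantities. The sketch below shows how such an entry could be computed. It assumes Pearson correlation, which the slides do not confirm, and the function names are hypothetical; only the weights 2, 5 and 1 come from the slide.

```python
from math import sqrt


def combined_score(mention, hashtag, tweet):
    """Weighted combination from the slide: 2*Mention + 5*Hashtag + Tweet Similarity."""
    return 2 * mention + 5 * hashtag + tweet


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

To reproduce a "Combined vs. Class" entry, one would evaluate `combined_score` for every user pair using features from the learning weeks, build a 0/1 "Class" list marking whether the pair mentioned each other in the target week, and pass both lists to `pearson`.
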
  35. Future Work
      • Landmark detection
        – Tweets collected from different cities can be used to identify landmarks/places of interest in these cities.
      • Identify future events
        – Algorithms can be developed to identify future events with the help of tweets collected for different topics.
      • Combined similarity measure for community detection
        – Different weighted combinations of similarity measures like mentions, tweet, hashtag, description and social connection etc. can be used to improve clustering results.
      • Future Mentions
        – Causes of mentions like past mentions, hashtag similarity etc. can be used to predict future mentions.