Data Mining on Twitter

With the tremendous growth of social networks, the amount of new data created every minute on these sites has grown as well. The notion of community in social networking has attracted a lot of attention. Studying Twitter is useful for understanding how people use new communication technologies to form social connections and maintain existing ones. We analysed how geo-tagged tweets can be used to identify useful user features and behavior, as well as landmarks and places of interest. We also analysed several clustering algorithms and proposed different similarity measures to detect communities.

Published in: Education, Technology

  • Only about 0.5-1% of tweets contain a GPS location, but this is still enough because there are millions of tweets.
  • The organisation into groups should be such that similar objects belong to the same cluster whereas there is little or no similarity between objects that belong to different clusters.
  • Lists are a way of grouping users on Twitter. Users can follow a list to obtain updates from a group of users. The lists are @prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting, respectively.
  • One reason for the poor performance of the similarity measures based on tweets, descriptions and mentions may be that the users in these groups are similar and generally post similar content on the web. This also means that user behaviour does not seem to be consistent with the ground truth data. The lists are @prolificd/met, @rahulkalra_e/entrepreneurs and @8hasin/mildly-interesting.
  • Note that there is no special ordering enforced on the users here so we cannot immediately see some cluster structure in the network.
  • We can now observe a community structure in the graph, i.e. users have more connections with other users within their community than with users in other communities. Clusters are ordered by the number of users present in each cluster. Red is the largest cluster, followed by green, blue, purple and cyan. This is just the layout; the colors define the distribution of users into clusters. In fact, the top 4 communities in the graph cover more than 93% of the total nodes.
  • Uses connections, mentions, hashtags and tweet content. Weekly data was used.
  • If two users discuss the same topic/keyword (hashtag), they are more likely to see each other's tweets and therefore more likely to share a mention relationship in the future. Tweet Content Similarity: here we implicitly assume that users also post about things they are interested in.

    1. Data Mining and Analysis on Twitter
        Company Proprietary and Confidential. Copyright Info Goes Here Just Like This
    2. Professor
        • Prof. Pascal Frossard
        Project Supervisor
        • Xiaowen Dong
        Students
        • Pulkit Goyal (twitter.com/pulkit110)
        • Sapan Diwakar (twitter.com/diwakarsapan)
    3. Contents
        • Objective
        • Twitter at a glance
        • Modules
        • Data Collection
        • Visualization Results
        • Community Detection
        • Future Mentions on Twitter
    4. Objective
        • A large amount of new data is created every minute on social networking sites.
          – Difficult to obtain and interpret
          – Collect data to allow for further analysis
        • Identify online communities of users on Twitter
        • Explore reasons for user interactions as a step towards prediction of future interactions
    6. Twitter at a glance
        • Micro-blogging platform
        • Since March 2006
        • Status Update
        • 300 Million users (June, 2011)
        • Giant Chat room
        • Instant Messaging
    7. Lingo
        • Tweet - A message of 140 characters or less
        • Retweet - Repeat a tweet from somebody else
        • Hashtag - Tweet that includes a #term (tracking)
        • Reply/Mention - Mentioning another user in a tweet
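The vocabulary on the Lingo slide maps onto simple text patterns. The following is a minimal sketch, assuming plain tweet text as input; the function name `parse_tweet` and the regular expressions are illustrative and are not taken from the slides. The patterns are deliberately simplified (for example, an e-mail address would also produce a mention).

```python
import re

# Simplified patterns: "#term" marks a hashtag, "@name" marks a mention.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def parse_tweet(text):
    """Extract lower-cased hashtags and mentions from one tweet."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        "mentions": [name.lower() for name in MENTION.findall(text)],
        # The slide defines a tweet as a message of 140 characters or less.
        "within_limit": len(text) <= 140,
    }

parsed = parse_tweet("@pulkit110 nice talk on #DataMining #twitter")
print(parsed["hashtags"], parsed["mentions"])
```

Lower-casing keeps `#MUFC` and `#mufc` from being counted as different hashtags when users are compared later.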
    9. Modules
        • Data Collection
          – Set up a system to collect data based on some constraints
        • Visualization
          – Build some visualizations based on the collected data
          – Analyze the results
        • Community Detection
          – Identify communities of users on Twitter based on several different similarity measures
        • Analysis of Future Mentions
          – Identify factors for future mentions between users on Twitter
    11. Data Collection | Data based on location
        • Collect data based on locations:
          – London
          – New York
          – Paris
          – San Francisco
          – Mumbai
        Objectives:
        • Model the spread of interests
          – Time
          – Location
          – Rate of information flow
        • Identify future events
        • Identify landmarks
        • Model relationships among users
          – Friendship/Social Connections
          – Common Interests
    12. Data Collection | Data based on topics
        • Collect data based on keywords
          – Apple (Tech)
          – Manchester United (Soccer)
        Objectives:
        • Model the spread of interests
          – Time
          – Location
          – Rate of information flow
        • Identify future events
        • Identify landmarks
        • Model relationships among users
          – Friendship/Social Connections
          – Common Interests
    13. Data Collection | Data from a group of users
        • Collect tweets from a "group of users"
          – Group of around 25k users
          – Created by a specified user
          – Explicitly in-reply-to a status created by a specified user (pressed reply button)
        Objectives:
        • Model the spread of interests
          – Time
          – Location
          – Rate of information flow
        • Identify future events
        • Identify landmarks
        • Model relationships among users
          – Friendship/Social Connections
          – Common Interests
        [Figure: Overview of links we use to collect users]
    15. Visualization Results | Streets of London
        • Setup
          – Geo-tagged tweets for one week (16 to 22 August 2011)
          – 111,206 tweets
    16. Visualization Results | Streets of London | 1 week
        • Analysis
          – High density of tweets from famous places/tourist attractions
          – Clustering of tweets
          – Content of tweets can be used to predict the place
          – More tweets along the roads/streets
        [Map labels: National Gallery, London Waterloo Rail, The Big Ben, London Victoria Rail, Oval Cricket Ground, Greenwich]
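The density observation on the London slides can be approximated by binning geo-tagged tweets into a regular latitude/longitude grid and counting tweets per cell. This is only a sketch under assumptions: the cell size of 0.01 degrees, the function name `densest_cells` and the sample coordinates are illustrative, and the slides do not say how the density map was actually produced.

```python
import math
from collections import Counter

def densest_cells(points, cell_size_deg=0.01, top=3):
    """Count (lat, lon) points per grid cell and return the busiest cells.

    Each cell is identified by the integer pair
    (floor(lat / cell_size_deg), floor(lon / cell_size_deg)).
    """
    counts = Counter(
        (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        for lat, lon in points
    )
    return counts.most_common(top)

# Illustrative coordinates in central London (not data from the study).
sample = [(51.5007, -0.1246), (51.5009, -0.1244), (51.5033, -0.1195)]
print(densest_cells(sample, top=1))
```

Cells with unusually high counts are candidates for landmarks or places of interest; aggregating by administrative wards, as on the later slide, would replace the grid key with a ward lookup.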
    18. Tweets in London | Aggregated by wards
        [Map legend: No. of tweets in increasing order]
    19. Tweets about a topic | Manchester United
        • Setup
          – Data for two weeks (27 Oct to 8 Nov 2011)
        • Keywords
          – "manchesterunited", "manchester united", "manchester utd", "man united", "manutd", "man utd", "manu", "mufc"
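Keyword-based collection can be illustrated with a local filter over tweet text. The sketch below uses the Manchester United keyword list from the slide; the whole-word matching rule and the name `matches_topic` are assumptions made for illustration, and the slides do not describe how the matching was actually done during collection.

```python
import re

KEYWORDS = [
    "manchesterunited", "manchester united", "manchester utd",
    "man united", "manutd", "man utd", "manu", "mufc",
]

# Longest keywords first so multi-word phrases are tried before short ones.
# The lookarounds require a non-word character (or string edge) on both sides,
# so "manu" does not fire inside "manual".
_ALTERNATIVES = "|".join(
    re.escape(keyword) for keyword in sorted(KEYWORDS, key=len, reverse=True)
)
TOPIC_PATTERN = re.compile(r"(?<!\w)(?:" + _ALTERNATIVES + r")(?!\w)", re.IGNORECASE)

def matches_topic(text):
    """Return True if the tweet text contains any keyword as a whole term."""
    return TOPIC_PATTERN.search(text) is not None

print(matches_topic("Great win for #MUFC tonight"))
print(matches_topic("Read the manual first"))
```

Short keywords such as "manu" are the main source of false positives, which is why the sketch matches whole terms rather than substrings.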
    20. Visualization Results | Tweets About Manchester United
        Analysis
        • More tweets in and around Europe
          – Manchester United plays in the English Premier League and has its home ground in Manchester
        • High number of tweets from countries whose players play for Manchester United
        • High popularity of Manchester United in Indonesia and Malaysia
    21. Tweets about a topic | Apple
        • Setup
          – Data for two weeks (27 Oct to 8 Nov 2011)
        • Keywords
          – "apple", "mac", "macbook", "macbookair", "macbookpro", "os x", "osx", "osxlion", "ipod", "ipodshuffle", "ipodnano", "ipodclassic", "ipodtouch", "itunes", "iphone", "iphone3", "iphone3s", "iphone4", "iphone4s", "iphone5", "ios", "ios4", "ios5", "ipad", "ipad2", "ipad3"
    22. Visualization Results | Tweets About Apple
        Analysis
        • High volume of tweets in USA and Europe
          – Popularity of Apple products in Europe and USA
        • Volume of data as compared to Manchester United
          – 32k tweets (with geo-location) about Apple as opposed to 1.4k for Manchester United
          – Interest in Apple is spread over the world, whereas for Manchester United it is limited to a few countries
    24. Community Detection | Background
        • Community
          – A set of users having strong connections.
          – Held together by some common interests of a large group of users.
        • Similarity Measures
          – Users' Social Connection
          – User Mentions
          – Description Content Similarity
          – Tweet Content Similarity
          – Hash-Tag Similarity
        • Algorithms for community detection
          – Modularity Maximization Clustering
            • Spectrum Based
            • Greedy Bottom-up Fast Modularity Clustering
          – Spectral Clustering
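As an illustration of a content-based similarity measure such as tweet or description content similarity, the sketch below computes the cosine similarity between bag-of-words vectors. The choice of cosine similarity, whitespace tokenization and the name `content_similarity` are assumptions; the slides name the measures but do not define them.

```python
import math
from collections import Counter

def content_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts (0.0 to 1.0)."""
    words_a = Counter(text_a.lower().split())
    words_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(words_a[word] * words_b[word] for word in words_a.keys() & words_b.keys())
    norm = math.sqrt(sum(c * c for c in words_a.values())) * math.sqrt(
        sum(c * c for c in words_b.values())
    )
    return dot / norm if norm else 0.0

print(content_similarity("new iphone release", "iphone release date"))
```

Evaluating this for every pair of users (for example on the concatenation of each user's tweets) yields a symmetric similarity matrix that a clustering algorithm can consume.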
    25. Community Detection | Analysis on small dataset
        • Experimental setup
          – 501 users from three different lists on Twitter
            • List ids 4293757, 12932674 and 33222959
          – Tweets collected for 2 weeks
            • 26th October, 2011 to 7th November 2011
        • Goal
          – Recover ground truth clusters
          – Evaluation based on NMI and RI
        • Similarity Measures used
          – Users' social connections
          – User mentions
          – Users' Description content similarity
          – Users' Tweet content similarity
        • Algorithms used
          – Spectrum based Modularity Maximization
          – Spectral Algorithm
          – Normalized Laplacian Matrix
        [Figure: Spy plot for social connections with users ordered by the list to which they belong]
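The two evaluation scores named on this slide, NMI (normalized mutual information) and RI (Rand index), both compare a predicted clustering with the ground truth labels. The following is a minimal sketch; the normalization of NMI by the geometric mean of the two entropies is an assumption, since several normalizations are in use and the slides do not say which one was applied.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(truth, predicted):
    """Fraction of item pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j]) for i, j in pairs
    )
    return agree / len(pairs)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(truth, predicted):
    """Mutual information divided by sqrt(H(truth) * H(predicted))."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(predicted)
    joint = Counter(zip(truth, predicted))
    mutual = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t, h_p = entropy(truth), entropy(predicted)
    if h_t == 0 or h_p == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_t == h_p else 0.0
    return mutual / math.sqrt(h_t * h_p)

print(rand_index([0, 0, 1, 1], [5, 5, 3, 3]))
print(normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Both scores ignore the actual label names, so a clustering that recovers the lists under different cluster ids still scores 1.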
    26. Analysis on small dataset | Modularity Based Clustering
        [Figure panels: Clusters for spectrum based modularity maximization clustering on User Connections; Clusters for spectrum based modularity maximization clustering on combined similarity measure; Ground truth clusters]
        [Diagram labels: Similarity Matrix, Modularity Matrix]

        Similarity measure      NMI      RI
        User Connections        0.3868   0.7174
        Mention                 0.0130   0.3398
        Tweet content           0.0074   0.3371
        Description content     0.0780   0.5254
        All combined            0.2500   0.6175

        Analysis
        • Social connections most dominating for clustering this group of users.
        • Individual similarity measures perform inaccurately
        • Combined similarity measures not as good as user connections alone
          – Addition of low information content to user connections decreases accuracy.
        • User behavior not consistent with ground truth.
          – Post similar content
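One common reading of "spectrum based modularity maximization" is the leading-eigenvector approach: build the modularity matrix B with entries B_ij = A_ij - k_i k_j / (2m) and split the nodes by the sign of the eigenvector with the largest eigenvalue. The sketch below shows a single bisection under that assumption; the slides do not confirm that this exact variant was used. The eigenvector is found with a shifted power iteration to keep the example dependency-free, and the graph (two triangles joined by one edge) is made up for illustration.

```python
import math

def modularity_matrix(adj):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected graph with edges."""
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)
    n = len(adj)
    return [
        [adj[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration on (matrix + shift * I), shift chosen so all eigenvalues are >= 0."""
    n = len(matrix)
    shift = max(sum(abs(value) for value in row) for row in matrix)
    vec = [float(i + 1) for i in range(n)]  # fixed, non-symmetric start vector
    for _ in range(iterations):
        nxt = [
            sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def bisect_by_modularity(adj):
    """Split nodes into two groups by the sign of the leading eigenvector of B."""
    vec = leading_eigenvector(modularity_matrix(adj))
    positive = frozenset(i for i, x in enumerate(vec) if x > 0)
    negative = frozenset(i for i, x in enumerate(vec) if x <= 0)
    return {positive, negative}

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
TWO_TRIANGLES = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(bisect_by_modularity(TWO_TRIANGLES))
```

For more than two communities the bisection is usually applied repeatedly to each group, stopping when a split no longer increases modularity; that refinement is omitted here.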
    27. Analysis on small dataset | Laplacian Based Clustering
        [Figure panels: Clusters for Normalized Laplacian based spectral clustering on combined similarity measure; Ground truth clusters]
        [Diagram labels: Similarity Matrix, Symmetric Normalized Laplacian Matrix]

        Similarity measure      NMI      RI
        User Connections        0.0077   0.3374
        Mention                 0.0077   0.3374
        Tweet content           0.0077   0.3374
        Description content     0.0088   0.3381
        All combined            0.2931   0.6472

        Analysis
        • Clustering on social connections fails.
          – Laplacian based methods are sensitive to the presence of disconnected nodes.
        • Individual similarity measures (including social connections) fail to reconstruct any cluster information.
        • Combined similarity measures give results consistent with the modularity based approach.
          – Addition of different information to the social connections makes it connected.
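The sensitivity to disconnected nodes follows from the definition of the symmetric normalized Laplacian. The block below states the standard definition, with A denoting the similarity (adjacency) matrix; the symbols are conventional notation and do not appear on the slides.

```latex
% Degree matrix and symmetric normalized Laplacian of a similarity matrix A
\[
  d_i = \sum_{j} A_{ij}, \qquad
  D = \operatorname{diag}(d_1, \dots, d_n), \qquad
  L_{\mathrm{sym}} = I - D^{-1/2} A \, D^{-1/2}.
\]
% An isolated node has d_i = 0, so d_i^{-1/2} is undefined.
% More generally, the multiplicity of the eigenvalue 0 of L_sym equals the
% number of connected components of the graph.
```

When the graph has many small components, the eigenvectors for the smallest eigenvalues mostly indicate those components rather than the communities of interest, which is consistent with the near-identical scores of the individual measures in the table. Adding the other similarity measures links previously isolated users to the rest of the graph.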
    28. Community Detection | Analysis on large dataset
        • Experimental setup
          – 11,273 users from the set of all users collected during data collection
          – Tweets collected for 4 weeks
            • 26th October, 2011 to 22nd November 2011
        • Similarity Measures used
          – Users' social connections
          – User mentions
          – Users' Hash tag similarity
          – Users' Tweet content similarity
        • Algorithm used
          – Bottom up Fast Modularity Clustering
        [Figure: Spy plot for social connections]
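Bottom-up modularity clustering starts with every user in their own community and repeatedly merges the pair of communities whose merge increases modularity Q the most, stopping when no merge helps. The sketch below is a deliberately naive version that recomputes Q from scratch for every candidate merge; it illustrates the greedy idea only. Fast variants maintain incremental gain tables instead, and the slides do not specify which implementation was used. The example graph is made up.

```python
from itertools import combinations

def modularity(adj, communities):
    """Q = sum over communities of (internal weight / 2m) - (total degree / 2m)^2."""
    two_m = sum(sum(row) for row in adj)
    q = 0.0
    for community in communities:
        internal = sum(adj[i][j] for i in community for j in community)
        degree = sum(sum(adj[i]) for i in community)
        q += internal / two_m - (degree / two_m) ** 2
    return q

def greedy_modularity(adj):
    """Merge communities greedily while modularity keeps increasing."""
    communities = [frozenset([node]) for node in range(len(adj))]
    best_q = modularity(adj, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (a, b)]
            merged.append(communities[a] | communities[b])
            candidates.append((modularity(adj, merged), merged))
        q, merged = max(candidates, key=lambda item: item[0])
        if q <= best_q:
            break  # no merge improves modularity any further
        best_q, communities = q, merged
    return set(communities), best_q

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
TWO_TRIANGLES = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
found, score = greedy_modularity(TWO_TRIANGLES)
print(found, round(score, 4))
```

The naive recomputation is far too slow for a graph of the size described on the slide; it is meant only to show why the procedure needs no preset number of clusters.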
    29. Analysis on large dataset | Clustering on Social Connections
        [Figure panels: Spy plot for social connections with users ordered by the clusters that they are present in; Visualization of clustering results]
    30. Analysis on large dataset | Clustering on Social Connections
        [Figure panels: Visualization of clustering results; Tag cloud 1: Frequent keywords in tweets from cluster 2; Tag cloud 2: Frequent keywords in tweets from cluster 6]
        Analysis
        • The largest cluster (i.e. cluster 0) contains most of the users from the UK; they are mostly web developers/software developers and talk consistently about these terms.
        • Users in cluster 2 talk mostly about technologies like 'Google', 'server', 'SQL' etc., as shown in tag cloud 1.
        • Users in cluster 4 are from the same university in India, 'IIIT Hyderabad'.
        • Users in cluster 6 are football fans, as shown in tag cloud 2. Most of them support the Italian club Juventus.
    31. Analysis on large dataset | Clustering on Combined matrices
        [Figure panels: Results for only social connections; Results for data from week 1; Results for week 2; Results for week 3; Results for week 4]
        Combined = Connection + Mention + Hashtag + Tweet
        Analysis
        • Using combined data leads to much finer clustering results as compared to clustering on social connections.
          – Additional information allowed making a division between users who weren't tightly connected.
        • Division into smaller clusters is consistent across different weeks
          – Not due to some shift of interests for a small period of time.
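The combined measure on this slide is a sum of the four per-measure matrices. A minimal sketch of such a combination is shown below; rescaling each matrix to the range [0, 1] before summing and the optional per-measure weights are assumptions added so that measures on different scales (for example raw mention counts versus cosine similarities) contribute comparably. The slides do not say whether any rescaling was applied.

```python
def scale_to_unit(matrix):
    """Divide every entry by the largest entry so values fall in [0, 1]."""
    peak = max((value for row in matrix for value in row), default=0)
    return [[value / peak if peak else 0.0 for value in row] for row in matrix]

def combine(matrices, weights=None):
    """Weighted element-wise sum of rescaled similarity matrices of equal size."""
    weights = weights or [1.0] * len(matrices)
    scaled = [scale_to_unit(matrix) for matrix in matrices]
    size = len(scaled[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, scaled)) for j in range(size)]
        for i in range(size)
    ]

# Made-up 2-user example: one follow link and four mentions between the users.
connection = [[0, 1], [1, 0]]
mention = [[0, 4], [4, 0]]
print(combine([connection, mention]))
```

The resulting matrix can be passed to the same clustering routine that was used for social connections alone.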
    33. Future Mentions | Reasons for mentions on Twitter
        • Social Connections
          – Users can see the tweets of their friends on their wall and are therefore more likely to mention them in their future tweets.
          – Mentions should occur only if two users share a 'following' or 'being followed' relationship.
        • Past mentions
          – Users who have mentioned each other often in the past are more likely to mention each other in the future.
          – Past mentions mean that the users might have had a conversation on Twitter, which means that they share a good relationship.
        • Hash Tag Similarity
          – Hash tags are used to highlight important keywords in tweets and make it easy to find tweets or set trending topics on Twitter.
          – If two users discuss the same topic/keyword (hashtag), they are more likely to mention each other in the future.
        • Tweet Content Similarity
          – Users can mention others if they find their tweets interesting.
          – Highly similar tweet content means that there is a higher probability of a mention event between two users.
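Hashtag similarity between two users can be illustrated with a set overlap score. The sketch below uses the Jaccard index over the sets of hashtags each user posted; this particular measure and the name `hashtag_similarity` are assumptions, since the slides do not define how hashtag similarity was computed.

```python
def hashtag_similarity(tags_a, tags_b):
    """Jaccard index: shared hashtags divided by all distinct hashtags used."""
    set_a = {tag.lower() for tag in tags_a}
    set_b = {tag.lower() for tag in tags_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Made-up hashtag lists for two users.
print(hashtag_similarity(["mufc", "epl"], ["MUFC", "ios5"]))
```

A score of 0 means the users share no hashtags and a score of 1 means they used exactly the same set.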
    34. Future Mentions | Correlation between features and future mentions
        Weighted combination = 2*Mention + 5*Hashtag + Tweet Similarity

        Correlation between features of week 1 as compared to mentions in week 2
        W1/W2       Mention   Hash Tag   Tweet     Combined   Class
        Mention     1         0.0528     0.003     0.919      0.1656
        Hash Tag    0.0528    1          0.0031    0.4422     0.0565
        Tweet       0.003     0.0031     1         0.0134     0.0272
        Combined    0.919     0.4422     0.0134    1          0.1713
        Class       0.1656    0.0565     0.0272    0.1713     1

        Correlation between features of week 1, 2 and 3 as compared to mentions in week 4
        W123/W4     Mention   Hash Tag   Tweet     Combined   Class
        Mention     1         0.1428     0.0219    0.8912     0.1906
        Hash Tag    0.1428    1          0.0193    0.5761     0.0861
        Tweet       0.0219    0.0193     1         0.0343     -0.006
        Combined    0.8912    0.5761     0.0343    1          0.1968
        Class       0.1906    0.0861     -0.006    0.1968     1

        Correlation between features of week 1 as compared to mentions in week 2, only for users of cluster 1
        W1/W2       Mention   Hash Tag   Tweet     Combined   Class
        Mention     1         0.0343     -0.0062   0.7492     0.1616
        Hash Tag    0.0343    1          -0.0049   0.6876     0.2192
        Tweet       -0.0062   -0.0049    1         -0.0001    -0.0116
        Combined    0.7492    0.6876     -0.0001   1          0.2625
        Class       0.1616    0.2192     -0.0116   0.2625     1

        Analysis
        • Past user mentions have a high correlation with mentions in the next week.
        • The combined similarity measure provides some increase in the correlation as compared to past mentions.
        • We can improve accuracy by increasing the learning data.
        • Correlation for only one cluster is very good.
          – Only 1 week of learning data outperforms 3 weeks of learning data for the complete set of users.
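The tables report pairwise correlations between per-pair features and a "Class" column. The sketch below shows how such a number can be computed, using the weighted combination stated on the slide (2*Mention + 5*Hashtag + Tweet Similarity) and a Pearson correlation coefficient. Two points are assumptions: that the slide's correlations are Pearson coefficients, and that "Class" is a 0/1 indicator of whether a mention occurred in the following week. The sample values are made up.

```python
import math

def combined_feature(mention, hashtag, tweet):
    """Weighted combination from the slide: 2*Mention + 5*Hashtag + Tweet Similarity."""
    return 2 * mention + 5 * hashtag + tweet

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Made-up features for four user pairs: (past mentions, hashtag sim, tweet sim).
pairs = [(3, 0.5, 0.2), (0, 0.0, 0.1), (1, 0.4, 0.0), (0, 0.1, 0.3)]
mentioned_next_week = [1, 0, 1, 0]  # hypothetical class labels

combined = [combined_feature(*features) for features in pairs]
print(pearson(combined, mentioned_next_week))
```

Repeating the calculation for each feature column against every other column would fill in a full correlation matrix like the ones on the slide.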
    35. Future Work
        • Landmark detection
          – Tweets collected from different cities can be used to identify landmarks/places of interest in these cities.
        • Identify future events
          – Algorithms can be developed to identify future events with the help of tweets collected for different topics.
        • Combined similarity measure for community detection
          – Different weighted combinations of similarity measures like mentions, tweet, hashtag, description and social connection etc. can be used to improve clustering results.
        • Future Mentions
          – Causes of mentions like past mentions, hashtag similarity etc. can be used to predict future mentions.
