Zen & the art of data mining

A talk I gave at Old Dominion University to new students from PES University in Bangalore

Published in: Science, Technology

Zen & the art of data mining

  1. 1. Zen & the Art of Data Mining: Social Media Data Collection and the Path to Modeling & Predicting User Intention • Hany SalahEldeen Khalil (hany@cs.odu.edu) • Web Science & Digital Libraries Lab, Department of Computer Science, Old Dominion University • 07-08-14
  2. 2. Before we start, here is a little bit about me…
  3. 3. Hany SalahEldeen • Education: • PhD Candidate, Web Science and Digital Libraries Group • Master’s Degree in Computer Vision and Artificial Intelligence, Universitat Autonoma de Barcelona • Bachelor’s Degree in Computer Systems Engineering, University of Alexandria
  4. 4. Research & Technical Experience • Microsoft Research Cairo • Google GmbH Zurich • Microsoft Inc. Mountain View • National University of Singapore
  5. 5. So what am I investigating? Detecting, Modeling, & Predicting User Temporal Intention in Social Media • Web Mining • Pattern Analysis • Machine Learning • Human Behavioral Analysis • Social Media Analysis
  6. 6. Publications • CIKM 2014 Conference (Shanghai): 1 first-author paper, 1 second-author paper • DL 2014 Conference (London): 1 third-author paper • TPDL 2013 Conference (Malta): 1 first-author paper
  7. 7. Publications • JCDL 2013 Conference (Indianapolis): 1 first-author paper • WWW 2013 Conference (Rio de Janeiro): 1 first-author paper • TPDL 2012 Conference (Cyprus): 1 first-author paper
  8. 8. Besides the perks of travelling, our research has been popular…
  9. 9. MIT Technology Review Hany SalahEldeen 9
  10. 10. MIT Technology Review Hany SalahEldeen 10
  11. 11. MIT Technology Review Hany SalahEldeen 11
  12. 12. Mashable Hany SalahEldeen 12
  13. 13. Popular Mechanics Hany SalahEldeen 13
  14. 14. BBC Hany SalahEldeen 14
  15. 15. The Virginian Pilot Hany SalahEldeen 15
  16. 16. Our Research’s Popularity • Local newspaper: The Virginian Pilot • 4 x MIT Technology Review • BBC • Mashable • The Atlantic • Yahoo News • Articles in > 11 different languages • We have been called: • The Internet Archeologists • Web Time Travelers
  17. 17. My goal: Detect, model, and predict user intention in social media Hany SalahEldeen 17
  18. 18. Ok hold on, let’s go back to the basics… Hany SalahEldeen 18
  19. 19. Web 2.0 Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.* * http://en.wikipedia.org/wiki/Web_2.0 Hany SalahEldeen 19
  20. 20. Web 2.0 • Yes, Web 2.0 is about “user-generated content” • But explicit content contributed by users is just 20% of what “matters” • 80% is in the implicitly contributed data* (*Toby Segaran, Programming Collective Intelligence, 2007)
  21. 21. Systems & Web 2.0 • Google: Utilizes PageRank which is a technique for extracting intelligence from the link structure • Flickr: Utilizes “interestingness” algorithm • Amazon: Utilizes “people who bought this product also bought” feature • Pandora: Utilizes “similar artist radio” • eBay: Utilizes “reputation system” Hany SalahEldeen 21
  22. 22. So why do we even care about all that? Hany SalahEldeen 22
  23. 23. Power to the People! Hany SalahEldeen 23
  24. 24. Power to the People! • Because analyzing a huge dataset of millions of users will yield a lot of potential insights into: • User Experience • Marketing • Personal Taste • Human Behavior in general. Hany SalahEldeen 24
  25. 25. So what is Data Mining? Hany SalahEldeen 25
  26. 26. Data Mining • Definition: It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. http://en.wikipedia.org/wiki/Data_mining Hany SalahEldeen 26
  27. 27. Back to my goal: Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 27
  28. 28. Let’s break down the title first… Detecting, Modeling, & Predicting User Temporal Intention in Social Media
  29. 29. Let’s break down the title first… Detecting, Modeling, & Predicting User Temporal Intention in Social Media
  30. 30. Scenario 1: Jenny reading Jeff’s tweets Hany SalahEldeen 30
  31. 31. Michael Jackson Dies • Snapshot on: June 25th 2009 http://web.archive.org/web/20090625232522/http://www.cnn.com/
  32. 32. Jeff tweets about it… • Published on: June 25th 2009 https://twitter.com/mdnitehk/status/2333993907
  33. 33. Jenny is off the grid… • Jeff’s friend Jenny was on vacation in Hawaii for a month
  34. 34. Jenny starts catching up a month later • When she came back she checked Jeff’s tweets and was shocked! • Read on: July 26th 2009 https://twitter.com/mdnitehk/status/2333993907
  35. 35. Jenny follows the link on July 26th • She quickly clicked on the link in the tweet… • CNN page on: July 26th 2009 http://web.archive.org/web/20090726234411/http://www.cnn.com/
  36. 36. Jenny is confused! • Implication: Jenny thought Jeff was making a joke about her favorite singer and she got mad at him. • Problem: The tweet and the resource the tweet links to have become unsynchronized.
  37. 37. Scenario 2: The Egyptian Revolution Hany SalahEldeen 37
  38. 38. The Egyptian Revolution Jan 2011 Hany SalahEldeen 38
  39. 39. Reading about it on Storify.com a year later, in March 2012 • http://storify.com/maq4sure/egypts-revolution
  40. 40. I noticed some shared images are missing • http://storify.com/maq4sure/egypts-revolution
  41. 41. Some tweets are still intact • https://twitter.com/miss_amy_qb/status/32477898581483521
  42. 42. …and some lost their meaning with the disappearance of the images (Missing?) • https://twitter.com/aishes/status/32485352102952960 • https://twitter.com/omar_chaaban/status/32203697597452289
  43. 43. The tweet remains but the shared image disappeared… • http://yfrog.com/h5923xrvbqqvgzj
  44. 44. Cairo… we have a problem! • Implication: The reader cannot understand what the author of the tweet meant because the image is not available. • Problem: The post is available but the linked resource (image) is completely missing.
  45. 45. …back to the title Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 45
  46. 46. …back to the title Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 46
  47. 47. The Anatomy of a Tweet
  48. 48. The Anatomy of a Tweet • Author’s username • Other user mention • Tweet body • Hashtag • Shortened URL to resource • Publishing timestamp • Social post • Shared resource • Interaction options
  49. 49. 3 URIs = 3 Chances to fail • http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/ • https://twitter.com/KentEiler/status/195535749754527745
  50. 50. Explanation in MJ’s example [Figure: timeline of time points t1 … tn]
  51. 51. If I click on a link in a tweet, which version should I get? t_tweet or t_click?
  52. 52. Sometimes you want a previous version • The Correct Temporal Intention: CNN.com at the closest time to the tweet, 25th June 2009 ~ 7pm
  53. 53. Sometimes you want the current version • The Correct Temporal Intention: in this case, the current state of the press releases page
  54. 54. Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?
  55. 55. People rely on social media for the most up-to-date information
  56. 56. So if you are posting a tweet about your cat… …No one cares!
  57. 57. Regardless of how cool your cat is!
  58. 58. All tweets are equal… …but some are more equal than others
  59. 59. Preliminary Research Questions: 1. How long would these last? 2. And if lost, are they archived? 3. Is this what the author intended?
  60. 60. Historical Integrity • Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
  61. 61. The life cycle of a social post
  62. 62. The life cycle of a social post • tweets
  63. 63. The life cycle of a social post • tweets • Links to
  64. 64. The life cycle of a social post • tweets • Links to • What the reader receives: Same state the author intended
  65. 65. The life cycle of a social post • tweets • Links to • What the reader receives: Same state the author intended • The resource has disappeared
  66. 66. The life cycle of a social post • tweets • Links to • What the reader receives: Same state the author intended • The resource has disappeared • The resource has changed
  67. 67. The Resource’s Possibilities • What the reader receives: Same state the author intended • The resource has disappeared • The resource has changed: a bigger problem since the reader might not know.
  68. 68. We could lose the linked resource
  69. 69. Or the resource could change • The attack on the embassy was in February 2013
  70. 70. Why do we want to detect the Author’s Temporal Intention? • Match and convey the intended information. • Notify the author that the resource is prone to change, and the reader that the resource has changed. • Preserve the resource by pushing snapshots into the archive automatically. • Retrieve the closest archived version to maintain consistency.
  71. 71. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  72. 72. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  73. 73. Estimating Web Archiving Coverage • Goal: Estimate how much of the public web is present in the public archives and how many copies are available. • Action: Getting 4 different datasets from 4 different sources: search engine indices, Bit.ly, DMOZ, Delicious. • Results: 16%-79% archived, depending on the source. • Publications: How much of the web is archived? JCDL '11 • http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  74. 74. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  75. 75. The timeline of the resource • http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
  76. 76. Timestamps Accumulation
  77. 77. Actual vs. Estimated Dates • Successfully estimated the creation date for >75% of the resources • For >33% we estimated the exact date
  78. 78. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  79. 79. Six Socially Significant Events • From Twitter, websites, books: The Egyptian revolution • From Twitter only: • Stanford’s SNAP dataset: Iranian elections, H1N1 virus outbreak, Michael Jackson’s death, Obama’s Nobel Peace Prize • Twitter API: The Syrian uprising
  80. 80. Resources Missing & Archived
  81. 81. Revisiting after a year… • There is a nearly linear relationship between the amount missing from the web and time. • After 1 year ~11% is gone, and 0.02% is lost every day
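
The two figures on the slide above can be combined into a rough linear estimate. This is a minimal sketch, not the fitted model from the talk: the function name is hypothetical, and it assumes the 0.02% daily loss applies to each day after the first year.

```python
def estimated_percent_missing(days_since_shared: int) -> float:
    """Rough linear estimate of the percentage of shared resources missing from the live web.

    Assumption (not stated on the slide): the 0.02% daily loss is added
    on top of the ~11% that is already gone after the first year.
    """
    first_year_loss = 11.0  # percent gone after one year (from the slide)
    daily_loss = 0.02       # percent lost per additional day (from the slide)
    if days_since_shared < 365:
        raise ValueError("this sketch only covers resources at least one year old")
    return first_year_loss + daily_loss * (days_since_shared - 365)


if __name__ == "__main__":
    for days in (365, 730, 1095):
        print(days, round(estimated_percent_missing(days), 2))
```

Under these assumptions, the estimate grows by about 7.3 percentage points per additional year.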
  82. 82. Measured vs. Predicted
  83. 83. First Attempts at Shared Content Replacement • We performed an experiment to gauge how many of the missing resources could be replaced with other similar resources. • Collected a dataset of available resources which we assumed to be missing • Used our method to extract the replacement resources • Measured the similarity with the original resource
  84. 84. First Attempts at Shared Content Replacement • We were able to extract another resource with >70% similarity to the missing resource in >40% of the cases
  85. 85. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  86. 86. Temporal Intention Relevancy Model (TIRM) • Between t_tweet and t_click, the linked resource could have: Changed, or Not changed • The tweet and the linked resource could be: Still relevant, or No longer relevant
  87. 87. Resource is changed but relevant • The resource changed • But it is still relevant → Intention: need the current version of the resource at any time
  88. 88. Relevancy and Intention Mapping • Changed and relevant: Current
  89. 89. Resource is changed and not relevant • The resource changed • And it is no longer relevant → Intention: need the past version of the resource at any time
  90. 90. Relevancy and Intention Mapping • Changed and relevant: Current • Changed and not relevant: Past
  91. 91. Resource is not changed and relevant • The resource is not changed • And it is relevant → Intention: need the past version of the resource at any time
  92. 92. Relevancy and Intention Mapping • Changed and relevant: Current • Changed and not relevant: Past • Not changed and relevant: Past
  93. 93. Resource is not changed and not relevant • The resource is not changed • But it is not relevant → Intention: I am not sure which version of the resource I need
  94. 94. Relevancy and Intention Mapping • Changed and relevant: Current • Changed and not relevant: Past • Not changed and relevant: Past • Not changed and not relevant: Not Sure
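
The four cases walked through on the preceding slides amount to a small lookup from (changed, relevant) to an intention. A minimal sketch of that mapping follows; the `Intention` enum and `map_intention` function are hypothetical names used for illustration, not part of the talk.

```python
from enum import Enum


class Intention(Enum):
    CURRENT = "current version"
    PAST = "past version"
    NOT_SURE = "not sure"


def map_intention(changed: bool, relevant: bool) -> Intention:
    """Map the resource's state between t_tweet and t_click to an intention,
    following the four cases listed on the slides."""
    if changed and relevant:
        return Intention.CURRENT   # changed but still relevant
    if not changed and not relevant:
        return Intention.NOT_SURE  # unchanged and not relevant
    return Intention.PAST          # the two remaining cases


if __name__ == "__main__":
    for changed in (True, False):
        for relevant in (True, False):
            print(changed, relevant, map_intention(changed, relevant).value)
```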
  95. 95. Our investigation angles: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention
  96. 96. Feature extraction • For each tweet we perform: • Link analysis • Social media mining • Archival existence • Sentiment analysis • Content similarity • Entity identification
  97. 97. 1- Link analysis • Since the tweets have embedded resources shortened by Bit.ly we can extract: • Total number of clicks • Hourly click logs • Creation dates • Referring websites • Referring countries • We calculate the depth of the resource in relation to its domain (whether it is a leaf node or a root page) by counting the number of slashes in the resource’s URI
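
The depth feature described above can be sketched with the standard library. This is an illustrative variant rather than the code used in the study: it counts non-empty path segments instead of raw slash characters, and the function name `uri_depth` is an assumption.

```python
from urllib.parse import urlparse


def uri_depth(uri: str) -> int:
    """Depth of a resource relative to its domain: 0 for a root page,
    larger values for pages deeper in the site."""
    path = urlparse(uri).path
    # Ignore empty segments so a trailing slash does not add depth.
    return len([segment for segment in path.split("/") if segment])


if __name__ == "__main__":
    print(uri_depth("http://www.cnn.com/"))                            # root page
    print(uri_depth("http://storify.com/maq4sure/egypts-revolution"))  # deeper page
```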
  98. 98. 2- Social Media Mining • Twitter: Using Topsy.com’s API to extract: • Total number of tweets • The most recent 500 • Number of tweets by influential users • The collection of tweets extracted provided an extended context of the resource authored by users in the twittersphere.
  99. 99. 2- Social Media Mining • Facebook: Also mined for likes, shares, posts, and clicks related to each resource.
  100. 100. 3- Archival Existence • Using Memento TimeMaps we get: • Total mementos available • Number of different archives • The closest archived version to the tweet time
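
As a rough illustration of the three archival features above, the sketch below parses a TimeMap in link format and derives them. The sample TimeMap, the archive host names, and the tweet time are made up for the example, the function names are hypothetical, and each archive is assumed to be identified by its host name.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

# Hypothetical TimeMap excerpt in link format; hosts and entries are invented.
SAMPLE_TIMEMAP = """\
<http://archive-a.example/web/20090625232522/http://www.cnn.com/>; rel="first memento"; datetime="Thu, 25 Jun 2009 23:25:22 GMT",
<http://archive-b.example/20090627/http://www.cnn.com/>; rel="memento"; datetime="Sat, 27 Jun 2009 10:00:00 GMT",
<http://archive-a.example/web/20090726234411/http://www.cnn.com/>; rel="last memento"; datetime="Sun, 26 Jul 2009 23:44:11 GMT"
"""

MEMENTO_PATTERN = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*\bmemento\b[^"]*";\s*datetime="([^"]+)"'
)


def parse_mementos(timemap_text):
    """Return (memento URI, datetime) pairs found in a link-format TimeMap."""
    return [(uri, parsedate_to_datetime(stamp))
            for uri, stamp in MEMENTO_PATTERN.findall(timemap_text)]


def archival_features(timemap_text, tweet_time):
    """Total mementos, number of distinct archives, and the memento closest to the tweet time."""
    mementos = parse_mementos(timemap_text)
    archives = {urlparse(uri).netloc for uri, _ in mementos}
    closest = min(mementos, key=lambda m: abs(m[1] - tweet_time), default=None)
    return {
        "total_mementos": len(mementos),
        "archive_count": len(archives),
        "closest_memento": closest[0] if closest else None,
    }


if __name__ == "__main__":
    # Assumed tweet time for illustration only.
    tweet_time = datetime(2009, 6, 25, 19, 0, tzinfo=timezone.utc)
    print(archival_features(SAMPLE_TIMEMAP, tweet_time))
```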
  101. 101. 4- Sentiment Analysis • Using NLTK’s natural language processing libraries • Extract the most prominent sentiment in the text
  102. 102. 5- Content Similarity • Steps: • We download the HTML content using the Lynx browser. • We apply a boilerplate removal algorithm and full text extraction. • We calculate the cosine similarity between the two pages. • Example: 70% similarity
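
The last step, cosine similarity between two extracted texts, can be sketched as follows. This is a minimal bag-of-words version under stated assumptions: raw term counts, a simple lowercase alphanumeric tokenizer, and no stemming or stop-word removal. The weighting actually used in the study is not given on the slide.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercased word counts; a deliberately simple tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors (0.0 to 1.0)."""
    a, b = term_frequencies(text_a), term_frequencies(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no tokens
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    old_page = "Michael Jackson dies at 50"
    new_page = "Pop star Michael Jackson dies in Los Angeles"
    print(round(cosine_similarity(old_page, new_page), 2))
```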
  103. 103. 6- Entity Identification • By visual inspection we observed that the majority of tweets about celebrities are related to current events. • We harvested Wikipedia for lists of actors, politicians, and athletes. • We checked for the existence of a celebrity mention in the tweets. • Example: Actor: Johnny Depp
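
Checking a tweet for a celebrity mention can be sketched as a whole-word lookup against a name list. The names below are a tiny hypothetical sample standing in for the lists harvested from Wikipedia, and the function name is an assumption.

```python
import re

# Hypothetical sample; the talk describes harvesting full lists from Wikipedia.
CELEBRITY_NAMES = {"johnny depp", "michael jackson"}


def mentioned_celebrities(tweet_text: str, celebrity_names) -> set:
    """Return the names (lowercase, punctuation-free) that appear as whole words in the tweet."""
    words = " ".join(re.findall(r"[a-z0-9]+", tweet_text.lower()))
    padded = f" {words} "
    # Padding with spaces prevents matches that start or end inside another word.
    return {name for name in celebrity_names if f" {name} " in padded}


if __name__ == "__main__":
    print(mentioned_celebrities("RIP Michael Jackson, the king of pop", CELEBRITY_NAMES))
```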
  104. 104. The trained classifier • From the feature extraction phase we extracted 39 different features to train the classifier. • Using 10-fold cross validation, the cost-sensitive classifier based on Random Forests gave the highest success rate: 90.32%
  105. 105. What’s Next for Hany? • Finish up my dissertation • Defend • Get a research/data scientist position • Interests: L3S Research Center Germany, Microsoft Research
  106. 106. Summary: 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: (a) missing from the live web, (b) changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention • Email: hany@cs.odu.edu • Office: 3102 • Website: http://www.cs.odu.edu/~hany/ • Twitter: @hanysalaheldeen
