Doctoral Defense: Hany SalahEldeen

Detecting, Modeling, and Predicting User Temporal Intention in Social Media

Published in: Science

  1. 1. 2015 Hany SalahEldeen Dissertation Defense 1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media. Hany M. SalahEldeen, Doctor of Philosophy Dissertation Defense, Old Dominion University, Department of Computer Science. Advisor: Dr. Michael L. Nelson. Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous. May 5th, 2015
  2. All tweets are equal… …but some are more equal than the others
  3. It is imperative to know… 1. How long would these last? 2. And if lost, is there a backup somewhere? 3. Is this what the author intended?
  4. To maintain historical integrity: Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.
  5. Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  6. People rely on social media for the most up-to-date information
  7. Social media is more than kitty photos: Marie Colvin (January 12, 1956 – February 22, 2012), Rémi Ochlik (16 October 1983 – 22 February 2012), Ahmed Assem (1987 – July 8, 2013)
  8. For the web is dark, and full of missing content… Accessed in July 2014: 3 out of 8 external links on Rémi’s Wikipedia page return 404
  9. even for content shared in social media Accessed in July 2014
  10. News sites are also prone to change Accessed in July 2014
  11. So are specialized sites Accessed in July 2014
  12. Research Problem: Author’s Intention ≠ Reader’s Experience
  13. Research Implication Author’s Intention ≠ Reader’s Experience Broken Inconsistent Web and Historical Records
  14. Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  15. Social Post
  16. The anatomy of a tweet: Author’s username, Other user mention, Tweet Body, Hash Tag, Shortened URL to resource, Publishing timestamp, Social Post, Shared Resource, Interaction options
  17. 3 URIs = 3 Chances to fail
  18. URL shortening and aliasing
      curl -L -I http://bit.ly/losing_revolution
      HTTP/1.1 301 Moved Permanently
      Server: nginx
      Date: Mon, 07 Jul 2014 18:19:48 GMT
      Cache-Control: private; max-age=90
      Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
      Mime-Version: 1.0
      Set-Cookie: _bit=53bae4c4-00328-04f10-cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19:48 2015;path=/; HttpOnly
      Content-Type: text/html;charset=utf-8
      Content-Length: 167

      HTTP/1.1 200 OK
      Expires: Mon, 07 Jul 2014 18:19:52 GMT
      Date: Mon, 07 Jul 2014 18:19:52 GMT
      Cache-Control: private, max-age=0
      Last-Modified: Mon, 07 Jul 2014 18:19:07 GMT
      ETag: "e3555826-b103-4daa-a3f2-d0509ebab51f"
      X-Content-Type-Options: nosniff
      X-XSS-Protection: 1; mode=block
      Server: GSE
      Alternate-Protocol: 80:quic
      Content-Type: text/html;charset=UTF-8
      Content-Length: 0
  19. Life cycle of a social post
  20. Life cycle of a social post tweets
  21. Life cycle of a social post tweets Links to
  22. Life cycle of a social post tweets What the reader receives Links to Same state the author intended
  23. Life cycle of a social post tweets What the reader receives Links to Same state the author intended Ideally!
  24. Life cycle of a social post tweets What the reader receives Links to Same state the author intended After a period of time
  25. Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared After a period of time
  26. Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared The resource has changed After a period of time
  27. Memento framework * http://mementoweb.org/guide/rfc/
  28. Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  29. Related Work (further details: refer to chapter 3)
      • Social media analysis:
      • Understanding Microblogging: Zhao 2009; Yang 2010; Newman 2003; Kwak 2010; Java 2007; Cha 2009
      • History Narration: Vieweg 2010; Starbird 2010-2012; Qu 2011; Neubig 2011; Lehman and Lalmas 2012-2013
      • User’s Web Search Intention: Ashkan 2009; Lee 2005; Loser 2008; Azzopardi 2009; Baeza-Yates 2006; Dai 2011
      • Commercial Intention: Guo 2010; Benczur 2007
      • Sentiment Analysis: Mishne 2006; Bollen 2011
      • Access to Archives: Van de Sompel 2009
      • Persistence of shared resources: Nelson 2002; Sanderson 2011; McCown 2007
      • URL Shortening: Antoniades 2011
      • Tweeting, Micro-blogging and Popularity: Wu 2011; Java 2007; Kwak 2010
      • Social Networks Growth and Evolution: Meeder 2011
  30. Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  31. Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?
  32. Research Goals
      • Detect the temporal intention of: 1. the author upon sharing time; 2. the reader upon dereferencing time. (Further details: refer to chapter 6)
      • Model this intention as a function of time, nature of the resource, and its context. (Further details: refer to chapter 7)
      • Predict how resources change with time and the intention behind sharing them to minimize inconsistency. (Further details: refer to chapter 8)
      • Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web. (Further details: refer to chapter 9)
  33. Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  34. Shared Resource Time User Our analysis covers three angles
  35. Shared Resource Time User Loss and Persistence of Shared Resources
  36. Shared Resource Time User Alive First: Estimate social media content loss
  37. Six socially significant events
      Event | Source | Year
      Iranian Election | SNAP Dataset | 2009
      H1N1 Virus Outbreak | SNAP Dataset | 2009
      Michael Jackson’s Death | SNAP Dataset | 2009
      Obama’s Nobel Peace Prize | SNAP Dataset | 2009
      The Egyptian Revolution | Twitter, Websites, Books | 2011
      The Syrian Uprising | Twitter API | 2012
  38. Twitter tag expansion and filtration
  39. Twitter tag expansion increases precision
  40. What are people sharing?
  41. Existence on the live web and in the archives
      • For each unique URL we resolved the final HTTP response and considered 2 classes:
        • Success: 200 OK
        • Failure: 4XX, 50X families and the 30X loop redirects or soft 404s.
      • Utilize the memento aggregator:
        • Archived: if it has at least one memento in the timemap
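The two-class rule on this slide could be sketched as follows. This is a minimal illustration, not the code used in the study: the function and parameter names are hypothetical, the redirect-loop and soft-404 flags are assumed to come from a separate detection step, and treating every non-200 final status as a failure is an assumption.

```python
def classify_resource(final_status, redirect_loop=False, soft_404=False):
    """Return 'success' or 'failure' for a resolved URL.

    final_status: HTTP status code of the last response in the redirect chain.
    redirect_loop, soft_404: flags from hypothetical upstream detectors.
    """
    if redirect_loop or soft_404:
        return "failure"
    if final_status == 200:
        return "success"
    # 4XX and 50X families; any other non-200 final status is also
    # treated as a failure here (an assumption of this sketch).
    return "failure"


def is_archived(memento_count):
    """A resource counts as archived if its TimeMap lists at least one memento."""
    return memento_count >= 1
```

For example, `classify_resource(200, soft_404=True)` is counted as a failure even though the server answered 200 OK.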
  42. Resources Missing and Archived
      Collection | Percentage Missing | Percentage Archived
      H1N1 Outbreak | 23.49% | 41.65%
      Michael Jackson | 36.24% | 39.45%
      Iran | 26.98% | 43.08%
      Obama | 24.59% | 47.87%
      Egypt | 10.48% | 20.18%
      Syria | 7.04% | 5.35%
      Additional unlabeled values: 31.62% 30.78% 24.47% 36.26% 25.64% 43.87% 26.15% 46.15%
  43. Shared Resource Time User Alive Missing Second: Can we measure existence and disappearance as a function of time?
  45. Timeline of Events
  46. Timeline of Events
  47. Social Events Having a Bimodal Time Distribution
  48. Timeline of Events
  49. Social Events Having a Bimodal Time Distribution
  50. Existence as a function of time
  51. Existence as a function of time
  52. Conclusion: Existence could be estimated as a function of time
      • Results:
        • Measured 21,625 resources from 6 data sets in archives & live web.
        • After a year from publishing about 11% of content shared on social media will be gone.
        • After this we are losing roughly 0.02% daily.
      • Publications and Articles:
        1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html, 2012.
        2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, 2012.
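The two figures in this conclusion (about 11% lost after the first year, then roughly 0.02% per day) can be turned into a rough estimator. The sketch below uses only those two numbers; the function name, the refusal to estimate for resources younger than one year, and the cap at 100% are assumptions, and it is not the fitted model from the publications.

```python
FIRST_YEAR_LOSS_PCT = 11.0  # about 11% gone one year after publishing (from the slide)
DAILY_LOSS_PCT = 0.02       # roughly 0.02% lost per day after that (from the slide)


def estimated_loss_pct(days_since_publishing):
    """Estimate the percentage of shared resources that are missing.

    Only defined from one year onwards, because the slide gives no
    rate for the first year (an assumption of this sketch).
    """
    if days_since_publishing < 365:
        raise ValueError("this sketch only covers resources at least one year old")
    extra_days = days_since_publishing - 365
    # Cap at 100% so very old resources do not exceed total loss.
    return min(100.0, FIRST_YEAR_LOSS_PCT + DAILY_LOSS_PCT * extra_days)
```

Under these assumptions, a resource shared two years ago falls in a collection with an estimated loss of about 18.3%.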
  53. Revisiting Existence after a year
      Collections (as labeled on the slide): MJ, Iran, H1N1, Obama, Egypt, Syria
      Missing:
        Measured: 37.10% 37.50% 28.17% 30.56% 26.29% 31.62% 32.47% 24.64% 7.55% 12.68%
        Predicted: 31.72% 31.42% 31.96% 30.98% 30.16% 29.68% 29.60% 28.36% 19.80% 11.54%
        Error: 5.38% 6.08% 3.79% 0.42% 3.87% 1.94% 2.87% 3.72% 12.25% 1.14%
        Average Prediction Error = 4.15%; in all cases, our missing predictions were acceptable
      Archived:
        Measured: 48.61% 40.32% 60.80% 55.04% 47.97% 52.14% 48.38% 40.58% 23.73% 0.56%
        Predicted: 61.78% 61.18% 62.26% 60.30% 58.66% 57.70% 57.54% 55.06% 37.94% 21.42%
        Error: 13.17% 20.86% 1.46% 5.26% 10.69% 5.56% 9.16% 14.48% 14.21% 20.86%
        Average Prediction Error = 11.57%; in all cases, our archival predictions were too optimistic
  54. Shared Resource Time User Alive Missing Replaced Third: Can we use social context to find replacements of missing resources?
  55. Context discovery and shared resource replacement
      Problem: The 140-character limit restricts the description of the linked resource. If the resource goes missing, can we get the next best thing?
      Solution:
      • Shared links typically have several tweets, responses, and retweets
      • We can mine these traces for context and viable replacements
  56. Context Discovery Linking to: http://beta.18daysinegypt.com/
  57. What if the resource disappeared? Linking to: http://beta.18daysinegypt.com/
  58. Use Topsy to discover tweets sharing the same link
  59. Social Context Extraction
      {
        "URI": "http://beta.18daysinegypt.com/",
        "Related Tweet Count": 500,
        "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70 ...",
        "Users who talked about this": "@petra_stienen: @waleedrashed: @omarsamra @ungormite: @dcisbusy @webdocumentario: ...",
        "All associated unique links:": "http://t.co/63X1f3f1 http://t.co/reBh6c4V http://t.co/B3GuhQN4 http://t.co/X2sjf4Rf http://t.co/P9iR28fH http://t.co/1C4EPh8h ...",
        "All other links associated:": "http://vimeo.com/35368376 http://mashable.com/2012/01/21/18daysinegypt-2/ ",
        "Most frequent link appearing:": "http://t.co/2ke0rEjP",
        "Number of times the Most frequent link appearing:": 49,
        "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= via @18daysinegypt",
        "Number of times the Most frequent tweet appearing:": 46,
        "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT",
        "Number of times the Most common phrase appearing:": 18
      }
  60. Build a Tweet Document
      A tweet document represents the concatenation of all extracted tweets:
      “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
  61. Tweet Signature
      Tweet Document: “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
      Tweet Signature = top 5 most frequent terms from Tweet Document: documentary, project, daysinegypt, check, sourced
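A tweet signature as defined on this slide (the top 5 most frequent terms of the tweet document) could be computed roughly as below. This is a sketch under stated assumptions: the tokenizer, the stop-word list, and the function name are hypothetical, since the slides do not say how terms were cleaned or filtered.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not say which one was used.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "or", "it", "do",
              "you", "your", "on", "in", "is", "out", "rt", "via"}


def tweet_signature(tweets, k=5):
    """Return the k most frequent non-stop-word terms across all tweets."""
    # The tweet document is the concatenation of all extracted tweets.
    document = " ".join(tweets).lower()
    terms = re.findall(r"[a-z0-9']+", document)
    counts = Counter(term for term in terms if term not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]
```

The resulting terms are then joined into a search-engine query, as the following slides show.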
  62. Query Google with the Tweet Signature
  63. Search Engine Results The original resource
  64. Search Engine Results The original resource The others are good replacement candidates
  65. Recommendation Evaluation
      We extract a dataset of resources that are currently available:
      • Pretend these resources no longer exist (for a baseline)
      • Each resource is text-based
      • Each resource has at least 30 retrievable tweets
      → Extracted 731 unique resources
      We use a boilerplate removal library to remove the template from:
      • the linked resources
      • the top 10 retrieved results from Google
      We use cosine similarity to compare the documents
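The comparison step could look like the sketch below, which computes cosine similarity between two already-cleaned texts. The raw term-frequency weighting, the tokenizer, and the function names are assumptions; the slides only state that cosine similarity was used after boilerplate removal.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = term_vector(text_a), term_vector(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

A candidate from the search results would then count as a viable replacement when its similarity to the original reaches the chosen threshold (70% in the evaluation that follows).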
  66. Similarity measures in resource replacement (70% similarity threshold): in 41% of the cases we found a replacement with >=70% similarity
  67. Conclusion: We can find viable replacements for missing shared resources
      • Results:
        • In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource
        • The search results provide a mean reciprocal rank of 0.43
      • Publications:
        1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
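For readers unfamiliar with the metric, mean reciprocal rank averages 1/rank of the first acceptable result over all queries. The sketch below shows the standard definition; the function name and the convention that a query with no acceptable result contributes 0 are assumptions, and the slides do not state which result was treated as acceptable.

```python
def mean_reciprocal_rank(first_hit_ranks):
    """Mean reciprocal rank over a list of queries.

    first_hit_ranks: for each query, the 1-based rank of the first
    acceptable result, or None if no acceptable result was returned.
    """
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)
```

An MRR of 0.43 therefore means that, on average, the first acceptable result sits between the second and third positions.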
  68. Now that we have finished analyzing the shared resource… what’s next?
  69. Shared Resource Time User Alive Missing Replaced Footprints on the web
  70. The tweet, the resource… and time. Timeline labels: Posted a tweet; Read the tweet; Another tweet posted; And another … Relevancy of the resource to the tweet changed through time → we need to measure that. We need to measure tweet relevance through time
  71. Shared Resource Time User Alive Missing Replaced Rate of Change Longitudinal Study: Rate of change of shared content
  72. Pilot 1: Resource change in the first 80 hours after tweeting
  73. Pilot 2: Delta days from Bitly creation for just tweeted content Dataset size = 4,000
  74. Pilot 3: Dataset of 1,000 freshly created Bitlys
      http://www.cnn.com → depth = 0
      http://www.cnn.com/world → depth = 1
      http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson → depth = 6
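The depth values in these three examples match the number of non-empty path segments in the URI. A minimal sketch of that reading follows; the function name is hypothetical, and ignoring query strings and trailing slashes is an assumption.

```python
from urllib.parse import urlparse


def uri_depth(uri):
    """Count the non-empty path segments of a URI."""
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])
```

With this definition the three URIs on the slide yield depths 0, 1, and 6 respectively.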
  75. What domains do users link to?
  76. What categories* do users link to? * Extracted from Alexa.com
  77. Summation of Intention in Social Content Through Time
      Longitudinal study: We record the change over an extended period of time:
      • Content: we download a snapshot of the resource every 45 minutes
      • Metadata: we collect metadata about the resource
        • Facebook likes, posts
        • Tweets in the last hour
        • Bitly clicklogs and shares
      • Average data size: ~1 TB per month
  78. Hourly analysis over an extended period of time
  79. There is a difference between t_tweet and t_click
      • After just one hour, 4% of the resources have changed by 30%.
      • After six hours, the percentage doubled: 8% of the resources have changed by 40%.
      • After a day, the rate of change slowed: 12% of the resources have changed by 40%.
      • After that, it almost stabilizes, with 17% of the resources having changed by 40%.
  80. Shared Resource Time User Alive Missing Replaced Rate of Change Archive & Creation First: Resource – Time – Public Archives
  81. Revisited: Resources Missing and Archived
      Collection | Percentage Missing | Percentage Archived
      H1N1 Outbreak | 23.49% | 41.65%
      Michael Jackson | 36.24% | 39.45%
      Iran | 26.98% | 43.08%
      Obama | 24.59% | 47.87%
      Egypt | 10.48% | 20.18%
      Syria | 7.04% | 5.35%
      Additional unlabeled values: 31.62% 30.78% 24.47% 36.26% 25.64% 43.87% 26.15% 46.15%
  82. But more generally, we want to know…
  83. How much of the web is archived?
      • Goal: Estimate how much of the public web is present in the public archives and how many copies are available.
      • Action: Getting 4 different datasets from 4 different sources:
        • Search Engine Indices
        • Bit.ly
        • DMOZ
        • Delicious
  85. Conclusion: It depends on the source
      • Results:
      • Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.
      Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
      2013: 95% 92% 23% 26%
  86. Side Experiment: Analyzing the quality of the archives and the archived content
      • Goal:
        • Assessing the quality of the web archives
        • Better discussed in Justin Brunelle’s work
      • Publications:
        1. J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award)
  87. A question emerged: When did a certain resource first appear on the web?
  88. Shared Resource Time User Alive Missing Replaced Rate of Change Archive & Creation Second: When was the resource created?
  89. Idea Web pages leave trails as well since the day they were created…
  90. Web Resource: Web trails
      A web page could leave any of the following trails denoting its existence:
      • References
      • Links (anchors)
      • Social media likes and interactions
      • URL shortening
      • Backlinks
      The creation date of any of the associated events/trails could be an estimate of the creation date.
  91. Resource’s timeline
  92. Observations Recorded
      1. Last modified date from the response header.
      2. First appearance of a backlink.
      3. First tweet published.
      4. First Bitly shortened URL created.
      5. Time stamp of first memento in the archives.
      6. Date of the last crawl by the search engine.
  93. Carbon Date service
  94. Carbon Dating API
      {
        "self": "http://cd.cs.odu.edu/cd?url=http://www.cnn.com",
        "URI": "http://www.cnn.com",
        "Estimated Creation Date": "1998-12-06T04:02:33",
        "Last Modified": "",
        "Bitly.com": "2008-06-08T12:00:00",
        "Topsy.com": "2015-01-25T23:31:42",
        "Backlinks": "2003-03-12T05:35:44",
        "Google.com": "2005-01-11T00:00:00",
        "Archives": [
          ["Earliest", "1998-12-06T04:02:33"],
          ["By_Archive", {
            "http://archive.today/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
            "http://arquivo.pt/wayback/wayback/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
            "http://wayback.vefsafn.is/wayback/20011106102722/http://www.cnn.com/": "1998-12-06T04:02:33",
            "http://web.archive.org/web/20131218180509/http://www.cnn.com/": "2013-12-18T18:05:09"
          }]
        ]
      }
  95. Evaluation Dataset → From each we randomly selected 100 unique URLs to create our gold standard dataset
  96. Evaluation
      • Applied our 6 methods on 1200 resources.
      • Get leftmost estimate.
      Outcome | Number of Resources | Percentage
      An estimate found | 910 | 76%
      Exact matching estimate | 393 | 33%
      No estimate found | 290 | 24%
      Total Resources | 1200 | 100%
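Taking the "leftmost estimate" means choosing the earliest timestamp among whatever the individual methods returned. The sketch below illustrates that selection on a dictionary shaped loosely like the API response shown earlier; the function name and the flat input format are assumptions, and it is not the Carbon Date service's own code.

```python
from datetime import datetime


def estimate_creation_date(observations):
    """Return the earliest (leftmost) timestamp among the observations.

    observations: dict mapping a source name to an ISO 8601 timestamp
    string, or to "" when that source produced no date.
    Returns None if no source produced a date.
    """
    dates = [datetime.fromisoformat(value) for value in observations.values() if value]
    return min(dates) if dates else None


# Values taken from the cnn.com example response.
example = {
    "Last Modified": "",
    "Bitly.com": "2008-06-08T12:00:00",
    "Topsy.com": "2015-01-25T23:31:42",
    "Backlinks": "2003-03-12T05:35:44",
    "Google.com": "2005-01-11T00:00:00",
    "Archives (earliest)": "1998-12-06T04:02:33",
}
```

For the example values, the earliest archive memento is the leftmost date, which agrees with the "Estimated Creation Date" field in the response.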
  97. Actual Vs. Estimated Dates
  98. Conclusion: We can estimate the creation date of resources correctly
      • Results:
        • Succeeded in estimating the creation date accurately in 75.90% of the resources.
      • Publications:
        1. H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013
  99. Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities. http://cd.cs.odu.edu/
  100. Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities. Yes, it’s better than mine… I admit it
  101. Shared Resource Time User Alive Missing Replaced Rate of Change Archive & Creation User’s Temporal Intention
  102. Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.
  103. Shared Resource Time User Alive Missing Replaced Rate of Change Archive & Creation Detecting What is Intention and how to detect it?
  104. Amazon’s Mechanical Turk • Crowdsourcing Internet marketplace • Co-ordinates the use of human intelligence to perform tasks that computers are currently unable to do.* * http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk
  105. Goal: Understand and collect user intention data via MT. Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Classifier; Train
  106. Goal: Understand and collect user intention data via MT. Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Classifier; Train • Problem: • It is not as easy as it seems!
  107. How NOT to classify temporal intention 101 • The tweet is presented along with the two snapshots: at t_tweet and at t_click
  108. And compared MT results with Experts
      • Experts: Manually assigning a version to each tweet via a face-to-face meeting with WS-DL members.
      • For 9 MT assignments per tweet:
        • If we allowed 4-5 splits, we got a 58% match with WS-DL.
        • If we allowed 3-6 splits or better, we got a 31% match → which is worse than flipping a coin!
  109. Idea: We need to transform the problem from intention to relevance.
  110. Relevance tasks are simpler
      • MT workers are more accustomed to classification tasks, which require a minimal amount of explanation
      • Transform a hard problem to an easy one
      Is that a cat? - Yes - No
  111. Temporal Intention Relevancy Model (TIRM) Between t_tweet and t_click: The linked resource could have: • Changed • Not changed The tweet and the linked resource could be: • Still relevant • No longer relevant
  112. Resource is changed but relevant • The resource changed • But it is still relevant → Intention: need the current version of the resource at any time
  113. Relevancy and Intention mapping: Current
  114. Resource is changed and not relevant • The resource changed • But it is no longer relevant → Intention: need the past version of the resource at any time
  115. Relevancy and Intention mapping: Past, Current
  116. Resource is not changed and relevant • The resource is not changed • And it is relevant → Intention: need the past version of the resource at any time
  117. Relevancy and Intention mapping: Past, Current, Past
  118. Resource is not changed and not relevant • The resource is not changed • But it is not relevant → Intention: I am not sure which version of the resource I need
  119. Relevancy and Intention mapping
       Changed, relevant → Current
       Changed, not relevant → Past
       Not changed, relevant → Past
       Not changed, not relevant → Not Sure
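The four cases of the TIRM mapping built up over the preceding slides can be written as a small lookup. This is an illustrative sketch only; the function name and the string labels are hypothetical.

```python
def temporal_intention(changed, relevant):
    """Map (resource changed?, still relevant to the tweet?) to the intended version."""
    if changed and relevant:
        return "current"   # the reader needs the current version
    if changed and not relevant:
        return "past"      # the reader needs the version from tweet time
    if not changed and relevant:
        return "past"      # unchanged, so the past version is what was shared
    return "not sure"      # unchanged yet not relevant: intention is unclear
```

For example, a news front page that has changed but still matches the tweet maps to "current", while a changed page that no longer matches the tweet maps to "past".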
  120. Validation: Update the MT experiment • MT workers ≡ judgments of the experts (WS-DL members) ✓ Is the content still relevant to the tweet?
  121. Mechanical Turk Workers Vs. Experts
       • For 100 tweets, WS-DL members % of agreement:
         Agreement in 3-2 split or more votes | 93%
         Agreement in 4-1 split or more votes | 80%
         Agreement with 5-0 votes | 60%
       • Cohen’s K = 0.854 → almost perfect agreement
  122. Shared Resource Time User Alive Missing Replaced Rate of Change Archive & Creation Detecting Modeling Can we model this temporal intention?
  123. Data Collection
       • From SNAP dataset we extracted:
         • Tweets in English
         • Each has an embedded URI pointing to an external resource.
         • The embedded URI is shortened via Bit.ly
       • The external resource:
         • Still persists.
         • Has at least 10 mementos.
         • Is unique.
       → We extracted 5,937 unique instances
  124. Time delta between the tweet and the closest memento
       Randomly selected 1,124 instances
       Time delta range: 3.07 minutes to 56.04 hours
       Average: 25.79 hours ~ 1 day
       Axis labels: Before Tweet time, Tweet time, After Tweet time
  125. Training Dataset
       • R_current: The state of the resource at current time.
       • R_click: The state of the resource at click time.
       Relevant Assignments | 929 | 82.65%
       Non-Relevant Assignments | 195 | 17.35%
       5 MT workers agreeing (5-0 split) | 589 | 52.40%
       4 MT workers agreeing (4-1 split) | 309 | 27.49%
       3 MT workers agreeing (3-2 close call split) | 226 | 20.11%
  127. Intention modeling: Feature extraction
       • For each tweet we perform:
         • Link analysis
         • Social media mining
         • Archival existence
         • Sentiment analysis
         • Content similarity
         • Entity identification
  128. Training the classifier
       • From the feature extraction phase we extracted 39 different features to train the classifier.
       • Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate: 90.32%
  129. 129. 2015 Hany SalahEldeen Dissertation Defense 129 Most significant features sorted by information gain Rank Feature Gain Ratio 1 Existence of celebrities in tweets 0.149 2 Number of mementos 0.090 3 Tweet similarity with current page 0.071 4 Similarity: Current & past page 0.053 5 Similarity: Tweet & past page 0.044 6 Original URI’s depth 0.032
  130. 130. 2015 Hany SalahEldeen Dissertation Defense 130 Testing the model • We tested against: • The remaining 4,813 from the original 5,937 instances after extracting the 1,124 used in training. • The Tweet Collections based on historic events. (MJ, Obama, Iran, Syria, & H1N1)
      Dataset | Status 200 | Status 404 or other | Relevant % | Non-Relevant %
      Extended 4,813 instances | 96.77% | 3.23% | 96.74% | 3.26%
      MJ’s Death | 57.54% | 42.46% | 93.24% | 6.76%
      H1N1 Outbreak | 8.96% | 91.04% | 97.48% | 2.52%
      Iran Elections | 68.21% | 31.79% | 94.69% | 5.31%
      Obama’s Nobel Prize | 62.86% | 37.14% | 93.89% | 6.11%
      Syrian Uprising | 80.80% | 19.20% | 70.26% | 29.75%
  131. 131. 2015 Hany SalahEldeen Dissertation Defense 131 Recap… Idea: We needed to transform the problem from intention to relevance. Now we need to transform it back!
  132. 132. 2015 Hany SalahEldeen Dissertation Defense 132 Recap: Relevancy and Intention mapping Past Reading the wrong history
  133. 133. 2015 Hany SalahEldeen Dissertation Defense 133 Mapping TIRM • We used 70% similarity as a threshold of relevancy. • Readers are reading the wrong history in up to 25% of the cases
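The 70% threshold can be illustrated with a small similarity check between two text snapshots. This is a sketch under assumptions: the slides do not name the similarity measure, so bag-of-words cosine similarity stands in for it here, and the function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the word-count vectors of two texts (0.0 to 1.0)."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def is_relevant(text_a, text_b, threshold=0.70):
    """Apply the 70% threshold from the slide to the similarity score."""
    return cosine_similarity(text_a, text_b) >= threshold
```

Snapshots that score below the threshold are the cases where a reader following the link today may see content that no longer matches what was shared.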
  134. 134. 2015 Hany SalahEldeen Dissertation Defense 134 Conclusion: We can model users’ temporal intention accurately and efficiently • Results: • We successfully transformed the complicated problem of intention to a simpler one of relevance. • We successfully collected a gold standard dataset of temporal user intention. • We found a temporal inconsistency in the shared resource in up to 25% of the cases, according to the dataset. • Publications: 1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.
  135. 135. 2015 Hany SalahEldeen Dissertation Defense 135 So we modeled intention… can we make it better?
  136. 136. 2015 Hany SalahEldeen Dissertation Defense 136 Most significant features sorted by information gain
      Rank | Feature | Gain Ratio
      1 | Existence of celebrities in tweets | 0.149
      2 | Number of mementos | 0.090
      3 | Tweet similarity with current page | 0.071
      4 | Similarity: Current & past page | 0.0527
      5 | Similarity: Tweet & past page | 0.04401
      6 | Original URI’s depth | 0.0324
  138. 138. 2015 Hany SalahEldeen Dissertation Defense 138 Enhancing TIRM • Extending and tuning the features: • Linguistic feature analysis • Semantic similarity analysis using latent topic modeling • Dataset balancing • Feature selection and minimization
  139. 139. 2015 Hany SalahEldeen Dissertation Defense 139 A whole lot of features! From 39 to 65 different features in the extended TIRM. Further details: refer to chapter 7
  140. 140. 2015 Hany SalahEldeen Dissertation Defense 140 TIRM enhancement and minimization results
  141. 141. 2015 Hany SalahEldeen Dissertation Defense 141 From binary to probabilistic strength • Point of Confusion: C • Point of Certainty: S (Strongest Current Intention) Further details: refer to chapter 7
  142. 142. 2015 Hany SalahEldeen Dissertation Defense 142 Intention strength formulation Intention strength magnitude of the new resource: Generalization with regard to class:
  143. 143. 2015 Hany SalahEldeen Dissertation Defense 143 Intention strength across instances in dataset
  144. 144. 2015 Hany SalahEldeen Dissertation Defense 144
  145. 145. 2015 Hany SalahEldeen Dissertation Defense 145 Can we find a relation between the modeled intention and time… to predict it? [Diagram labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change; Archive & Creation; Detecting, Modeling, Predicting]
  146. 146. 2015 Hany SalahEldeen Dissertation Defense 146 Remember: Data Collection • From the SNAP dataset we extracted: • Tweets in English • Each has an embedded URI pointing to an external resource. • The embedded URI is shortened via Bit.ly • The external resource: • Still persists. • Has at least 10 mementos. • Is unique. Result: we extracted 5,937 unique instances
  147. 147. 2015 Hany SalahEldeen Dissertation Defense 147 Intention strength across time • We have 10 mementos of the resource uniformly distributed… • We can calculate intention strength at every point [Figure: timeline running from "Resource = Closest memento" to "Resource = current version"]
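One way to obtain mementos that are roughly uniformly distributed over a time span is to pick, for each of ten evenly spaced target times, the archived copy closest to it. This is a sketch under assumptions: timestamps are plain numbers (for example epoch seconds), the function name is hypothetical, and the slides do not state how the ten mementos were actually chosen.

```python
from bisect import bisect_left


def pick_uniform_mementos(memento_times, start, end, count=10):
    """For `count` evenly spaced targets in [start, end], return the
    closest memento timestamp to each target (count must be >= 2)."""
    times = sorted(memento_times)
    if not times:
        return []
    step = (end - start) / (count - 1)
    chosen = []
    for i in range(count):
        target = start + i * step
        pos = bisect_left(times, target)
        # Only the neighbours on either side of the target can be closest.
        candidates = times[max(pos - 1, 0):pos + 1]
        chosen.append(min(candidates, key=lambda t: abs(t - target)))
    return chosen
```

Intention strength could then be evaluated once per selected memento to trace how it evolves between the closest memento and the current version. A sparse archive may return the same memento for several targets.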
  148. 148. 2015 Hany SalahEldeen Dissertation Defense 148 Intention strength across time Dataset collection and calculation framework
  149. 149. 2015 Hany SalahEldeen Dissertation Defense 149 Behavior of instances in different classes [Figure: plots of intention strength against time; labeled classes include Steady Current Intention and Steady Past Intention]
  150. 150. 2015 Hany SalahEldeen Dissertation Defense 150 Behavior of instances in different classes
  151. 151. 2015 Hany SalahEldeen Dissertation Defense 151 Given the features we already collected can we classify tweets according to their behavioral class?
  152. 152. 2015 Hany SalahEldeen Dissertation Defense 152 Classifying intention behavior across time
  153. 153. 2015 Hany SalahEldeen Dissertation Defense 153 If we can limit the features to the ones that exist before tweet time can we perform a prediction?
  154. 154. 2015 Hany SalahEldeen Dissertation Defense 154 Classifying intention behavior across time: We can perform a prediction!
  155. 155. 2015 Hany SalahEldeen Dissertation Defense 155 Intention behavior prediction classifier
  156. 156. 2015 Hany SalahEldeen Dissertation Defense 156 Conclusion: We can predict the author’s temporal intention • Results: • We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will remain consistent or change. • Publications: 1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.
  157. 157. 2015 Hany SalahEldeen Dissertation Defense 157 At this point, we have successfully detected, modeled, and predicted users’ temporal intention in shared content
  158. 158. 2015 Hany SalahEldeen Dissertation Defense 158 Temporal Intention Model [Diagram labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change; Archive & Creation; Detecting, Modeling, Predicting; User Temporal Intention]
  159. 159. 2015 Hany SalahEldeen Dissertation Defense 159 So we built an awesome prediction model for Temporal Intention… what next?
  160. 160. 2015 Hany SalahEldeen Dissertation Defense 160 A Framework of Temporal Intention • Tools for authors • Enrich the archives with current content for posterity [Timeline: Posted a tweet, Read the tweet]
  161. 161. 2015 Hany SalahEldeen Dissertation Defense 161 Prediction API
  162. 162. 2015 Hany SalahEldeen Dissertation Defense 162 Tools for Authors
  163. 163. 2015 Hany SalahEldeen Dissertation Defense 163 Temporal Intention Implementation • Tools for readers • Maintain the temporal consistency of content [Timeline: Posted a tweet, Read the tweet]
  164. 164. 2015 Hany SalahEldeen Dissertation Defense 164 Tools for readers
  165. 165. 2015 Hany SalahEldeen Dissertation Defense 165 Tools for readers 1. Temporal preservation of vulnerable content 2. Version recommendation based on temporal intention estimation Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016
  166. 166. 2015 Hany SalahEldeen Dissertation Defense 166 Motivation Background Related Research Research Question User-Time-Shared Resource Conclusions
  167. 167. 2015 Hany SalahEldeen Dissertation Defense 167 Accomplished Goals • Detect the temporal intention of: 1. The author at sharing time 2. The reader at dereferencing time (Further details: refer to chapter 6) • Model this intention as a function of time, nature of the resource, and its context. (Further details: refer to chapter 7) • Predict how resources change with time and the intention behind sharing them to minimize inconsistency. (Further details: refer to chapter 8) • Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web. (Further details: refer to chapter 9)
  168. 168. 2015 Hany SalahEldeen Dissertation Defense 168 Also, our work reached fame…
  169. 169. 2015 Hany SalahEldeen Dissertation Defense 169 The Virginian Pilot
  170. 170. 2015 Hany SalahEldeen Dissertation Defense 170 BBC.com http://www.bbc.com/future/story/20120927-the-decaying-web
  171. 171. 2015 Hany SalahEldeen Dissertation Defense 171 Popular Mechanics February 2014 issue, page 20
  172. 172. 2015 Hany SalahEldeen Dissertation Defense 172 3 x MIT Technology Review http://www.technologyreview.com/view/513996/how-to-carbon-date-a-web-page/ http://www.technologyreview.com/view/519391/internet-archaeologists-reconstruct-lost-web-pages/ http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
  173. 173. 2015 Hany SalahEldeen Dissertation Defense 173 Mashable
  174. 174. 2015 Hany SalahEldeen Dissertation Defense 174 Mashable Yes, I am the Indiana Jones of the internet
  175. 175. 2015 Hany SalahEldeen Dissertation Defense 175 Publications Published Submitted In preparation Planned JCDL 2011 TPDL 2015 WWW 2016 IJDL 2016 TPDL 2012 SIGIR 2016 WSDM 2016 JCDL 2013 TPDL 2013 WWW 2013 DL 2014 AAAI 2015 IJDL 2015 JCDL 2015
  176. 176. 2015 Hany SalahEldeen Dissertation Defense 176 Remember Rémi Ochlik? Rémi Ochlik 16 October 1983 – 22 February 2012
  177. 177. 2015 Hany SalahEldeen Dissertation Defense 177 … and the missing content about him? Accessed in July 2014
  178. 178. 2015 Hany SalahEldeen Dissertation Defense 178 We can maintain the consistency of history Our Temporal Intention Relevancy Model
