Carbon Dating The Web: Estimating the Age of Web Resources

Presented at TempWeb 03, WWW 2013, Rio de Janeiro, Brazil

  1. 1. Carbon Dating the Web: Estimating the Age of Web Resources
     Hany M. SalahEldeen & Michael L. Nelson
     Old Dominion University, Department of Computer Science
     Web Science and Digital Libraries Lab
  2. 2. Motivation
     In our research on social media, resource sharing, and user intention, a question emerged:
     When did a certain resource first appear on the web?
  3. 3. First thought: Last-Modified response header
     $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
     HTTP/1.1 200 OK
     Content-Type: text/html; charset=UTF-8
     Expires: Wed, 08 May 2013 14:18:49 GMT
     Date: Wed, 08 May 2013 14:18:49 GMT
     Cache-Control: private, max-age=0
     Last-Modified: Wed, 08 May 2013 08:03:02 GMT
     ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d"
     X-Content-Type-Options: nosniff
     X-XSS-Protection: 1; mode=block
  4. 4. The server responds with the last modified date …
     Labels on the same response: real creation date, current server datetime, last modified date (incorrect).
  5. 5. Lacks accuracy
     The Last-Modified value is problematic because it is inaccurate in a large percentage of cases.
     Here: 08 May 2013 ≠ 2012-02-11
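
The mismatch on slide 5 can be checked with a few lines of Python. This is only a sketch: it hard-codes the header values from the curl output and takes 2012-02-11 (the date in the URL) as the real creation date; the helper name `last_modified` is an illustrative choice, not part of the authors' tool.

```python
from datetime import date
from email.utils import parsedate_to_datetime

# Header values copied from the curl output in the slides.
headers = {
    "Date": "Wed, 08 May 2013 14:18:49 GMT",
    "Last-Modified": "Wed, 08 May 2013 08:03:02 GMT",
}

def last_modified(response_headers):
    """Return Last-Modified as a datetime, or None when the header is absent."""
    value = response_headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

observed = last_modified(headers)
real_creation = date(2012, 2, 11)  # date embedded in the blog post URL
gap = observed.date() - real_creation

print(observed.isoformat())
print(gap.days, "days between creation and Last-Modified")
```

A response without the header, such as the temporalweb.net example on the next slide, simply yields `None` here.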
  6. 6. Last modified date header is not available
     % curl -I http://temporalweb.net/
     HTTP/1.1 200 OK
     Set-Cookie: 60gpBAK=R1224192509; path=/; expires=Sat, 11-May-2013 03:45:10 GMT
     Date: Sat, 11 May 2013 02:37:55 GMT
     Content-Type: text/html
     Connection: keep-alive
     Set-Cookie: 60gp=R152135972; path=/; expires=Sat, 11-May-2013 03:36:44 GMT
     Server: Apache/2.2.X (OVH)
     Accept-Ranges: bytes
     Vary: Accept-Encoding
     Sometimes it is not present in the response headers.
  7. 7. Second thought: Timestamp on the page
  8. 8. But the timestamp is highly inconsistent
  9. 9. … and dependent on the page's style/scheme.
  10. 10. So is its location on the page
  11. 11. Pages' Timestamps Differ
     • Very dependent on the page's scheme/style
     • Not consistent
     • Sometimes non-existent
  12. 12. Shortcomings of using timestamp extraction
     M. Inoue and K. Tajima developed a technique for extracting creation timestamps from web pages.
     Shortcomings:
     • Ambiguity (12/07: is it the 12th of July or the 7th of December?).
     • Not generalizable.
     • Highly dependent on the specific CMS.
     • Highly dependent on the most prominent timestamp patterns.
     Reference: M. Inoue and K. Tajima. Noise robust detection of the emergence and spread of topics on the web. In Proceedings of the 2nd Temporal Web Analytics Workshop, TempWeb 12, pages 9-16, New York, NY, USA, 2012. ACM.
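
The ambiguity in the first bullet is easy to reproduce. In this small sketch the year 2012 is an assumption added only so the string can be parsed; the slide gives just "12/07".

```python
from datetime import datetime

raw = "12/07"
assumed_year = "2012"  # not on the slide; added only to make the string parseable

# The same characters parse cleanly under both conventions.
day_first = datetime.strptime(f"{raw}/{assumed_year}", "%d/%m/%Y").date()
month_first = datetime.strptime(f"{raw}/{assumed_year}", "%m/%d/%Y").date()

print(day_first)    # read as day/month: 12 July
print(month_first)  # read as month/day: 7 December
```

Both parses succeed without error, so nothing in the string itself tells an extractor which reading the page author intended.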
  13. 13. But what if the resource itself doesn't exist any more?
  14. 14. Third thought: First existence in public archives
     Timestamp of the first memento
  15. 15. Shortcomings:
     1- The page is not archived.
  16. 16. Shortcomings:
     2- Delay between page creation and the archive's first crawl.
  17. 17. Shortcomings:
     3- A page is published, then deleted before it is archived.
  18. 18. Shortcomings:
     4- The archive's quarantine (12 months to 2 weeks).
  19. 19. Goal
     Create a tool that estimates, with generality, the creation date of the resource without relying on specific infrastructures.
  20. 20. Target Specification
     • Doesn't rely on the infrastructure of the hosting web server.
     • Doesn't rely on the state and template of the resource.
     • Highly generic.
     • Fast response with no quarantine periods.
     • High accuracy, getting close estimates to the real creation date.
  21. 21. Idea
     Moving objects leave trails…
  22. 22. Idea
     Moving objects leave trails…
     Or: Foo, if you were Aussie; Chad, if you were British
  23. 23. Idea
     Web pages leave trails as well, since the day they were created…
  24. 24. Web Trails
     A web page could leave a trail of any of the following, denoting its existence:
     – References
     – Links (anchors)
     – Social media likes and interactions
     – URL shortening
     – Backlinks
  25. 25. The Assumptions
     We can propose the reasonable assumptions that:
     1. We have no prior knowledge of the resource or its hosting web server.
     2. The creation date and the publishing date of a resource coincide.
        Ex.: When you write a blog post, you publish it as soon as you create it.
  26. 26. Idea
     The creation date of any of the events/trails associated with the web resource could be an estimate of the resource's creation date.
  27. 27. Scenario
     Consider the following scenario: on Saturday night, the 11th of February of last year, I wrote a post about my work on the research group's blog.
  28. 28. After creating the post I tweeted about it …
     https://twitter.com/hanysalaheldeen/status/168704224488730625
  29. 29. Then it picked up some speed on Twitter and Facebook …
     http://topsy.com/http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
  30. 30. The timeline of the resource
  31. 31. Given the events linked to the existence of the resource, we will examine ways to extract first observations
  32. 32. Age Estimation Methods
     1. Resource and server analysis.
     2. Backlinks analysis.
        a) Web page backlinks.
        b) Social media backlinks.
     3. Archiving analysis.
     4. Search engine indexing analysis.
  33. 33. Resource and Server Analysis
     $ curl -I http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
     HTTP/1.1 200 OK
     Content-Type: text/html; charset=UTF-8
     Expires: Wed, 08 May 2013 14:18:49 GMT
     Date: Wed, 08 May 2013 14:18:49 GMT
     Cache-Control: private, max-age=0
     Last-Modified: Wed, 08 May 2013 08:03:02 GMT
     ETag: "e419d850-22ae-4fe6-a0f4-8ab9477f0c0d"
     X-Content-Type-Options: nosniff
     X-XSS-Protection: 1; mode=block
     Examine the server response and extract the last modified date from the header, if it exists.
  34. 34. Observations recorded:
     1. Last modified date from the response header.
  35. 35. Age Estimation Methods
     1. Resource and server analysis.
     2. Backlinks analysis.
        a) Web page backlinks.
        b) Social media backlinks.
     3. Archiving analysis.
     4. Search engine indexing analysis.
  36. 36. Backlinks Analysis
     Diagram: pages B and C link to A (the resource).
     • We use the Google search API to discover backlinks of A.
     • We assume B and C were created after A was created.
     • But this assumption is not completely true.
     • Page B or C could have been modified after its own creation to link to A.
  37. 37. Time Magazine
     Ex.: Suppose the front page of Time magazine finally decided to feature me as "Person of the Year".
     In this case page B (Time magazine's front page) was modified to point to my page A (Hany's website).
  38. 38. When did the link first appear?
     To solve this problem:
     1. We extract the timemap of the archived mementos of B.
     2. Perform a binary search to locate the first appearance of the link to A in B.
     3. Get the timestamp of that first memento.
     (Timeline figure, with one memento marked "I first appeared here!")
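
The three steps on slide 38 can be sketched as a binary search over mementos sorted by time. The sketch assumes that once the link to A appears in B it stays in all later mementos (otherwise binary search is not guaranteed to find the true first appearance). The function name, the in-memory list standing in for a timemap, and the example URLs are all hypothetical.

```python
from datetime import datetime

def first_memento_with_link(mementos, contains_link):
    """Find the earliest memento of B that links to A.

    mementos: list of (datetime, content) pairs, sorted oldest first.
    contains_link: predicate applied to a memento's content.
    Returns the datetime of the first matching memento, or None.
    """
    lo, hi = 0, len(mementos)
    while lo < hi:
        mid = (lo + hi) // 2
        if contains_link(mementos[mid][1]):
            hi = mid          # link present: first appearance is here or earlier
        else:
            lo = mid + 1      # link absent: first appearance is later
    return mementos[lo][0] if lo < len(mementos) else None

# Hypothetical mementos of page B; the link to A appears in the fourth one.
LINK_TO_A = "http://a.example/post"
mementos_of_b = [
    (datetime(2011, 11, 1), "<p>no links yet</p>"),
    (datetime(2011, 12, 1), "<p>still nothing</p>"),
    (datetime(2012, 1, 15), "<p>unrelated</p>"),
    (datetime(2012, 2, 20), f"<a href='{LINK_TO_A}'>A</a>"),
    (datetime(2012, 3, 30), f"<a href='{LINK_TO_A}'>A</a> and more"),
]

first_seen = first_memento_with_link(mementos_of_b, lambda html: LINK_TO_A in html)
print(first_seen)
```

Binary search keeps the number of mementos that need to be inspected logarithmic in the size of the timemap, which matters when each inspection means downloading an archived copy.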
  39. 39. Observations recorded:
     1. Last modified date from the response header.
     2. First appearance of a backlink.
  40. 40. Social Media Backlinks
     • Similarly, you create a social backlink when you tweet about a page.
  41. 41. Topsy Otter API
     Up to 500 tweets
  42. 42. Topsy Otter API
     Different shortened versions
  43. 43. Topsy Otter API
     Break ties via the API epoch
  44. 44. Observations recorded:
     1. Last modified date from the response header.
     2. First appearance of a backlink.
     3. First tweet published.
  45. 45. URL Shortening
     http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
     http://bit.ly/losing_revolution
     • Extract number of clicks
     • Creation date of the Bitly
  46. 46. Observations recorded:
     1. Last modified date from the response header.
     2. First appearance of a backlink.
     3. First tweet published.
     4. First Bitly shortened URL created.
  47. 47. Age Estimation Methods
     1. Resource and server analysis.
     2. Backlinks analysis.
        a) Web page backlinks.
        b) Social media backlinks.
     3. Archiving analysis.
     4. Search engine indexing analysis.
  48. 48. Archives Analysis
     • Download the memento timemaps of the resource.
     • Get the timestamp of the first memento.
     • Furthermore, if the original headers exist for the first memento, we extract the original last modified date.
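
As a rough sketch of the first two steps, the snippet below pulls memento datetimes out of a timemap in link format and returns the earliest. The sample timemap, its URLs, and the regular expression are illustrative assumptions; real timemaps carry more attributes and a production parser should handle them more carefully.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical timemap in link format; the URLs are placeholders.
SAMPLE_TIMEMAP = '''<http://example.org/page>; rel="original",
<http://archive.example/20120615000000/http://example.org/page>; rel="memento"; datetime="Fri, 15 Jun 2012 00:00:00 GMT",
<http://archive.example/20120301000000/http://example.org/page>; rel="first memento"; datetime="Thu, 01 Mar 2012 00:00:00 GMT"
'''

# Matches entries whose rel value contains the word "memento" followed by a datetime.
MEMENTO_LINK = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*\bmemento\b[^"]*";\s*datetime="([^"]+)"'
)

def first_memento(timemap_text):
    """Return (datetime, uri) of the earliest memento, or None if there are none."""
    entries = [
        (parsedate_to_datetime(stamp), uri)
        for uri, stamp in MEMENTO_LINK.findall(timemap_text)
    ]
    return min(entries) if entries else None

earliest = first_memento(SAMPLE_TIMEMAP)
print(earliest[0].isoformat(), earliest[1])
```

Taking the minimum instead of trusting the order of entries keeps the sketch correct even if a timemap is not sorted.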
  49. 49. Observations recorded:
     1. Last modified date from the response header.
     2. First appearance of a backlink.
     3. First tweet published.
     4. First Bitly shortened URL created.
     5. Timestamp of the first memento in the archives.
  50. 50. Age Estimation Methods
     1. Resource and server analysis.
     2. Backlinks analysis.
        a) Web page backlinks.
        b) Social media backlinks.
     3. Archiving analysis.
     4. Search engine indexing analysis.
  51. 51. Search Engine Index Analysis
     • We use Google's search API to extract the last crawled date.
     • Relatively short time between resource creation and search engine discovery.
     • Drawback: granularity is by day, not by time.
     (Screenshot: last crawled dates)
  52. 52. Observations recorded:
     1. Last modified date from the response header.
     2. First appearance of a backlink.
     3. First tweet published.
     4. First Bitly shortened URL created.
     5. Timestamp of the first memento in the archives.
     6. Date of the last crawl by the search engine.
  53. 53. OK, now we have a collection of sources that return creation dates. What do we do next?
  54. 54. Timestamps Accumulation
     • We collect the obtained dates and take the leftmost (earliest) one as the estimated creation date.
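
Taking the leftmost date amounts to a minimum over whichever methods returned a value. In this sketch only the Last-Modified value comes from the slides; the other timestamps and the dictionary keys are made-up placeholders, and `None` stands for a method that found nothing.

```python
from datetime import datetime

def leftmost(observations):
    """Return (method, datetime) for the earliest observation, or None if all are missing."""
    found = {method: ts for method, ts in observations.items() if ts is not None}
    if not found:
        return None
    method = min(found, key=found.get)
    return method, found[method]

# Placeholder observations for one resource (only last_modified is from the slides).
observations = {
    "last_modified": datetime(2013, 5, 8, 8, 3, 2),
    "first_backlink": None,                            # no backlink found
    "first_tweet": datetime(2012, 2, 12, 3, 10, 0),    # hypothetical
    "first_bitly": datetime(2012, 2, 12, 3, 9, 30),    # hypothetical
    "first_memento": datetime(2012, 4, 2, 17, 0, 0),   # hypothetical
    "last_crawl": datetime(2013, 5, 1),                # hypothetical
}

print(leftmost(observations))
```

Methods that fail simply drop out of the comparison, so the estimate degrades gracefully when a source is unavailable.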
  55. 55. Timestamps Accumulation
  56. 56. Next step: Verifying our methods
  57. 57. Estimated Age Verification
     1. Collect a dataset of web pages of known creation/publishing date.
     2. Compare the estimated results from our method with the actual dates recorded.
  58. 58. Gold Standard Data Collection
     We collect the pages from 4 different categories of collections to ensure variation.
     1. News Sites.
     2. Social Media and Blogs.
     3. Long Standing Domains.
     4. Manual Random Extraction.
  59. 59. News Sites
     Using RSS and Atom feeds or XML sitemaps, we extracted numerous pages along with their respective publishing dates.
     1. Google News (29,154 pages)
     2. BBC (3,703 pages)
     3. CNN (18,519 pages)
     4. Yahoo News (34,588 pages)
     5. The Hollywood Gossip (6,859 pages)
  60. 60. Social Sites
     We randomly selected different resources with no regard to popularity, to avoid the inherent bias:
     1. Pinterest (55,463 posts)
     2. Tumblr (52,513 posts)
     3. YouTube (78,000 posts)
     4. WordPress (2,405,901 posts)
     5. Blogger (32,417 posts)
  61. 61. Long Standing Domains
     • Extract the top 500 domains from Alexa.com.
     • Query their DNS registry dates.
     • We were able to extract 167 dates.
  62. 62. Manual Random Extraction
     • We extracted 90 different random URLs obtained from random walks on the web, and visually inspected them to extract the creation date.
     • The 10 URLs analyzed by Jatowt et al.*
     * A. Jatowt, Y. Kawai, and K. Tanaka. Detecting age of page content. In Proceedings of the 9th annual ACM international workshop on Web information and data management, WIDM 07, pages 137-144, New York, NY, USA, 2007. ACM.
  63. 63. Gold Standard Data Collection
     From each collection we randomly selected 100 unique URLs to create our gold standard dataset.
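
One way to draw 100 unique URLs per collection is sketched below. The function name, the fixed seed, and the toy collections are assumptions made for illustration; the slides do not say how the random selection was implemented.

```python
import random

def sample_gold_standard(collections, per_collection=100, seed=0):
    """Pick up to per_collection unique URLs from each collection.

    collections: dict mapping a collection name to a list of URLs.
    A fixed seed keeps this illustrative sample reproducible.
    """
    rng = random.Random(seed)
    sample = {}
    for name, urls in collections.items():
        unique_urls = sorted(set(urls))  # drop duplicates, fix the order
        size = min(per_collection, len(unique_urls))
        sample[name] = rng.sample(unique_urls, size)
    return sample

# Toy stand-ins for two of the collections named in the slides.
toy_collections = {
    "BBC": [f"http://news.example/bbc/{i}" for i in range(3703)],
    "Long Standing Domains": [f"http://domain{i}.example/" for i in range(167)],
}

gold = sample_gold_standard(toy_collections)
print({name: len(urls) for name, urls in gold.items()})
```

If the twelve collections listed on the preceding slides each contribute 100 URLs, the total comes to 1,200 resources.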
  64. 64. Evaluation
     • Applied our 6 methods to 1200 resources.
     • Took the leftmost estimation.

                                 Number of Resources   Percentage
     An estimation found         910                   76%
     Exact matching estimation   393                   33%
     No estimation found         290                   24%
     Total resources             1200                  100%
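
The percentages in the table follow directly from the counts, as this short check shows. The last line, the share of exact matches among resources that received any estimate, is an extra ratio derived from the same counts and is not reported on the slide.

```python
total = 1200
counts = {
    "An estimation found": 910,
    "Exact matching estimation": 393,
    "No estimation found": 290,
}

# Rounded shares of the 1200 resources, as reported in the table.
shares = {label: f"{n / total:.0%}" for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {counts[label]} ({share})")

# Derived, not on the slide: exact matches among resources with an estimate.
exact_among_found = counts["Exact matching estimation"] / counts["An estimation found"]
print(f"Exact matches among estimated resources: {exact_among_found:.0%}")
```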
  65. 65. Evaluation
  66. 66. Actual vs. Estimated Dates
  67. 67. So what happens if one of these 6 methods fails?
  68. 68. Isolation and Elimination
  69. 69. Carbon Date API
  70. 70. http://cd.cs.odu.edu/cd/<Your URL Here>
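
The URL pattern on slide 70 suggests that a request is formed by appending the target URL to the endpoint. The sketch below only builds that request URL. Whether the service expects the target to be percent-encoded, and what the response looks like, are not stated in the slides, so the helper (a hypothetical name) does plain concatenation and makes no network call.

```python
CARBON_DATE_ENDPOINT = "http://cd.cs.odu.edu/cd/"

def carbon_date_request_url(target_url):
    """Append the target URL to the endpoint, following the pattern on the slide."""
    return CARBON_DATE_ENDPOINT + target_url

request_url = carbon_date_request_url(
    "http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html"
)
print(request_url)
```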
  71. 71. Carbon Date API on GitHub
     • Because the hosted service responds slowly, we advise that you download the module and install it on your machine.
     • https://github.com/HanySalahEldeen/CarbonDate
  72. 72. Extra Slides
  73. 73. Without Bitly
  74. 74. Without Google
