The Memento Protocol and Research Issues With Web Archiving


Michael L. Nelson

Old Dominion University
Web Science & Digital Libraries Research Group

University of Virginia Colloquium

  1. 1. The Memento Protocol and Research Issues With Web Archiving Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group @phonedude_mln With: Los Alamos National Laboratory: Herbert Van de Sompel ODU: Michele C. Weigle, Hany SalahEldeen, Matthias Prellwitz, Justin Brunelle, Mat Kelly, Ahmed AlSum, Scott Ainsworth University of Virginia Colloquium 2016-09-12
  2. 2.*/ also:
  3. 3. Memento wants to make it easy to access the Web of the Past.
  4. 4. Memento achieves this by technically integrating the present Web and the past Web, by introducing a uniform version access capability for the Web.
  5. 5. Content Management Systems: • Designed to be aware of all versions of a resource; • Self-contained; • Variety of proprietary version mechanisms; • Versions interlinked using proprietary mechanisms.
  6. 6. World Wide Web: • Designed to forget about prior versions of a resource; • Distributed.
  7. 7. There are resource versions on the Web: • Content Management Systems; • Web Archives; • Transactional archives; • Search engine caches.
  8. 8. But the Web architecture has no way to deal with them: • Cannot talk about a resource as it used to exist; • Cannot access a prior version knowing the current one; • Cannot access the current version knowing a prior one; Current approaches are ad hoc and localized.
  9. 9. Memento: • Looks at the Web as a Content Management System; • Introduces the uniform capability to access versions on the Web; • Does not build new archives but leverages all systems that host versions: Web archives, Content Management Systems, Software Version Systems, etc.
  10. 10. Memento’s version access approach: • Is distributed: versions may exist on several servers; • Uses datetime as a global version indicator; • Is based on the primitives of the Web: resource, resource state, representation, content negotiation, link.
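A minimal sketch of "datetime as a global version indicator" combined with content negotiation. In the published Memento protocol (RFC 7089) the client expresses the desired datetime in an `Accept-Datetime` request header; the helper name `accept_datetime_headers` and the sample datetime are assumptions made for illustration, and nothing is sent over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build request headers asking for a resource as it existed at `when`.

    `accept_datetime_headers` is a hypothetical helper name; the header
    name Accept-Datetime comes from RFC 7089.
    """
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # HTTP dates are expressed in GMT, so normalize to UTC first.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Illustrative target datetime (an assumption, not taken from the slides).
headers = accept_datetime_headers(
    datetime(2005, 8, 27, 15, 16, tzinfo=timezone.utc)
)
print(headers)
```

These headers would accompany a request to a server that performs datetime negotiation; which server that is depends on the archive and is not specified here.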
  11. 11. Since Memento’s access approach is distributed, and is based on Web primitives, it scales like the Web.
  12. 12. Memento’s core components: • Ability to speak about a resource as it existed in the past; • A bridge between present and past: link and content negotiation; • A bridge between past and present: link.
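The "bridges" above are carried by typed links. As a rough sketch, the following parses an HTTP `Link` header and groups the targets by relation type. The relation types `original`, `timegate`, and `memento` are the ones defined in RFC 7089; the function name, the regular expressions, and every URL in the sample header are assumptions for illustration only, and the parser handles only simple, well-formed headers.

```python
import re

# One link-value: <uri> followed by zero or more ;name=value parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value: str) -> dict:
    """Map each relation type to a list of (uri, params) pairs."""
    links = {}
    for uri, raw_params in _LINK.findall(value):
        params = {
            name.lower(): quoted or bare
            for name, quoted, bare in _PARAM.findall(raw_params)
        }
        # A rel attribute may carry several space-separated relation types.
        for rel in params.get("rel", "").split():
            links.setdefault(rel, []).append((uri, params))
    return links


# Hypothetical header, as an archived copy might return it.
sample = (
    '<http://example.com/page>; rel="original", '
    '<http://archive.example.org/timegate/http://example.com/page>; rel="timegate", '
    '<http://archive.example.org/20050827151600/http://example.com/page>; '
    'rel="first memento"; datetime="Sat, 27 Aug 2005 15:16:00 GMT"'
)
links = parse_link_header(sample)
print(links["original"][0][0])   # bridge from past to present
print(links["timegate"][0][0])   # bridge from present to past
```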
  13. 13. original resource and versions
  14. 14. bridge from present to past
  15. 15. bridge from past to present
  16. 16. Memento Framework
  17. 17. original resource gone
  18. 18. original resource’s server gone
  19. 19. original resource provides no link
  20. 20. Integrating Multiple Archives more info:
  21. 21. Memento wants to make it easy to access the Web of the Past…
  22. 22.
  23. 23. in 4 different web archives
  24. 24. Long Tail of Archives
  25. 25. Using Only Top-k Archives for URI Lookup Yields Good Results Even when there are 100s of archives, we only need to talk to a few. see:
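One way to picture the top-k idea: rank archives by how many sampled URIs they hold, keep the first k, and check how much coverage survives. This is an illustrative sketch only, not the method from the referenced work; the function names, archive names, and lookup log are all hypothetical.

```python
from collections import Counter


def top_k_archives(lookup_log, k):
    """Return the k archives that held copies of the most sampled URIs."""
    hits = Counter()
    for archives_with_copy in lookup_log.values():
        hits.update(set(archives_with_copy))
    return [name for name, _ in hits.most_common(k)]


def coverage(lookup_log, chosen):
    """Fraction of archived URIs held by at least one chosen archive."""
    chosen = set(chosen)
    archived = [a for a in lookup_log.values() if a]
    found = sum(1 for a in archived if chosen & set(a))
    return found / len(archived) if archived else 0.0


# Hypothetical log: URI -> archives that returned a memento for it.
sample_log = {
    "uri-1": ["archive-A", "archive-B"],
    "uri-2": ["archive-A"],
    "uri-3": ["archive-A", "archive-C"],
    "uri-4": ["archive-B"],
    "uri-5": ["archive-D"],
    "uri-6": [],  # not archived anywhere
}
best = top_k_archives(sample_log, k=2)
print(best, coverage(sample_log, best))
```

In this made-up log, querying two of four archives still finds four of the five archived URIs, which is the shape of the trade-off the slide describes.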
  26. 26. OK, so More Archives == More Better But why care about archiving at all?
  27. 27. Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages
  28. 28. A YouTube video of a TV show where celebrities read “mean” Tweets about themselves. Our social discourse is dominated by the web. Q.E.D.
  29. 29. Our scholarly record is in jeopardy… See also:
  30. 30. As is our legal record… See also:
  31. 31. And our popular culture as well…
  32. 32. Half-Life of Popular Music YouTube Videos [Figure: two plots over the datasets Top 40 US Singles Charts, Music Blogs, and The 500 Greatest Songs: half-life linear regression by month (0 to 18), and median absolute deviation by week (0 to 10)] Matthias Prellwitz, Michael L. Nelson, Music Video Redundancy and Half-Life in YouTube, Proceedings of TPDL 2011. Individual URLs die, but new versions arise.
  33. 33. So we won’t lose every copy of “Shake It Off”… What about the grist of history?
  34. 34. On January 28 2011, three days into the fierce protests that would eventually oust the Egyptian president Hosni Mubarak, a Twitter user called Farrah posted a link to a picture that supposedly showed an armed man as he ran on a “rooftop during clashes between police and protesters in Suez”. I say supposedly, because both the tweet and the picture it linked to no longer exist. Instead they have been replaced with error messages that claim the message – and its contents – “doesn’t exist”.
  35. 35. Missing Tweet & Pic
  36. 36. In May 2013, both are “archived” by
  37. 37. In February 2015, they’re completely missing.
  38. 38. In 2016, redirecting…
  39. 39. …to a random (?) page
  40. 40. No Server == No HTTP Event == Nothing to Archive
  41. 41. Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012. Hany SalahEldeen, Michael L. Nelson, Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web, Proceedings of TPDL 2013. Missing: 11% year 1, 7%/year afterwards Archived: 7% year 1, 15%/year afterwards
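To make the rates above concrete, here is a small arithmetic sketch that extrapolates them linearly: a first-year fraction plus a fixed increment for each later year. Treating the rates as a straight line that can be extended to arbitrary ages is an assumption of this sketch; the function name is hypothetical.

```python
def linear_estimate(years: float, first_year: float, per_year_after: float) -> float:
    """Estimate a fraction after `years`, given a first-year value and a
    constant yearly increment afterwards (capped at 100%)."""
    if years < 1:
        raise ValueError("the slide gives no rate within the first year")
    return min(1.0, first_year + per_year_after * (years - 1))


# Rates as stated on the slide: 11% then 7%/year missing, 7% then 15%/year archived.
for age in (1, 2, 3):
    missing = linear_estimate(age, 0.11, 0.07)
    archived = linear_estimate(age, 0.07, 0.15)
    print(f"year {age}: missing ~{missing:.0%}, archived ~{archived:.0%}")
```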
  42. 42. Why we need multiple, independent archives…
  43. 43. A single archive is vulnerable
  44. 44. Houston, Tranquility Base Here. The Eagle has landed. see also:
  45. 45.
  46. 46. $ curl -I " got-three-grindr-dates-in-an-hour-in-the-olympic-village.html"
      HTTP/1.1 301 Moved Permanently
      Access-Control-Allow-Origin: *
      Age: 0
      Cache-Control: max-age=60
      Content-Type: text/html; charset=iso-8859-1
      Date: Thu, 18 Aug 2016 01:13:46 GMT
      Location: note-from-the-editors.html
      RealAge: 0
      Server: Apache
      Vary: Accept-Encoding, User-Agent
      Via: 1.1 varnish
      X-BackEnd: default
      X-Cache: MISS
      X-Cacheable: YES
      X-Restarts: 0
      X-UA-Device: pc
      X-Varnish: 995407903
      Connection: keep-alive
  47. 47. But who pays for those extra archives? 1TB endowment = ~$4700: see also:
  48. 48. Archives aren’t magic web sites. They’re just web sites. If you used Mummify, you’re now left with a bunch of defunct, shortened links like:
  49. 49. Don’t Throw Away the Original URL – Use Robust Links!
      <a href="" data-versionurl="" data-versiondate="2015-01-21"> my robust link to the live web</a>
      <a href="" data-originalurl="" data-versiondate="2015-01-21"> my robust link to an archived version</a>
      <!DOCTYPE html>
      <html lang="en" itemscope itemtype="" itemid="">
      <head>
      <meta charset="utf-8" />
      <meta itemprop="dateModified" content="2015-02-02">
      <meta itemprop="datePublished" content="2015-01-23">
      <title>Page Level Metadata Is The Least You Can Do</title>
      More examples / scenarios at:
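A short sketch of how a tool might read the robust-link attributes shown above, so that both the original URL and the archived version stay available whichever one sits in `href`. The attribute names come from the slide; the class name and all URLs in the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RobustLinkParser(HTMLParser):
    """Collect anchors that carry robust-link attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if "data-versiondate" not in attr:
            return  # an ordinary link, not a robust one
        href = attr.get("href")
        self.links.append({
            # If href points at the archive, data-originalurl holds the original.
            "original": attr.get("data-originalurl") or href,
            # If href points at the live web, data-versionurl holds the snapshot.
            "version": attr.get("data-versionurl") or href,
            "date": attr["data-versiondate"],
        })


# Hypothetical markup; the slide's own URLs are not reproduced here.
SAMPLE = """
<a href="http://example.com/page"
   data-versionurl="http://archive.example.org/20150121/http://example.com/page"
   data-versiondate="2015-01-21">live</a>
<a href="http://archive.example.org/20150121/http://example.com/other"
   data-originalurl="http://example.com/other"
   data-versiondate="2015-01-21">archived</a>
<a href="http://example.com/plain">plain</a>
"""
parser = RobustLinkParser()
parser.feed(SAMPLE)
for link in parser.links:
    print(link["date"], link["original"], link["version"])
```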
  50. 50. Economics Working Against Archives “In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.” --David Rosenthal
  51. 51. “We’ll use the cloud!”
  52. 52. “...when all costs are taken in to account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale.”
  53. 53. Historicity of Web Archives
  54. 54. Malaysia Airlines Flight 17 (MH17)
  55. 55. (not really archived as well as you think)
  56. 56. Ed and I Discuss Who Has What…
  57. 57. Remember MH17?
  58. 58. Alex is now 404. Would multiple archives have convinced him?
  59. 59. Do we really have “a perfect tool to produce `evidence’ of any kind”?
  60. 60. @gary4205 mansplains to @AstroKatie see also:
  61. 61. But can you prove he didn’t say this?
  62. 62. Or that she didn’t say this? (remember: black hats can use tools created by white hats)
  63. 63. Assessing the Quality of Web Archiving current: "Hooray! It's in the archive!" vs. the question we should be asking: "How well was it archived?"
  64. 64. Temporal Drift August 27, 2005 11:16 a.m. EDT link
  65. 65. Temporal Drift: Now 3 Hours in the Past August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link
  66. 66. Temporal Drift: Now 17 Days in the Future August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link September 13, 2005 8:12 a.m. EDT link
  67. 67. Temporal Drift: Now 23 (or 6) Days in the Future August 27, 2005 11:16 a.m. EDT link August 27, 2005 8:00 a.m. EDT link September 13, 2005 8:12 a.m. EDT link September 19, 2005 8:25 a.m. EDT link 10+ clicks in the archive results in median drift of ~45 days (standard UI) or ~15 days with Memento. ~2% of the sessions have drift of > 1 year. see:
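The drift figures in the last few slide titles can be checked with plain datetime arithmetic. This sketch takes the first timestamp as the target and measures each later page against it; reading "drift" as this simple difference is an assumption of the sketch, and EDT is modeled as a fixed UTC-4 offset.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")  # fixed offset, assumed for these dates

target = datetime(2005, 8, 27, 11, 16, tzinfo=EDT)
visited = [
    datetime(2005, 8, 27, 8, 0, tzinfo=EDT),
    datetime(2005, 9, 13, 8, 12, tzinfo=EDT),
    datetime(2005, 9, 19, 8, 25, tzinfo=EDT),
]


def drift(target_dt: datetime, memento_dt: datetime) -> timedelta:
    """Signed difference: negative means the memento predates the target."""
    return memento_dt - target_dt


for memento in visited:
    days = drift(target, memento).total_seconds() / 86400
    print(f"{memento:%Y-%m-%d %H:%M} drift from target: {days:+.2f} days")

# Drift relative to the previous page rather than the original target.
step = drift(visited[1], visited[2])
print(f"last click alone: {step.total_seconds() / 86400:+.2f} days")
```

Measured from the target, the three pages sit roughly 3 hours before it, 17 days after it, and 23 days after it; the last click on its own moves about 6 days, which matches the "23 (or 6)" in the title.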
  68. 68. Sometimes the Live Web "Leaks" Into the Archive…
  69. 69. see: Sept 3, 2008 2012
  70. 70. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources JCDL 2014
  71. 71. Synthetic Damage: Removing Images From • M = 0.17, D = 0.09 (live web) • M = 0.24, D = 0.41 (missing main) • M = 0.29, D = 0.36 (missing logo + navigation) • Damage (D) differs from % missing (M)!
  72. 72. Was the missing resource important? <img> and <embed> can leave hints about size and centrality. For CSS, we look at the distribution of background color in the page divided into vertical thirds.
  73. 73. Weights from Turker Assessment of Damage • first: establish that Turkers can distinguish damaged from undamaged pages (they can, 81% of the time); • second: find weights that match Turkers' rankings of (real) differently damaged versions of the same page.
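A toy illustration of why damage (D) and percent missing (M) can disagree: M counts missing resources equally, while D weights each resource by its importance. The weights, resource names, and function names below are invented for the example and are not the weights derived from the Turker study.

```python
def percent_missing(resources):
    """M: share of embedded resources that failed to load."""
    return sum(1 for _, _, missing in resources if missing) / len(resources)


def weighted_damage(resources):
    """D: importance of what is missing relative to total importance."""
    total = sum(weight for _, weight, _ in resources)
    lost = sum(weight for _, weight, missing in resources if missing)
    return lost / total


def page(missing_name):
    """Hypothetical page: (name, importance weight, missing?) per resource."""
    weights = {"main image": 10, "logo": 4, "stylesheet": 5, "tracking pixel": 1}
    return [(name, w, name == missing_name) for name, w in weights.items()]


for lost in ("main image", "tracking pixel"):
    r = page(lost)
    print(f"missing {lost}: M = {percent_missing(r):.2f}, D = {weighted_damage(r):.2f}")
```

Both pages miss one resource in four, so M is identical, yet losing the central image is ten times as damaging as losing the tracking pixel under these made-up weights.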
  74. 74. Good News: Although %Missing (M) is steady/increasing, weighted Damage (D) is decreasing
  75. 75. A Framework for Evaluation of Composite Memento Temporal Coherence Hypertext 2015
  76. 76. As Presented by IA (now 404, but that's a different story…)
  77. 77. Not Everything Is 200412091900926 + 9 months
  78. 78. 1 in 20 pages complete; 1 in 5 have violations
      Description                  Closest          Closest         Bracket          Bracket
                                   Single Archive   Multi-Archive   Single Archive   Multi-Archive
      Completeness
        Mean complete              76.1%            80.2%           76.2%            80.3%
        Mean missing               23.9%            19.8%           23.8%            19.7%
      Temporal Coherence
        Mean prima facie coherent  41.0%            40.9%           54.7%            54.6%
        Mean possibly coherent     27.3%            27.3%           12.8%            14.2%
        Mean probably violative    2.5%             5.3%            2.5%             5.3%
        Mean prima facie violative 5.3%             5.3%            6.2%             6.2%
      At least 5% of pages can be shown to be temporal violations
  79. 79. Closing Observations
  80. 80. Wrong Metaphor for Web Archives
  81. 81. Web Archives Are Not Destinations This is a destination. This is not a destination. Memento is about linking the past and present web
  82. 82. Possible Metaphor for Viewing Past & Present?
  83. 83. Turn Archiving Into A Social Activity… see also:, Marshall & Shipman, JCDL 2011
  84. 84. …But Don't Use the "A" Word Ed: Are there any zombies out there? Shaun: Don't say that! Ed: What? Shaun: That. Ed: What? Shaun: That. The Z word. Don't say it. Ed: Why not? Shaun: Because it's ridiculous! — Shaun of the Dead
  85. 85. Pinterest: Anonymous Mementos is a memento of: but there is no machine-readable indication of this relationship repins are by-reference
  86. 86. When all else fails, justify project with: “web archiving is Big Data”
  87. 87. Backup Slides
  88. 88. Archiving your internal stuff: Transactional Archiving Never miss an update; archive your site as it is being viewed by users.
  89. 89. Archiving your internal stuff: Heritrix & Wayback Crawling your intranet: Crawling JS “stuff” will take 5X more storage: mementos of Mitre Intranet “MiiTube” – Complete With Javascript leakage
  90. 90. JavaScript == the new deep web; use ResourceSync to make sure your URIs are exposed
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="" xmlns:rs="">
        <rs:ln rel="up" href=""/>
        <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z" completed="2013-01-03T09:01:00Z"/>
        <url>
          <loc></loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
        </url>
        <url>
          <loc></loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
          <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784" length="14599" type="application/pdf"/>
        </url>
      </urlset>
      (AKA “Fancy SiteMaps”)
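The `hash` and `length` attributes in the resource list above let a harvester confirm that what it fetched is what the server advertised. The following is a minimal sketch of such a check, assuming the attribute holds space-separated `algorithm:hexdigest` pairs as in the slide; the function name and the sample content are hypothetical.

```python
import hashlib


def matches_metadata(content: bytes, hash_attr: str, length: int) -> bool:
    """Check fetched bytes against the advertised length and digests."""
    if len(content) != length:
        return False
    for entry in hash_attr.split():
        name, _, expected = entry.partition(":")
        # "sha-256" in the attribute corresponds to hashlib's "sha256".
        digest = hashlib.new(name.replace("-", ""), content).hexdigest()
        if digest != expected.lower():
            return False
    return True


# Hypothetical resource and the metadata a server might publish for it.
body = b"<html><body>hello</body></html>"
advertised = (
    "md5:" + hashlib.md5(body).hexdigest()
    + " sha-256:" + hashlib.sha256(body).hexdigest()
)
print(matches_metadata(body, advertised, len(body)))         # intact copy
print(matches_metadata(body + b"!", advertised, len(body)))  # altered copy
```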
  91. 91. e.g., in six different archives…
  92. 92. Segal’s Law A man with a watch knows what time it is. A man with two watches is never sure. How to resolve conflicting archives? Personalization, GeoIP, mobile vs. desktop, etc. mean “the” page rarely exists, only “a” page. Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, A Method for Identifying Personalized Representations in Web Archives, D-Lib Magazine, 19(11/12), 2013.
  93. 93. Thoughtful analysis: Snarky analysis:
  94. 94. Why Care About The Past? From an anonymous WWW 2010 reviewer about our Memento paper (emphasis mine): "Is there any statistics to show that many or a good number of Web users would like to get obsolete data or resources? " one answer: replay of contemporary pages >> summary pages
  95. 95. vs.
  96. 96. Archiving Moves At Hurricane Speed, Most News Stories Move Faster
  97. 97. Most of the Story, at Least as Conveyed by, is Missing… in this case, you can reconstruct the events with
  98. 98. How Much of The Web Is Archived?
  99. 99. Public Archives, ca. Late 2010 / Early 2011 Three categories of archives • Internet Archive • Search engine • Other archives UK US See also:
  100. 100. 1000 URIs Ordered by First Observation Date See also:
  101. 101. see also:
  102. 102. How Much of the Web is Archived? It Depends on Which Web… Including SE cache Excluding SE Cache 90% 79% 97% 68% 35% 16% 88% 19% Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives 2013 95% 92% 23% 26%
  103. 103. Quis Archiviet Ipsos Archives? (thanks to for this example)
  104. 104. % curl -I
       HTTP/1.1 302 Found
       Server: nginx
       Date: Tue, 03 Sep 2013 00:15:14 GMT
       Content-Type: text/html; charset=utf-8
       Connection: keep-alive
       Status: 302 Found
       Location:
       X-UA-Compatible: IE=Edge,chrome=1
       Cache-Control: no-cache
       X-Request-Id: bd7caae039d6312c0542cb4ad62f3847
       X-Runtime: 0.005474
       X-Rack-Cache: miss
       current page for:
  105. 105. version of:
  106. 106. archived version of version
  107. 107. archived version of version of version