
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

Michael L. Nelson

Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln

With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein

  1. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL
     Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
     Michael L. Nelson, Old Dominion University, Web Science & Digital Libraries Research Group, @WebSciDL, @phonedude_mln
     With: ODU: Michele C. Weigle, Mohamed Aturban; Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
     Supported in part by The Andrew Mellon Foundation and the National Science Foundation
  2. This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
  3. This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
     “…right now you can get timestamps for every book, movie, song, computer program, legal document, etc. in the thousands of collections in the archive. In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots…”
  4. TL;DR
     Web archiving is not file backup.
       Backup = prevent, detect, repair changes
       Web archiving = continuous changes to replicate the past
     Naïve fixity techniques are not applicable to web archiving.
     Since 3rd party audits are not feasible, verification of web archives will centralize / federate as web archives proliferate.
  5. A simplified workflow of web archiving
     $ wget
     WARC/1.0
     WARC-Type: warcinfo
     Content-Type: application/warc-fields
     WARC-Date: 2018-11-03T17:20:02Z
     WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
     WARC-Filename: foo.warc.warc.gz
     WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
     Content-Length: 257
     software: Wget/1.15 (linux-gnu)
     format: WARC File Format 1.0
     [much deletia]
     1) Live web site: https://climate.nasa.gov/vital-signs/carbon-dioxide/
     2) Crawled by any of several archival crawlers
     3) Result stored in a WARC file (like tar or zip, but for web archives)
     4) WARC files are indexed and served by replay software (there are several variations of the Wayback Machine)
     5) User chooses the date of capture (Memento-Datetime)
     6) Page replayed with banner, rewritten links, etc.
  6. (apologies to Peter Arnett) “In order to save the page, we had to completely change it”
     Yes, some archives (including most versions of Wayback) provide “raw” access, but modifications can still happen (how and why is beyond the scope of this presentation).
  7. I’ve got mad HTML skillz https://www.cs.odu.edu/~mln/travel.html
  8. Same page, archived at IA
     https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
     Archival metadata: the banner tells the user the original URL, which archive the page resides in, when it was archived, how many copies exist, etc.
     Links are rewritten to point back into the archive, not the live web.
  9. Same page, archived at IA
     https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
     $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5
     <body bgcolor=white>
     <pre>
     -January 31-February 1, 2019, NYC, ACM Publications Board Meeting
     $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc
     585 2361 26471
  10. Same page, archived at IA
      https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
      $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5
      <body bgcolor=white>
      <pre>
      -January 31-February 1, 2019, NYC, ACM Publications Board Meeting
      $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc
      585 2361 26471
      $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | head -5
      <script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
      <script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb-app40.us.archive.org';v.server_ms=208;archive_analytics.send_pageview({});});</script>
      <script type="text/javascript" src="/static/js/ait-client-rewrite.js?v=1538596186.0" charset="utf-8"></script>
      <script type="text/javascript">
      WB_wombat_Init('https://web.archive.org/web', '20181104174441', 'www.cs.odu.edu');
      $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | wc
      618 2472 33787
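
The byte counts in the two `wc` outputs (26471 live vs. 33787 replayed) already imply that a hash of the live page cannot match a hash of its replay. A minimal Python sketch of that point, assuming a shortened stand-in for the live HTML and a single injected script tag standing in for the replay additions (both strings are illustrative, not the real responses):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the live page (assumption: abbreviated from the curl output).
live_html = b"<body bgcolor=white>\n<pre>\n-January 31-February 1, 2019, NYC\n"

# Stand-in for replay: the archive prepends its own script before the content.
injected = b'<script src="//archive.org/includes/analytics.js"></script>\n'
replayed_html = injected + live_html

live_digest = sha256_hex(live_html)
replay_digest = sha256_hex(replayed_html)

print("live  :", live_digest)
print("replay:", replay_digest)
print("match :", live_digest == replay_digest)  # prints "match : False"
```

Any injected byte is enough to change the digest, so the mismatch says nothing about whether the archived content itself was altered.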
  11. Same page, archived at archive.today http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
  12. Same page, archived at archive.today
      http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
      $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | head -5
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html style="background-color:#EEEEEE" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#" itemscope itemtype="http://schema.org/Article"><!--174.109.72.208--><!--curl/7.30.0--><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"/><meta name="robots" content="index,noarchive"/><meta name="viewport" content="device-width=300, initial-scale=1"/><meta property="twitter:card" content="summary"/><meta property="twitter:site" content="@archiveis"/><meta property="og:type" content="article"/><meta property="og:site_name" content="archive.is"/><meta property="og:url" content="http://archive.is/l6QdV" itemprop="url"/><meta property="og:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:description" content="archived 4 Nov 2018 17:46:33 UTC" itemprop="description"/><meta property="article:published_time" content="2018-11-04T17:46:33Z" itemprop="dateCreated"/><meta property="article:modified_time" content="2018-11-04T17:46:33Z" itemprop="dateModified"/><link rel="image_src" href="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="og:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png" itemprop="image"/><meta property="twitter:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:src" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:width" content="1024"/><meta property="twitter:image:height" content="768"/><link rel="icon" href="//www.google.com/s2/favicons?domain=www.cs.odu.edu"/><link rel="canonical" href="https://archive.is/l6QdV"/><link rel="bookmark" href="http://archive.today/20181104174633/https://www.cs.odu.edu/~mln/travel.html"/><title></title></head><body style="margin:0;background-color:#EEEEEE"><center><div id="HEADER" style="font-family:sans-serif;background
      [much deletia – you get the point]
      $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | wc
      730 3640 62392
  13. If we just had isolated, static pages (e.g., individual jpegs, pdfs, mp3s) then there’d be no problem.
      But HTML has: 1) links, 2) embedded resources (including iframes), and 3) Javascript, which can modify the HTML.
      And HTTP has no “bulk download”, so you can’t grab an entire site instantaneously.
  14. WARC/1.0
      WARC-Type: warcinfo
      Content-Type: application/warc-fields
      WARC-Date: 2018-11-03T17:20:02Z
      WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
      WARC-Filename: climate.nasa.gov.warc.gz
      WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
      Content-Length: 257
      software: Wget/1.15 (linux-gnu)
      format: WARC File Format 1.0
      conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
      robots: classic
      wget-arguments: "--warc-file=climate.nasa.gov" "https://climate.nasa.gov/vital-signs/carbon-dioxide/"
      WARC/1.0
      WARC-Type: request
      WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
      Content-Type: application/http;msgtype=request
      WARC-Date: 2018-11-03T17:20:02Z
      WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
      WARC-IP-Address: 54.230.195.16
      WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
      WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
      Content-Length: 141
      GET /vital-signs/carbon-dioxide/ HTTP/1.1
      User-Agent: Wget/1.15 (linux-gnu)
      Accept: */*
      Host: climate.nasa.gov
      Connection: Keep-Alive
      WARC/1.0
      WARC-Type: response
      WARC-Record-ID: <urn:uuid:5d8861ef-93c5-4d9c-87b8-4f427f963f7c>
      WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
      WARC-Concurrent-To: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
      WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
      WARC-Date: 2018-11-03T17:20:02Z
      We could hash the WARC file:
      $ md5sum climate.nasa.gov.warc.gz
      652853fe1bc8cb273cdf73aad8a489ca climate.nasa.gov.warc.gz
      But this nasa.gov page contains:
      • 201 images
      • 19 Javascript files
      • 3 CSS files
      At a large archive like IA they could be in multiple WARC files; worst case is 224 WARC files.
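
A small Python sketch of the two ideas on this slide: hashing a single file the way `md5sum` does, and the worst-case count of WARC files. It assumes the 224 is the 223 embedded resources plus the root HTML; the `file_digest` helper and the throwaway demo file are illustrative, not part of any archive's tooling.

```python
import hashlib
import os
import tempfile


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large WARC files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Counts taken from the slide; treating the root HTML as the "+ 1" is an assumption.
images, scripts, stylesheets, root_html = 201, 19, 3, 1
worst_case_warc_files = images + scripts + stylesheets + root_html

# Demonstrate the helper on a throwaway file rather than a real WARC.
demo_bytes = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

demo_digest = file_digest(demo_path)
os.remove(demo_path)

print(worst_case_warc_files)  # prints 224
print(demo_digest)
```

One file digest covers one container; if the page's resources are spread over many containers, no single digest describes the page.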
  15. We can detect changes in the root HTML https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  16. But what if the change is in an embedded resource?
  17. Clearly we need to render the entire page, then compute the hash. Unfortunately, that’s not easy.
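
To make the contrast concrete, here is a hedged Python sketch of a root-only hash versus a composite hash over the root HTML plus its embedded resources. The `composite_hash` function, the resource names, and the byte strings are all hypothetical; real replay would require fetching and rendering, which is exactly the hard part.

```python
import hashlib
from typing import Mapping


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def composite_hash(root_html: bytes, resources: Mapping[str, bytes]) -> str:
    """Hash the root HTML together with every embedded resource, in URL order."""
    hasher = hashlib.sha256()
    hasher.update(sha256_hex(root_html).encode("ascii"))
    for url in sorted(resources):
        hasher.update(url.encode("utf-8"))
        hasher.update(sha256_hex(resources[url]).encode("ascii"))
    return hasher.hexdigest()


# Hypothetical page: the root HTML is identical on both loads,
# but one embedded image differs between them.
root = b'<html><img src="banner.jpg"><script src="app.js"></script></html>'
first_load = {"banner.jpg": b"image-version-a", "app.js": b"rotate();"}
second_load = {"banner.jpg": b"image-version-b", "app.js": b"rotate();"}

root_only_same = sha256_hex(root) == sha256_hex(root)
composite_same = composite_hash(root, first_load) == composite_hash(root, second_load)

print("root-only hashes equal:", root_only_same)   # prints True
print("composite hashes equal:", composite_same)   # prints False
```

The root-only hash misses the change entirely, while the composite hash flags it, along with every benign variation in any embedded resource.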
  18. Load the archived page, get an eagle https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  19. Hit “reload”, get a tiger https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  20. Hit “reload” again, get a mountain https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  21. “Look on my Javascript, ye Mighty, and despair!”
  22. Actually, the fws.gov example was super easy; most changes are much harder to trace.
      Mohamed Aturban, unpublished; memento: http://web.archive.org/web/20130724144801/http://www.cnn.com/
      Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
  23. Temporal violations: reconstructing pages that never existed on the live web
      (examples below are transient; sometimes you get the 1st image, sometimes the 2nd image)
      • Embedded in umich.edu memento, archived in perma.cc: 2nd image is compressed (12209 vs. 19448 bytes); 2nd image modified in 2017-03, but replayed in a 2017-01 page
      • Embedded in copybogger.com memento, archived in archive.org: 2nd image modified in 2017-12, but replayed in a 2017-11 page; blackout for privacy
      Temporal violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
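
The core test behind a temporal violation can be sketched in a few lines of Python: an embedded resource that was modified after the page's capture date could not have been part of that page on the live web. This is a simplification under stated assumptions; the function name is hypothetical, and the slide gives only months, so day 1 is assumed for each date.

```python
from datetime import datetime, timezone


def is_temporal_violation(page_memento: datetime, resource_modified: datetime) -> bool:
    """True when an embedded resource postdates the page it is replayed in."""
    return resource_modified > page_memento


def month(year: int, month_number: int) -> datetime:
    """Build a UTC datetime for the first day of a month (day 1 is an assumption)."""
    return datetime(year, month_number, 1, tzinfo=timezone.utc)


# The two examples from the slide, at month granularity.
umich_violation = is_temporal_violation(month(2017, 1), month(2017, 3))
second_violation = is_temporal_violation(month(2017, 11), month(2017, 12))

# A consistent pairing for contrast: resource older than the page capture.
consistent = is_temporal_violation(month(2017, 3), month(2017, 1))

print(umich_violation, second_violation, consistent)  # prints "True True False"
```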
  24. 1 WARC file, 2 Wayback Machines, 3 Browsers = 6 different replays
      http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/
      http://web.archive.org/web/20130106140348/http://www.harvard.edu/
      See also: https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
  25. Archive                   URIMs  With at least two hashes
      ----------------------------------------------------------------
      webharvest.gov              712     712 (100%)
      archive.is                 1396    1364 (97.70%)
      vefsafn.is                 1589     739 (46.50%)
      archive-it.org             1383     815 (58.92%)
      stanford.edu               1222     831 (68.00%)
      internetmemory.org          979     979 (100%)
      nationalarchives.gov.uk     994     972 (97.78%)
      archive.bibalex.org         199     177 (88.94%)
      bac-lac.gc.ca               351     351 (100%)
      proni.gov.uk                469     129 (27.50%)
      www.webarchive.org.uk       349     329 (94.26%)
      www.webcitation.org        1585     828 (52.23%)
      veebiarhiiv.digar.ee        488     308 (63.11%)
      webarchive.loc.gov         1594     526 (32.99%)
      arquivo.pt                 1569    1563 (99.61%)
      web.archive.org            1566    1334 (85.18%)
      perma-archives.org          182     180 (98.90%)
      ----------------------------------------------------------------
      Total                     16627   12137 (72.99%)
      Data from 35 downloads over an 11 month period (2017-11 – 2018-10), Mohamed Aturban (in preparation)
      (apologies to Heraclitus) You cannot replay twice the same archived page
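
The "with at least two hashes" column can be read as: of the URI-Ms downloaded repeatedly, how many produced more than one distinct hash. A minimal Python sketch of that tally, assuming a hypothetical `observations` mapping from URI-M to the hashes seen across downloads (the sample entries are invented; only the two totals come from the slide):

```python
from typing import Dict, List


def urims_with_multiple_hashes(observations: Dict[str, List[str]]) -> int:
    """Count URI-Ms whose repeated downloads yielded two or more distinct hashes."""
    return sum(1 for hashes in observations.values() if len(set(hashes)) >= 2)


# Invented sample: three URI-Ms, each downloaded three times.
observations = {
    "urim-a": ["h1", "h1", "h1"],  # stable replay
    "urim-b": ["h1", "h2", "h1"],  # replay changed once
    "urim-c": ["h3", "h4", "h5"],  # replay changed every time
}
changed = urims_with_multiple_hashes(observations)

# Totals from the slide's bottom row.
total_urims, total_changed = 16627, 12137
share = 100 * total_changed / total_urims

print(changed)  # prints 2
print(f"{share:.4f}%")
```

The computed share is just under 73%, consistent with the slide's 72.99% if that figure was truncated rather than rounded.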
  26. A metaphor for replaying archived web pages https://www.youtube.com/watch?v=ekO3Z3XWa0Q https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
  27. King of Swamp Castle: live web/ground truth
      Guard: archival replay
      https://www.youtube.com/watch?v=ekO3Z3XWa0Q
      https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
      $ echo "Make sure the prince doesn't leave this room until I come and get him." | md5
      57facbb2734d36cb823f4230cc07b888
      $ echo "Not to leave the room even if you come and get him." | md5
      3ba0a2359d63f43cbe9e11fb5a179b8d
      $ echo "Until you come and get him, we're not to enter the room." | md5
      ade3539aaa8a6d8724193e9a37f3ca6d
      $ echo "We don't need to do anything apart from just stop him entering the room." | md5
      ea812f5b997aa42a8f293bd1ee536fd0
      $ echo "Oh yes, we'll keep him in here, obviously. But if he had to leave, and we went with him..." | md5
      55d184b77d99eed6367535ef3c05d7aa
      $ echo "Oh, yes of course. I thought you meant him! You know it seemed a bit daft me having to guard him when he's a guard." | md5
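
The same experiment can be sketched in Python. `echo` appends a newline before piping to `md5`, so the helper below (a hypothetical name) hashes the sentence plus `"\n"`; the digests are printed rather than asserted against the slide, since they have not been re-verified here.

```python
import hashlib


def md5_like_echo(sentence: str) -> str:
    """MD5 of the sentence plus the trailing newline that `echo` appends."""
    return hashlib.md5((sentence + "\n").encode("utf-8")).hexdigest()


# The King's instruction followed by two of the Guard's retellings.
retellings = [
    "Make sure the prince doesn't leave this room until I come and get him.",
    "Not to leave the room even if you come and get him.",
    "Until you come and get him, we're not to enter the room.",
]

digests = [md5_like_echo(sentence) for sentence in retellings]
for sentence, digest in zip(retellings, digests):
    print(digest, sentence)

# Every retelling differs from the original, so no two digests match.
all_distinct = len(set(digests)) == len(retellings)
print(all_distinct)  # prints True
```

A hash can say that a retelling differs from the original; it cannot say whether the difference matters.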
  28. Archival replay & blockchain: building a castle in a swamp
      • Fixity checks only work when it’s clear what to hash
        – Hash only the root HTML and modifications are possible via embedded resources (false negatives)
        – Recursively hash all embedded resources and you’ll rarely get the same hash (false positives)
      • There is increasing incentive to attack existing archives and create networks of fake archives
        – http://bit.ly/Weaponized-Web-Archives
      • We are investigating archive-aware hashing functions
        – Vagaries of replay won’t be “fixed”; we need to adapt our hashing
      • Central vs. distributed?
        – Ask yourself: “is this a winner-take-all market?” (hello, FANG)
        – https://blog.dshr.org/2018/09/blockchain-solves-preservation.html
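
As one naive illustration of what "adapting our hashing" to replay could mean, the Python sketch below undoes Wayback-style link rewriting before hashing. This is not the authors' method: the regular expression, the function names, and the sample HTML are assumptions, and a real approach would also have to handle banners, injected scripts, and embedded resources.

```python
import hashlib
import re

# Assumed pattern for Wayback-style rewritten links: a 14-digit timestamp,
# optionally followed by a two-letter modifier such as "im_" or "js_".
REWRITE_PREFIX = re.compile(r"https?://web\.archive\.org/web/\d{14}(?:[a-z]{2}_)?/")


def strip_rewrites(html: str) -> str:
    """Remove the archive prefix so rewritten links match their live form."""
    return REWRITE_PREFIX.sub("", html)


def normalized_hash(html: str) -> str:
    """SHA-256 of the HTML after undoing link rewriting."""
    return hashlib.sha256(strip_rewrites(html).encode("utf-8")).hexdigest()


live = '<a href="https://www.cs.odu.edu/~mln/">home</a>'
replayed = (
    '<a href="https://web.archive.org/web/20181104174441/'
    'https://www.cs.odu.edu/~mln/">home</a>'
)

raw_match = (
    hashlib.sha256(live.encode("utf-8")).hexdigest()
    == hashlib.sha256(replayed.encode("utf-8")).hexdigest()
)
same_after_normalizing = normalized_hash(live) == normalized_hash(replayed)

print(raw_match)               # prints False
print(same_after_normalizing)  # prints True
```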
