CNI Fall 2018 Membership Meeting, 2018-12-11,
@phonedude_mln, @WebSciDL
Blockchain Can Not Be Used To Verify
Replayed Archived Web Pages
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
Supported in part by the Andrew W. Mellon Foundation
and the National Science Foundation
This is not what you think it is…
https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
“…right now you can get timestamps for every book,
movie, song, computer program, legal document,
etc. in the thousands of collections in the archive.
In the future we hope to be able to work with the
Internet Archive to extend this to timestamping
website snapshots…”
TL;DR
Web archiving is not file backup.
Backup = prevent, detect, repair changes
Web archiving = continuous changes to replicate the past
Naïve fixity techniques are not applicable to web archiving.
Monitoring Fixity To Detect Tampering ==
Endless False Positives
https://www.theatlantic.com/business/archive/2016/05/car-alarms-dont-work-why-so-common/482769/
https://auto.howstuffworks.com/car-driving-safety/safety-regulatory-devices/whats-point-car-alarms-nobody-calls-cops.htm
A simplified workflow of web archiving
$ wget
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Filename: foo.warc.warc.gz
WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
Content-Length: 257
software: Wget/1.15 (linux-gnu)
format: WARC File Format 1.0
[much deletia]
1) Live web site: https://climate.nasa.gov/vital-signs/carbon-dioxide/
2) Crawled by any of several archival crawlers
3) Result stored in a WARC file (like tar or zip, but for Web archives)
4) WARC files are indexed, served by replay software (there are several variations of Wayback Machine)
5) User chooses date of capture (Memento-Datetime)
6) Page replayed with banner, rewritten links, etc.
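Step 3 can be made more concrete with a minimal sketch, assuming a WARC record header is a version line followed by `Name: value` fields, as in the warcinfo excerpt above. The function names `parse_warc_header` and `warc_block_digest` are hypothetical, and the digest helper only reproduces the `sha1:<Base32>` form seen in the excerpt; it is not taken from any crawler's code.

```python
import base64
import hashlib

def parse_warc_header(text):
    """Split a WARC record header into (version, {field name: value})."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    version = lines[0].strip()              # e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

def warc_block_digest(block: bytes) -> str:
    """Label a SHA-1 digest in the sha1:<Base32> form shown in the excerpt."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")

header = """WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2018-11-03T17:20:02Z
WARC-Filename: foo.warc.warc.gz
Content-Length: 257
"""
version, fields = parse_warc_header(header)
print(version, fields["WARC-Type"], fields["WARC-Date"])
# The block below is a stand-in, not the real 257-byte warcinfo block.
print(warc_block_digest(b"software: Wget/1.15 (linux-gnu)\r\n"))
```

The digest covers only one record's block, which is part of why a per-record or per-file hash says little about the page a user eventually sees at step 6.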
(apologies to Peter Arnett)
“In order to save the page,
we had to completely change it”
Yes, some archives (including most versions of Wayback) provide “raw” access,
but modifications can still happen (how/why is beyond the scope of this presentation).
I’ve got mad HTML skillz
https://www.cs.odu.edu/~mln/travel.html
Same page, archived at IA
https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
Archival Metadata
The banner tells the user the original URL,
which archive the page resides in,
when it was archived, how many copies, etc.
Links are rewritten to point back
into the archive, not the live web.
$ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5
<body bgcolor=white>
<pre>
-January 31-February 1, 2019, NYC, ACM Publications Board Meeting
$ curl -s https://www.cs.odu.edu/~mln/travel.html | wc
585 2361 26471
$ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | head -5
<script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb-app40.us.archive.org';v.server_ms=208;archive_analytics.send_pageview({});});</script>
<script type="text/javascript" src="/static/js/ait-client-rewrite.js?v=1538596186.0" charset="utf-8"></script>
<script type="text/javascript">
WB_wombat_Init('https://web.archive.org/web', '20181104174441', 'www.cs.odu.edu');
$ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | wc
618 2472 33787
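The effect of those extra bytes on a fixity check can be sketched offline. The bodies below are abbreviated, hypothetical stand-ins for the two `curl` responses (the real ones are 26471 and 33787 bytes). The `raw_urim` helper assumes the `id_` modifier that the Internet Archive's Wayback Machine accepts after the 14-digit datetime; other archives may expose "raw" access differently or not at all.

```python
import hashlib
import re

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-in for the live page (first lines only).
live_body = (b"<body bgcolor=white>\n<pre>\n"
             b"-January 31-February 1, 2019, NYC, ACM Publications Board Meeting\n")
# Stand-in for what replay adds: one injected script element ahead of the page.
injected = b'<script type="text/javascript" src="/static/js/ait-client-rewrite.js"></script>\n'
replayed_body = injected + live_body

same = sha256_hex(live_body) == sha256_hex(replayed_body)
print("hashes match:", same)

def raw_urim(urim: str) -> str:
    """Insert the id_ modifier after the first 14-digit datetime in a URI-M."""
    return re.sub(r"/(\d{14})/", r"/\1id_/", urim, count=1)

print(raw_urim("https://web.archive.org/web/20181104174441/"
               "https://www.cs.odu.edu/~mln/travel.html"))
```

Any injected element, however small, is enough to change the digest, so a hash of the live page cannot be compared directly with a hash of the rewritten replay.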
Same page, archived at archive.today
http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
$ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | head -5
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html style="background-color:#EEEEEE" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#" itemscope itemtype="http://schema.org/Article">
<!--174.109.72.208--><!--curl/7.30.0-->
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name="robots" content="index,noarchive"/>
<meta name="viewport" content="device-width=300, initial-scale=1"/>
<meta property="twitter:card" content="summary"/>
<meta property="twitter:site" content="@archiveis"/>
<meta property="og:type" content="article"/>
<meta property="og:site_name" content="archive.is"/>
<meta property="og:url" content="http://archive.is/l6QdV" itemprop="url"/>
<meta property="og:title" content="https://www.cs.odu.edu/~mln/travel.html"/>
<meta property="twitter:title" content="https://www.cs.odu.edu/~mln/travel.html"/>
<meta property="twitter:description" content="archived 4 Nov 2018 17:46:33 UTC" itemprop="description"/>
<meta property="article:published_time" content="2018-11-04T17:46:33Z" itemprop="dateCreated"/>
<meta property="article:modified_time" content="2018-11-04T17:46:33Z" itemprop="dateModified"/>
<link rel="image_src" href="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/>
<meta property="og:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png" itemprop="image"/>
<meta property="twitter:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/>
<meta property="twitter:image:src" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/>
<meta property="twitter:image:width" content="1024"/>
<meta property="twitter:image:height" content="768"/>
<link rel="icon" href="//www.google.com/s2/favicons?domain=www.cs.odu.edu"/>
<link rel="canonical" href="https://archive.is/l6QdV"/>
<link rel="bookmark" href="http://archive.today/20181104174633/https://www.cs.odu.edu/~mln/travel.html"/>
<title></title></head>
<body style="margin:0;background-color:#EEEEEE"><center><div id="HEADER" style="font-family:sans-serif;background
[much deletia – you get the point]
$ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | wc
730 3640 62392
If we just had isolated, static pages
(e.g., individual jpegs, pdfs, mp3s)
then there’d be no problem.
But HTML has:
1) links,
2) embedded resources (including iframes), and
3) JavaScript, which can modify the HTML.
And HTTP has no “bulk download”,
so you can’t grab an entire site instantaneously.
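To make the three complications concrete, here is a minimal sketch that lists the links and embedded resources a single HTML page depends on, using only Python's standard `html.parser`. The class name `ResourceLister` and the sample markup are hypothetical, and the tag list is deliberately incomplete (no `srcset`, CSS `url()`, or resources requested by scripts at run time).

```python
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    """Collect outgoing links and statically declared embedded resources."""
    EMBED_ATTRS = {"img": "src", "script": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            if attrs.get("href"):
                self.embedded.append(attrs["href"])
        elif tag in self.EMBED_ATTRS and attrs.get(self.EMBED_ATTRS[tag]):
            self.embedded.append(attrs[self.EMBED_ATTRS[tag]])

sample = """<html><head><link rel="stylesheet" href="style.css">
<script src="app.js"></script></head>
<body><a href="next.html">next</a><img src="logo.png">
<iframe src="frame.html"></iframe></body></html>"""

lister = ResourceLister()
lister.feed(sample)
print("links:", lister.links)
print("embedded:", lister.embedded)
```

Each entry in `embedded` is a separate HTTP request, captured at its own time, and anything a script loads later is invisible to this kind of static parse.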
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Filename: climate.nasa.gov.warc.gz
WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
Content-Length: 257
software: Wget/1.15 (linux-gnu)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "--warc-file=climate.nasa.gov" "https://climate.nasa.gov/vital-signs/carbon-dioxide/"
WARC/1.0
WARC-Type: request
WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Content-Type: application/http;msgtype=request
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
WARC-IP-Address: 54.230.195.16
WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Content-Length: 141
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5d8861ef-93c5-4d9c-87b8-4f427f963f7c>
WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Concurrent-To: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
WARC-Date: 2018-11-03T17:20:02Z
We could hash the WARC file:
$ md5sum climate.nasa.gov.warc.gz
652853fe1bc8cb273cdf73aad8a489ca climate.nasa.gov.warc.gz
But this nasa.gov page contains:
• 201 images
• 19 JavaScript files
• 3 CSS files
At a large archive like IA they could be in multiple WARC files; worst case is 224 WARC files (one for the root HTML plus one for each embedded resource).
In general, the WARC file(s) corresponding to the replayed page will be unavailable to the user replaying the page.
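A minimal sketch of both points, assuming nothing beyond the numbers on the slide: `md5_file` is a hypothetical Python counterpart to `md5sum` that reads a file in chunks, and the worst case is simply the root HTML plus every embedded resource landing in its own WARC file. The temporary file holds stand-in bytes, not a real WARC.

```python
import hashlib
import os
import tempfile

def md5_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large WARC files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a throwaway file (stand-in content, not a real WARC).
with tempfile.NamedTemporaryFile(delete=False, suffix=".warc.gz") as tmp:
    tmp.write(b"stand-in bytes, not a real WARC")
    demo_path = tmp.name
demo_digest = md5_file(demo_path)
os.remove(demo_path)
print(demo_digest)

# Counts from the slide: 1 root HTML page + 201 images + 19 scripts + 3 stylesheets.
EMBEDDED = {"images": 201, "javascript": 19, "css": 3}
worst_case_warcs = 1 + sum(EMBEDDED.values())
print(worst_case_warcs)  # 224
```

Hashing the file is easy; the difficulty is that the person replaying the page has no way to obtain the files to hash.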
We can detect changes in the root HTML
https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
But what if the change is in
an embedded resource?
Clearly we need to render the entire page,
then compute the hash.
Unfortunately, that’s not easy.
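The dilemma can be sketched with toy data. Everything below is hypothetical: a "page" is modeled as a root HTML body plus a dictionary of embedded resource bodies, `root_hash` covers only the root, and `composite_hash` folds in every embedded resource. This is one possible way to define a whole-page hash, not the method used in the study.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def root_hash(page: dict) -> str:
    """Hash only the root HTML."""
    return sha256_hex(page["root"])

def composite_hash(page: dict) -> str:
    """Hash the root HTML plus each embedded resource, in a stable URL order."""
    digest = hashlib.sha256(page["root"])
    for url in sorted(page["embedded"]):
        digest.update(url.encode("utf-8"))
        digest.update(sha256_hex(page["embedded"][url]).encode("ascii"))
    return digest.hexdigest()

ROOT = b'<html><img src="banner.jpg"><script src="app.js"></script></html>'
baseline = {"root": ROOT, "embedded": {"banner.jpg": b"eagle", "app.js": b"rotate()"}}
# Malicious change hidden in an embedded script.
tampered = {"root": ROOT, "embedded": {"banner.jpg": b"eagle", "app.js": b"rotate(); evil()"}}
# Benign variation: a reload serves a different banner image.
reloaded = {"root": ROOT, "embedded": {"banner.jpg": b"tiger", "app.js": b"rotate()"}}

print(root_hash(baseline) == root_hash(tampered))            # True: tampering missed
print(composite_hash(baseline) == composite_hash(tampered))  # False: tampering caught
print(composite_hash(baseline) == composite_hash(reloaded))  # False: benign change flagged
```

The composite hash catches the tampering, but it flags the harmless reload in exactly the same way, and the two cases cannot be told apart from the digests alone.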
Load the archived page, get an eagle
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Hit “reload”, get a tiger
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Hit “reload” again, get a mountain
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
“Look on my Javascript, ye Mighty, and despair!”
Actually, the fws.gov example was super easy;
most changes are much harder to trace.
Mohamed Aturban, unpublished, memento:
http://web.archive.org/web/20130724144801/http://www.cnn.com/
Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
Temporal violations:
reconstructing pages that never existed on the live web
(examples below are transient; sometimes you get the 1st image, sometimes the 2nd image)
Embedded in umich.edu memento, archived in perma.cc:
2nd image is compressed (12209 vs. 19448 bytes); 2nd image modified in 2017-03, but replayed in a 2017-01 page
Embedded in copyblogger.com memento, archived in archive.org:
2nd image modified in 2017-12, but replayed in a 2017-11 page; blackout for privacy
Temporal violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
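One simple form of this check can be sketched as follows, assuming the root page's capture time can be read from the 14-digit datetime in its URI-M and that the embedded resource carries a last-modified time. The function names are hypothetical, and the exact day in March 2017 is invented for illustration (the slide gives only the month). The linked blog post describes a fuller treatment of temporal violations than this one comparison.

```python
import re
from datetime import datetime, timezone

def memento_datetime(urim: str) -> datetime:
    """Read the 14-digit capture datetime (YYYYMMDDhhmmss) out of a URI-M."""
    match = re.search(r"/(\d{14})/", urim)
    if match is None:
        raise ValueError("no 14-digit datetime found in " + urim)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def is_temporal_violation(root_urim: str, embedded_last_modified: datetime) -> bool:
    """True when the embedded resource was modified after the root page was captured."""
    return embedded_last_modified > memento_datetime(root_urim)

root = "http://perma-archives.org/warc/20170101182813/http://umich.edu/"
# Hypothetical day; the slide only says the 2nd image was modified in 2017-03.
image_modified = datetime(2017, 3, 1, tzinfo=timezone.utc)
print(is_temporal_violation(root, image_modified))
```

A resource modified in March cannot have been part of the page as it looked in January, so the replay shows a combination that never existed on the live web.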
1 WARC file, 2 Wayback Machines, 3 Browsers
= 6 different replays
http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/
http://web.archive.org/web/20130106140348/http://www.harvard.edu/
See also: https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
Experiment Design
Sample 16k+ Mementos from 17 Web Archives
Archive                    URI-Ms
---------------------------------
perma-archives.org            182
bibalex.org                   199
webarchive.org.uk             349
bac-lac.gc.ca                 351
proni.gov.uk                  469
digar.ee                      488
webharvest.gov                712
internetmemory.org            979
nationalarchives.gov.uk       994
stanford.edu                 1222
archive-it.org               1383
archive.is                   1396
web.archive.org              1566
arquivo.pt                   1569
webcitation.org              1585
vefsafn.is                   1589
loc.gov                      1594
---------------------------------
Total                       16627
Periodically Replay Each Archived Page
Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/
35 times, from Nov. 2017 – Oct. 2018
For each replay, we download both the rewritten version and the “raw” version (where possible).
Partial archive outage because
of security / maintenance upgrade
Post-upgrade, replay is variable.
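The bookkeeping for such an experiment might look like the sketch below. It is an illustration under stated assumptions, not the study's actual harness: `replay_once` and `distinct_hashes` are hypothetical names, the fetcher is injected so the example runs offline, a return value of `None` stands for an outage, and the three date labels are invented points within the Nov. 2017 to Oct. 2018 window.

```python
import hashlib
from collections import defaultdict
from itertools import cycle

def replay_once(urim, fetch, log, when):
    """Fetch one replay of urim and append (when, digest or None) to the log."""
    body = fetch(urim)
    digest = hashlib.sha256(body).hexdigest() if body is not None else None
    log[urim].append((when, digest))
    return digest

def distinct_hashes(log, urim):
    """Set of digests observed for urim, ignoring failed downloads."""
    return {digest for _, digest in log[urim] if digest is not None}

# Offline stand-in for HTTP: one replay, an outage, then a different replay.
variants = cycle([b"<html>replay A</html>", None, b"<html>replay B</html>"])

def fake_fetch(urim):
    return next(variants)

log = defaultdict(list)
URIM = "http://perma-archives.org/warc/20170101182813/http://umich.edu/"
for when in ("2017-11", "2018-03", "2018-10"):
    replay_once(URIM, fake_fetch, log, when)

print(len(distinct_hashes(log, URIM)))
```

Recording failures as `None` rather than skipping them keeps outages, like the one shown above, visible in the timeline instead of silently shortening it.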
In 11 months,
11% of the URLs Disappeared or Changed
820 were renamed & required manual rediscovery
979 disappeared & have not yet been rediscovered
europarchive.org became internetmemory.org
URI-Ms like this:
collection.europarchive.org/nli/20130117165443/http://bbc.co.uk/news/
changed domains and became like this:
collections.internetmemory.org/nli/20130117165443/http://bbc.co.uk/news/
europarchive.org
europarchive.org is now spam
internetmemory.org is now down
979 pages lost
curl -I collection.europarchive.org/nli/20130117165443/http:/bbc.co.uk/news/
HTTP/1.1 301 Moved Permanently
Date: Mon, 10 Dec 2018 04:30:50 GMT
Server: Apache
Expires: Mon, 10 Dec 2018 05:30:50 GMT
Cache-Control: max-age=3600
Location: http://europarchive.org
Connection: close
Content-Type: text/html; charset=UTF-8
curl -I collections.internetmemory.org/nli/20130117165443/http://bbc.co.uk/news/
HTTP/1.1 403 Forbidden
Date: Mon, 10 Dec 2018 04:31:51 GMT
Server: Varnish
X-Varnish: 71167297
Content-Type: text/html; charset=utf-8
Retry-After: 5
Content-Length: 252
Connection: keep-alive
Telling humans your domain is about to change
is nice, but please tell robots too…
https://web.archive.org/web/20180104021440/http://europarchive.org/
See: https://tools.ietf.org/id/draft-wilde-sunset-header-03.html
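What "telling robots" could look like on the client side is sketched below, assuming the server sends a `Sunset` response header whose value is an HTTP-date, as the linked draft proposes. The header value used here is hypothetical; europarchive.org is not known to have sent one. The function names are also hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def sunset_from_headers(headers: dict):
    """Return the Sunset time as a datetime, or None if the header is absent."""
    for name, value in headers.items():
        if name.lower() == "sunset":
            return parsedate_to_datetime(value)
    return None

def days_until_sunset(headers: dict, now: datetime):
    """Whole days remaining before the resource is expected to go away."""
    sunset = sunset_from_headers(headers)
    return None if sunset is None else (sunset - now).days

# Hypothetical response headers for an archive announcing a shutdown.
headers = {"Content-Type": "text/html", "Sunset": "Mon, 31 Dec 2018 23:59:59 GMT"}
now = datetime(2018, 12, 10, 4, 30, 50, tzinfo=timezone.utc)
print(days_until_sunset(headers, now))  # 21
```

A crawler or link checker that sees such a header can schedule rediscovery before the URI-Ms break rather than after.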
webarchive.proni.gov.uk now uses Archive-It
The top-level site webarchive.proni.gov.uk still exists,
but deep links to URI-Ms are now 404.
469 pages required manual rediscovery.
curl -I http://webarchive.proni.gov.uk/20111215021058/http://women.sohu.com/
HTTP/1.1 404 Not Found
Date: Mon, 10 Dec 2018 05:15:56 GMT
Server: Apache/2.4.18 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1
curl -I https://wayback.archive-it.org/11112/20111215021058/http://women.sohu.com/
HTTP/1.1 200 OK
...
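Once a renaming pattern has been found by hand, it can be written down as a prefix rewrite. The sketch below encodes only the two patterns visible in these slides (europarchive.org to internetmemory.org, and webarchive.proni.gov.uk to Archive-It collection 11112). The `REWRITES` table and `relocate_urim` are hypothetical names, and nothing here discovers a new location automatically.

```python
import re

# (old host/path prefix, new host/path prefix), taken from the examples above.
REWRITES = [
    ("collection.europarchive.org/", "collections.internetmemory.org/"),
    ("webarchive.proni.gov.uk/", "wayback.archive-it.org/11112/"),
]

def relocate_urim(urim: str):
    """Apply the first matching prefix rewrite; return None if none applies."""
    # Peel off only a leading scheme; the URI-M also embeds the original URL's scheme.
    match = re.match(r"^(https?://)?(.*)$", urim)
    scheme, rest = match.group(1) or "", match.group(2)
    for old, new in REWRITES:
        if rest.startswith(old):
            return scheme + new + rest[len(old):]
    return None

print(relocate_urim("http://webarchive.proni.gov.uk/20111215021058/http://women.sohu.com/"))
print(relocate_urim("collection.europarchive.org/nli/20130117165443/http://bbc.co.uk/news/"))
```

The sketch keeps whatever scheme the old URI-M had, whereas the Archive-It example above was requested over https; a fuller table would record the scheme per rule.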
www.collectionscanada.gc.ca became
webarchive.bac-lac.gc.ca:8080
(no, really – port 8080)
And deep links to URI-Ms now redirect to the top of the new site,
which means 351 pages required manual rediscovery:
$ curl -IL http://www.collectionscanada.gc.ca/webarchives/20061027192435/http://www.state.gov/
HTTP/1.0 302 Found
Location: http://www.bac-lac.gc.ca/eng/discover/archives-web-government/Pages/web-archives.aspx
...
HTTP/1.1 302 Found
Location: http://webarchive.bac-lac.gc.ca/?lang=en
Date: Mon, 10 Dec 2018 04:46:24 GMT
...
HTTP/1.1 200
Date: Mon, 10 Dec 2018 04:46:24 GMT
...
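A redirect chain that ends in `200` is not proof that the memento survived. One heuristic, sketched here under the assumption that a surviving URI-M keeps both its 14-digit datetime and its original URL somewhere in the final location, is to compare the two. `urim_parts` and `redirect_preserves_memento` are hypothetical names.

```python
import re

def urim_parts(urim: str):
    """Split a URI-M into (14-digit datetime, original URL), or None if no match."""
    match = re.search(r"/(\d{14})/(.+)$", urim)
    return (match.group(1), match.group(2)) if match else None

def redirect_preserves_memento(original_urim: str, final_url: str) -> bool:
    """True when the final URL still names the same datetime and original URL."""
    parts = urim_parts(original_urim)
    if parts is None:
        return False
    timestamp, original_url = parts
    return timestamp in final_url and original_url in final_url

# Deep link that lands on the new site's front page: the memento is effectively lost.
print(redirect_preserves_memento(
    "http://www.collectionscanada.gc.ca/webarchives/20061027192435/http://www.state.gov/",
    "http://webarchive.bac-lac.gc.ca/?lang=en"))
# Deep link whose new location still identifies the same capture.
print(redirect_preserves_memento(
    "http://webarchive.proni.gov.uk/20111215021058/http://women.sohu.com/",
    "https://wayback.archive-it.org/11112/20111215021058/http://women.sohu.com/"))
```

The first case prints `False` and the second `True`, which matches the slides: the collectionscanada.gc.ca links needed manual rediscovery even though the final status was 200.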
Archive Name              URI-Ms   URI-Ms with at least two hashes
------------------------------------------------------------------
webarchive.loc.gov         1,594    1,235 (77.47%)
vefsafn.is                 1,589    1,133 (71.30%)
webcitation.org            1,585      981 (61.89%)
arquivo.pt                 1,569    1,563 (99.61%)
archive.org                1,566    1,430 (91.31%)
archive.is                 1,396    1,364 (97.70%)
archive-it.org             1,383    1,383 (100%)
swap.stanford.edu          1,222    1,005 (82.24%)
nationalarchives.gov.uk      994      978 (98.39%)
internetmemory.org           979      979 (100%)
webharvest.gov               712      712 (100%)
digar.ee                     488      308 (63.11%)
proni.gov.uk                 469      469 (100%)
bac-lac.gc.ca                351      351 (100%)
webarchive.org.uk            349      348 (99.71%)
archive.bibalex.org          199      199 (100%)
perma-archives.org           182      182 (100%)
------------------------------------------------------------------
total                     16,627   14,620 (87.92%)
You cannot replay twice the same archived page
(apologies to Heraclitus)
7 out of 8 pages produced > 1 hash over 11 months
More Archived Pages Changed Every Time
Than Never Changed
Never changed:
2007 URI-Ms (1 in 8)
Always changed:
2773 URI-Ms (1 in 6)
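These categories can be sketched as a small classifier over each URI-M's list of replay hashes. The definitions are assumptions made for illustration: "never changed" means one distinct hash across all replays, and "always changed" means every replay produced a hash not seen before. The `observations` data is invented; only the three totals at the bottom come from the slides.

```python
from collections import Counter

def classify(hashes):
    """Label one URI-M by how its replay hashes varied."""
    if not hashes:
        raise ValueError("need at least one replay hash")
    distinct = len(set(hashes))
    if distinct == 1:
        return "never changed"
    if distinct == len(hashes):
        return "always changed"
    return "sometimes changed"

# Invented replay histories for three URI-Ms.
observations = {
    "urim-a": ["h1", "h1", "h1"],
    "urim-b": ["h1", "h2", "h3"],
    "urim-c": ["h1", "h1", "h2"],
}
tally = Counter(classify(hashes) for hashes in observations.values())
print(dict(tally))

# Totals from the slides: all URI-Ms, those with at least two hashes, always-changed.
TOTAL, CHANGED, ALWAYS = 16627, 14620, 2773
never = TOTAL - CHANGED
print(never, round(TOTAL / never), round(TOTAL / ALWAYS))  # 2007 8 6
```

The last line reproduces the slide's figures: 16,627 minus 14,620 leaves 2,007 unchanged URI-Ms, roughly 1 in 8, against roughly 1 in 6 that changed on every replay.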
A metaphor for replaying archived web pages
https://www.youtube.com/watch?v=ekO3Z3XWa0Q
https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
King of Swamp Castle: live web/ground truth
Guard: archival replay
$ echo "Make sure the prince doesn't leave this room until I come and get him." | md5
57facbb2734d36cb823f4230cc07b888
$ echo "Not to leave the room even if you come and get him." | md5
3ba0a2359d63f43cbe9e11fb5a179b8d
$ echo "Until you come and get him, we're not to enter the room." | md5
ade3539aaa8a6d8724193e9a37f3ca6d
$ echo "We don't need to do anything apart from just stop him entering the room." | md5
ea812f5b997aa42a8f293bd1ee536fd0
$ echo "Oh yes, we'll keep him in here, obviously. But if he had to leave, and we went with him..." | md5
55d184b77d99eed6367535ef3c05d7aa
$ echo "Oh, yes of course. I thought you meant him! You know it seemed a bit daft me having to guard him when he's a guard." | md5
Archival replay & blockchain:
building a castle in a swamp
• Fixity checks only work when it’s clear what to hash
  – Hash only the root HTML and modifications are possible via embedded resources (false negatives)
  – Recursively hash all embedded resources and you’ll rarely get the same hash (false positives)
• Replay is working as designed; it’s not something that will be “fixed”
  – We need server-side support for auditing, and archive-aware hashing functions
• There is increasing incentive to attack existing archives and create networks of fake archives
  – http://bit.ly/Weaponized-Web-Archives
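As a purely illustrative sketch of what an "archive-aware" hash might do on the client side, the code below undoes two replay-time changes before hashing: it removes a banner block and strips the archive prefix from rewritten links. The banner comment markers, the prefix pattern, and the function names are all assumptions, and the approach does nothing about JavaScript-driven or temporal variation, which is why the slide asks for server-side support.

```python
import hashlib
import re

# Assumed markers around an injected banner; real archives may use others or none.
BANNER = re.compile(
    rb"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.S,
)

def unrewrite(body: bytes, replay_prefix: bytes) -> bytes:
    """Drop the banner and strip '<prefix><14 digits>[xx_]/' from rewritten URLs."""
    body = BANNER.sub(b"", body)
    rewritten = re.compile(re.escape(replay_prefix) + rb"\d{14}(?:[a-z]{2}_)?/")
    return rewritten.sub(b"", body)

def archive_aware_hash(body: bytes, replay_prefix: bytes) -> str:
    return hashlib.sha256(unrewrite(body, replay_prefix)).hexdigest()

original = b'<a href="http://example.com/a">a</a>'
replayed = (b"<!-- BEGIN WAYBACK TOOLBAR INSERT --><div>banner</div>"
            b"<!-- END WAYBACK TOOLBAR INSERT -->"
            b'<a href="https://web.archive.org/web/20181104174441/http://example.com/a">a</a>')

prefix = b"https://web.archive.org/web/"
print(archive_aware_hash(replayed, prefix) == hashlib.sha256(original).hexdigest())
```

Each archive would need its own normalization rules, and those rules would change whenever the replay software does, so a client-side function like this one would be fragile without cooperation from the archive.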

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

  • 1.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Blockchain Can Not Be Used To Verify Replayed Archived Web Pages Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group @WebSciDL, @phonedude_mln With: ODU: Michele C. Weigle, Mohamed Aturban Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein Supported in part by The Andrew Mellon Foundation and the National Science Foundation
  • 2.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
  • 3.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps “…right now you can get timestamps for every book, movie, song, computer program, legal document, etc. in the thousands of collections in the archive. In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots…”
  • 4.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL TL;DR Web archiving is not file backup. Backup = prevent, detect, repair changes Web archiving = continuous changes to replicate the past Naïve fixity techniques are not applicable for web archiving.
  • 5.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Monitoring Fixity To Detect Tampering == Endless False Positives https://www.theatlantic.com/business/archive/2016/05/car-alarms-dont-work-why-so-common/482769/ https://auto.howstuffworks.com/car-driving-safety/safety-regulatory-devices/whats-point-car-alarms-nobody-calls-cops.htm
  • 6.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL A simplified workflow of web archiving $ wget WARC/1.0 WARC-Type: warcinfo Content-Type: application/warc-fields WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7- 4f03-9de2-e578d105d3cb> WARC-Filename: foo.warc.warc.gz WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E Content-Length: 257 software: Wget/1.15 (linux-gnu) format: WARC File Format 1.0 [much deletia] 1) live web site https://climate.nasa.gov/vital-signs/carbon-dioxide/ 2) Crawled by any of several archival crawlers 3) Result stored in a WARC File (like tar or zip, but for Web archives) 4) WARC files are indexed, served by replay software (there are several variations of Wayback Machine) 5) User chooses date of capture (Memento-Datetime) 6) Page replayed with banner, rewritten links, etc.
  • 7.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL (apologies to Peter Arnett) “In order to save the page, we had to completely change it” Yes, some archives (including most versions of Wayback) provide “raw” access, but modifications can still happen (how/why is beyond the scope of this presentation).
  • 8.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL I’ve got mad HTML skillz https://www.cs.odu.edu/~mln/travel.html
  • 9.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html Archival Metadata The banner tells the user the original URL, which archive the page resides in, when it was archived, how many copies, etc. Links are rewritten to point back into the archive, not the live web.
  • 10.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5 <body bgcolor=white> <pre> -January 31-February 1, 2019, NYC, ACM Publications Board Meeting $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc 585 2361 26471
  • 11.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5 <body bgcolor=white> <pre> -January 31-February 1, 2019, NYC, ACM Publications Board Meeting $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc 585 2361 26471 $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | head -5 <script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script> <script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb- app40.us.archive.org’;v.server_ms=208;archive_analytics.send_pageview({});});</script> <script type="text/javascript" src="/static/js/ait-client-rewrite.js?v=1538596186.0" charset="utf- 8"></script> <script type="text/javascript"> WB_wombat_Init('https://web.archive.org/web', '20181104174441', 'www.cs.odu.edu'); $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | wc 618 2472 33787
  • 12.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Same page, archived at archive.today http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
  • 13.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Same page, archived at archive.today http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | head -5 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html style="background- color:#EEEEEE" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#" itemscope itemtype="http://schema.org/Article"><!--174.109.72.208--><!--curl/7.30.0--><head><meta http- equiv="Content-Type" content="text/html;charset=utf-8"/><meta name="robots" content="index,noarchive"/><meta name="viewport" content="device-width=300, initial-scale=1"/><meta property="twitter:card" content="summary"/><meta property="twitter:site" content="@archiveis"/><meta property="og:type" content="article"/><meta property="og:site_name" content="archive.is"/><meta property="og:url" content="http://archive.is/l6QdV" itemprop="url"/><meta property="og:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:description" content="archived 4 Nov 2018 17:46:33 UTC" itemprop="description"/><meta property="article:published_time" content="2018-11-04T17:46:33Z" itemprop="dateCreated"/><meta property="article:modified_time" content="2018-11-04T17:46:33Z" itemprop="dateModified"/><link rel="image_src" href="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="og:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png" itemprop="image"/><meta property="twitter:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:src" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:width" content="1024"/><meta 
property="twitter:image:height" content="768"/><link rel="icon" href="//www.google.com/s2/favicons?domain=www.cs.odu.edu"/><link rel="canonical" href="https://archive.is/l6QdV"/><link rel="bookmark" href="http://archive.today/20181104174633/https://www.cs.odu.edu/~mln/travel.html"/><title></title>< /head><body style="margin:0;background-color:#EEEEEE"><center><div id="HEADER" style="font- family:sans-serif;background [much deletia – you get the point] $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | wc 730 3640 62392
  • 14.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL If we just had isolated, static pages (e.g., individual jpegs, pdfs, mp3s) then there’d be no problem. But HTML has: 1) links, 2) embedded resources (including iframes), and 3) Javascript, which can modify the HTML. And HTTP has no “bulk download”, so you can’t grab an entire site instantaneously.
  • 15.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL WARC/1.0 WARC-Type: warcinfo Content-Type: application/warc-fields WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Filename: climate.nasa.gov.warc.gz WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E Content-Length: 257 software: Wget/1.15 (linux-gnu) format: WARC File Format 1.0 conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf robots: classic wget-arguments: "--warc-file=climate.nasa.gov" "https://climate.nasa.gov/vital-signs/carbon-dioxide/" WARC/1.0 WARC-Type: request WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Content-Type: application/http;msgtype=request WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638> WARC-IP-Address: 54.230.195.16 WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Content-Length: 141 GET /vital-signs/carbon-dioxide/ HTTP/1.1 User-Agent: Wget/1.15 (linux-gnu) Accept: */* Host: climate.nasa.gov Connection: Keep-Alive WARC/1.0 WARC-Type: response WARC-Record-ID: <urn:uuid:5d8861ef-93c5-4d9c-87b8-4f427f963f7c> WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Concurrent-To: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638> WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ WARC-Date: 2018-11-03T17:20:02Z We could hash the WARC file $ md5sum climate.nasa.gov.warc.gz 652853fe1bc8cb273cdf73aad8a489ca climate.nasa.gov.warc.gz But this nasa.gov page contains: •201 images •19 Javascript files •3 CSS files At a large archive like IA they could be in multiple WARC files; worst case is 224 WARC files. In general, the WARC file(s) corresponding to the replayed page will be unavailable to the user replaying the page.
  • 16.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL We can detect changes in the root HTML https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  • 17.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL But what if the change is in an embedded resource?
  • 18.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Clearly we need to render the entire page, then compute the hash. Unfortunately, that’s not easy.
  • 19.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Load the archived page, get an eagle https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 20.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Hit “reload”, get a tiger https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 21.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Hit “reload” again, get a mountain https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 22.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL “Look on my Javascript, ye Mighty, and despair!”
  • 23.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Actually, the fws.gov example was super easy; most changes are much harder to trace. Mohamed Aturban, unpublished, memento: http://web.archive.org/web/20130724144801/http://www.cnn.com/ Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
  • 24.
    CNI Fall 2018Membership Meeting, 2018-12-11, @phonedude_mln, @WebSciDL Temporal violations: reconstructing pages that never existed on the live web (examples below are transient; sometimes you get the 1st image, sometimes the 2nd image) embedded in umich.edu memento, archived in perma.cc 2nd image is compressed (12209 vs. 19448 bytes); 2nd image modified in 2017-03, but replayed in a 2017-01 page embedded in copybogger.com memento, archived in archive.org 2nd image modified in 2017-12, but replayed in a 2017-11 page; blackout for privacy Temporal violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
  • 25.
    1 WARC file, 2 Wayback Machines, 3 Browsers = 6 different replays
    http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/
    http://web.archive.org/web/20130106140348/http://www.harvard.edu/
    See also: https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
  • 26.
    Experiment Design
  • 27.
    Sample 16k+ Mementos from 17 Web Archives

    Archive                    URI-Ms
    ---------------------------------
    perma-archives.org            182
    bibalex.org                   199
    webarchive.org.uk             349
    bac-lac.gc.ca                 351
    proni.gov.uk                  469
    digar.ee                      488
    webharvest.gov                712
    internetmemory.org            979
    nationalarchives.gov.uk       994
    stanford.edu                 1222
    archive-it.org               1383
    archive.is                   1396
    web.archive.org              1566
    arquivo.pt                   1569
    webcitation.org              1585
    vefsafn.is                   1589
    loc.gov                      1594
    ---------------------------------
    Total                       16627
  • 28.
    Periodically Replay Each Archived Page
    Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/
    35 times, from Nov. 2017 – Oct. 2018
    For each replay, we download both the rewritten version and the “raw” version (where possible).
  • 29.
    Periodically Replay Each Archived Page
    Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/
    35 times, from Nov. 2017 – Oct. 2018
    For each replay, we download both the rewritten version and the “raw” version (where possible).
    Partial archive outage because of a security / maintenance upgrade
  • 30.
    Periodically Replay Each Archived Page
    Above example: http://perma-archives.org/warc/20170101182813/http://umich.edu/
    35 times, from Nov. 2017 – Oct. 2018
    For each replay, we download both the rewritten version and the “raw” version (where possible).
    Post-upgrade, replay is variable.
  • 31.
    In 11 months, 11% of the URLs Disappeared or Changed
    • 820 were renamed & required manual rediscovery
    • 979 disappeared & have not yet been rediscovered
  • 32.
    europarchive.org became internetmemory.org
    URI-Ms like this:
    collection.europarchive.org/nli/20130117165443/http://bbc.co.uk/news/
    changed domains and became like this:
    collections.internetmemory.org/nli/20130117165443/http://bbc.co.uk/news/
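The domain change can be sketched as a host rewrite. This assumes the only difference is the host name, as in the single example on the slide; whether every path carried over unchanged is not confirmed by the source, and `rewrite_urim` is an illustrative name.

```python
# Host names taken from the example URI-Ms on the slide.
OLD_HOST = "collection.europarchive.org"
NEW_HOST = "collections.internetmemory.org"


def rewrite_urim(urim: str) -> str:
    """Return the URI-M with the old archive host swapped for the new one.

    URI-Ms embed a second URL in their path, so only the leading scheme
    (if any) and the first path segment boundary are inspected.
    """
    prefix, rest = "", urim
    for scheme in ("http://", "https://"):
        if urim.startswith(scheme):
            prefix, rest = scheme, urim[len(scheme):]
            break
    host, slash, path = rest.partition("/")
    if host.lower() != OLD_HOST:
        return urim  # not an europarchive.org URI-M; leave untouched
    return f"{prefix}{NEW_HOST}{slash}{path}"


old = "collection.europarchive.org/nli/20130117165443/http://bbc.co.uk/news/"
print(rewrite_urim(old))
```

A rewrite like this only helps while the new host still serves the mementos, which the next slide shows is no longer the case.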
  • 33.
    europarchive.org is now spam
    internetmemory.org is now down
    979 pages lost

    curl -I collection.europarchive.org/nli/20130117165443/http:/bbc.co.uk/news/
    HTTP/1.1 301 Moved Permanently
    Date: Mon, 10 Dec 2018 04:30:50 GMT
    Server: Apache
    Expires: Mon, 10 Dec 2018 05:30:50 GMT
    Cache-Control: max-age=3600
    Location: http://europarchive.org
    Connection: close
    Content-Type: text/html; charset=UTF-8

    curl -I collections.internetmemory.org/nli/20130117165443/http://bbc.co.uk/news/
    HTTP/1.1 403 Forbidden
    Date: Mon, 10 Dec 2018 04:31:51 GMT
    Server: Varnish
    X-Varnish: 71167297
    Content-Type: text/html; charset=utf-8
    Retry-After: 5
    Content-Length: 252
    Connection: keep-alive
  • 34.
    Telling humans your domain is about to change is nice, but please tell robots too…
    https://web.archive.org/web/20180104021440/http://europarchive.org/
    See: https://tools.ietf.org/id/draft-wilde-sunset-header-03.html
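A minimal sketch of how a robot might read such a signal, assuming the server sends a `Sunset` response header whose value is an HTTP-date, as the linked draft describes. The header value below is hypothetical; none of the responses shown in these slides include the header.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def sunset_time(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Sunset date from response headers, or None if absent or invalid."""
    for name, value in headers.items():
        if name.lower() == "sunset":
            try:
                parsed = parsedate_to_datetime(value)
            except (TypeError, ValueError):
                return None
            # Treat a date without zone information as UTC (an assumption).
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
    return None


def time_until_sunset(headers: Mapping[str, str], now: datetime) -> Optional[timedelta]:
    """Return how long the resource is expected to remain available."""
    sunset = sunset_time(headers)
    return None if sunset is None else sunset - now


# Hypothetical response headers and a fixed "current" time for the example.
headers = {"Sunset": "Mon, 31 Dec 2018 23:59:59 GMT"}
now = datetime(2018, 12, 10, 4, 30, 50, tzinfo=timezone.utc)
remaining = time_until_sunset(headers, now)
print(remaining.days)  # 21
```

A crawler that tracks URI-Ms could log this value on every visit and flag archives whose mementos are about to move or disappear.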
  • 35.
    webarchive.proni.gov.uk now uses Archive-It
    The top-level site webarchive.proni.gov.uk still exists, but deep links to URI-Ms are now 404.
    469 pages required manual rediscovery.

    curl -I http://webarchive.proni.gov.uk/20111215021058/http://women.sohu.com/
    HTTP/1.1 404 Not Found
    Date: Mon, 10 Dec 2018 05:15:56 GMT
    Server: Apache/2.4.18 (Ubuntu)
    Content-Type: text/html; charset=iso-8859-1

    curl -I https://wayback.archive-it.org/11112/20111215021058/http://women.sohu.com/
    HTTP/1.1 200 OK
    ...
  • 36.
    www.collectionscanada.gc.ca became webarchive.bac-lac.gc.ca:8080 (no, really – port 8080)
    And deep links to URI-Ms now redirect to the top of the new site, which means 351 pages required manual rediscovery:

    $ curl -IL http://www.collectionscanada.gc.ca/webarchives/20061027192435/http://www.state.gov/
    HTTP/1.0 302 Found
    Location: http://www.bac-lac.gc.ca/eng/discover/archives-web-government/Pages/web-archives.aspx
    ...
    HTTP/1.1 302 Found
    Location: http://webarchive.bac-lac.gc.ca/?lang=en
    Date: Mon, 10 Dec 2018 04:46:24 GMT
    ...
    HTTP/1.1 200
    Date: Mon, 10 Dec 2018 04:46:24 GMT
    ...
  • 37.
    You cannot replay twice the same archived page (apologies to Heraclitus)

    Archive Name               URI-Ms   URI-Ms with at least two hashes
    -------------------------------------------------------------------
    webarchive.loc.gov          1,594    1,235 (77.47%)
    vefsafn.is                  1,589    1,133 (71.30%)
    webcitation.org             1,585      981 (61.89%)
    arquivo.pt                  1,569    1,563 (99.61%)
    archive.org                 1,566    1,430 (91.31%)
    archive.is                  1,396    1,364 (97.70%)
    archive-it.org              1,383    1,383 (100%)
    swap.stanford.edu           1,222    1,005 (82.24%)
    nationalarchives.gov.uk       994      978 (98.39%)
    internetmemory.org            979      979 (100%)
    webharvest.gov                712      712 (100%)
    digar.ee                      488      308 (63.11%)
    proni.gov.uk                  469      469 (100%)
    bac-lac.gc.ca                 351      351 (100%)
    webarchive.org.uk             349      348 (99.71%)
    archive.bibalex.org           199      199 (100%)
    perma-archives.org            182      182 (100%)
    -------------------------------------------------------------------
    total                      16,627   14,620 (87.92%)

    7 out of 8 pages produced > 1 hash over 11 months
  • 38.
    More Archived Pages Changed Every Time Than Never Changed
    • Never changed: 2007 URI-Ms (1 in 8)
    • Always changed: 2773 URI-Ms (1 in 6)
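One way to derive such categories from per-URI-M hash histories is sketched below. The data is made up, and reading "always changed" as "every replay's hash differs from the previous replay's" is an assumption, not a definition taken from the study.

```python
from collections import Counter


def classify(hashes: list[str]) -> str:
    """Classify one URI-M by the hashes of its successive replays."""
    if not hashes:
        raise ValueError("need at least one replay hash")
    if len(set(hashes)) == 1:
        return "never changed"
    # Assumed definition: each replay differs from the one before it.
    if all(prev != cur for prev, cur in zip(hashes, hashes[1:])):
        return "always changed"
    return "sometimes changed"


# Hypothetical hash histories, keyed by URI-M (values are shortened hashes).
replays = {
    "archive-a/20130101/http://example.com/": ["aa", "aa", "aa"],
    "archive-b/20130101/http://example.org/": ["aa", "bb", "cc"],
    "archive-c/20130101/http://example.net/": ["aa", "aa", "bb"],
}

summary = Counter(classify(h) for h in replays.values())
multi_hash = sum(1 for h in replays.values() if len(set(h)) >= 2)
print(summary)
print(f"{multi_hash} of {len(replays)} URI-Ms produced at least two hashes")
```

Under this reading, every URI-M that is not "never changed" counts toward the "at least two hashes" column of the table.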
  • 39.
    A metaphor for replaying archived web pages
    https://www.youtube.com/watch?v=ekO3Z3XWa0Q
    https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
  • 40.
    King of Swamp Castle: live web/ground truth
    Guard: archival replay
    https://www.youtube.com/watch?v=ekO3Z3XWa0Q
    https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail

    $ echo "Make sure the prince doesn't leave this room until I come and get him." | md5
    57facbb2734d36cb823f4230cc07b888
    $ echo "Not to leave the room even if you come and get him." | md5
    3ba0a2359d63f43cbe9e11fb5a179b8d
    $ echo "Until you come and get him, we're not to enter the room." | md5
    ade3539aaa8a6d8724193e9a37f3ca6d
    $ echo "We don't need to do anything apart from just stop him entering the room." | md5
    ea812f5b997aa42a8f293bd1ee536fd0
    $ echo "Oh yes, we'll keep him in here, obviously. But if he had to leave, and we went with him..." | md5
    55d184b77d99eed6367535ef3c05d7aa
    $ echo "Oh, yes of course. I thought you meant him! You know it seemed a bit daft me having to guard him when he's a guard." | md5
  • 41.
    Archival replay & blockchain: building a castle in a swamp
    • Fixity checks only work when it’s clear what to hash
      – Hash only the root HTML and modifications are possible via embedded resources (false negatives)
      – Recursively hash all embedded resources and you’ll rarely get the same hash (false positives)
    • Replay is working as designed; it’s not something that will be “fixed”
      – We need server-side support for auditing, and archive-aware hashing functions
    • There is increasing incentive to attack existing archives and create networks of fake archives
      – http://bit.ly/Weaponized-Web-Archives