SlideShare a Scribd company logo
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Blockchain Can Not Be Used To Verify
Replayed Archived Web Pages
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @phonedude_mln
With:
ODU: Michele C. Weigle, Mohamed Aturban
Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein
Supported in part by The Andrew Mellon Foundation
and the National Science Foundation
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
This is not what you think it is…
https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
This is not what you think it is…
https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
“…right now you can get timestamps for every book,
movie, song, computer program, legal document,
etc. in the thousands of collections in the archive.
In the future we hope to be able to work with the
Internet Archive to extend this to timestamping
website snapshots…”
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
TL;DR
Web archiving is not file backup.
Backup = prevent, detect, repair changes
Web archiving = continuous changes to replicate the past
Naïve fixity techniques are
not applicable for web archiving.
Since 3rd
party audits are not feasible, as web archives proliferate
verifying web archives will centralize / federate.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
A simplified workflow of web archiving
$ wget
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-
4f03-9de2-e578d105d3cb>
WARC-Filename: foo.warc.warc.gz
WARC-Block-Digest:
sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
Content-Length: 257
software: Wget/1.15 (linux-gnu)
format: WARC File Format 1.0
[much deletia]
1) live web site
https://climate.nasa.gov/vital-signs/carbon-dioxide/
2) Crawled by any of
several archival crawlers 3) Result stored in a WARC File
(like tar or zip, but for Web archives)
4) WARC files are indexed,
served by replay software
(there are several variations of
Wayback Machine)
5) User chooses date of
capture (Memento-Datetime)
6) Page replayed with banner,
rewritten links, etc.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
(apologies to Peter Arnett)
“In order to save the page,
we had to completely change it”
Yes, some archives (including most versions of Wayback) provide “raw” access,
but modifications can still happen (how/why is beyond the scope of this presentation).
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
I’ve got mad HTML skillz
https://www.cs.odu.edu/~mln/travel.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Same page, archived at IA
https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
Archival Metadata
The banner tells the user the original URL,
which archive the page resides in,
when it was archived, how many copies, etc.
Links are rewritten to point back
into the archive, not the live web.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Same page, archived at IA
https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
$ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5
<body bgcolor=white>
<pre>
-January 31-February 1, 2019, NYC, ACM Publications Board Meeting
$ curl -s https://www.cs.odu.edu/~mln/travel.html | wc
585 2361 26471
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Same page, archived at IA
https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html
$ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5
<body bgcolor=white>
<pre>
-January 31-February 1, 2019, NYC, ACM Publications Board Meeting
$ curl -s https://www.cs.odu.edu/~mln/travel.html | wc
585 2361 26471
$ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | head -5
<script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var
v=archive_analytics.values;v.service='wb';v.server_name='wwwb-
app40.us.archive.org’;v.server_ms=208;archive_analytics.send_pageview({});});</script>
<script type="text/javascript" src="/static/js/ait-client-rewrite.js?v=1538596186.0" charset="utf-
8"></script>
<script type="text/javascript">
WB_wombat_Init('https://web.archive.org/web', '20181104174441', 'www.cs.odu.edu');
$ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | wc
618 2472 33787
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Same page, archived at archive.today
http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Same page, archived at archive.today
http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
$ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | head -5
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html style="background-
color:#EEEEEE" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#" itemscope
itemtype="http://schema.org/Article"><!--174.109.72.208--><!--curl/7.30.0--><head><meta http-
equiv="Content-Type" content="text/html;charset=utf-8"/><meta name="robots"
content="index,noarchive"/><meta name="viewport" content="device-width=300, initial-scale=1"/><meta
property="twitter:card" content="summary"/><meta property="twitter:site" content="@archiveis"/><meta
property="og:type" content="article"/><meta property="og:site_name" content="archive.is"/><meta
property="og:url" content="http://archive.is/l6QdV" itemprop="url"/><meta property="og:title"
content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:title"
content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:description"
content="archived 4 Nov 2018 17:46:33 UTC" itemprop="description"/><meta
property="article:published_time" content="2018-11-04T17:46:33Z" itemprop="dateCreated"/><meta
property="article:modified_time" content="2018-11-04T17:46:33Z" itemprop="dateModified"/><link
rel="image_src"
href="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta
property="og:image"
content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"
itemprop="image"/><meta property="twitter:image"
content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta
property="twitter:image:src"
content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta
property="twitter:image:width" content="1024"/><meta property="twitter:image:height"
content="768"/><link rel="icon" href="//www.google.com/s2/favicons?domain=www.cs.odu.edu"/><link
rel="canonical" href="https://archive.is/l6QdV"/><link rel="bookmark"
href="http://archive.today/20181104174633/https://www.cs.odu.edu/~mln/travel.html"/><title></title><
/head><body style="margin:0;background-color:#EEEEEE"><center><div id="HEADER" style="font-
family:sans-serif;background
[much deletia – you get the point]
$ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | wc
730 3640 62392
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
If we just had isolated, static pages
(e.g., individual jpegs, pdfs, mp3s)
then there’d be no problem.
But HTML has:
1) links,
2) embedded resources (including iframes), and
3) Javascript, which can modify the HTML.
And HTTP has no “bulk download”,
so you can’t grab an entire site instantaneously.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Filename: climate.nasa.gov.warc.gz
WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E
Content-Length: 257
software: Wget/1.15 (linux-gnu)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "--warc-file=climate.nasa.gov" "https://climate.nasa.gov/vital-signs/carbon-dioxide/"
WARC/1.0
WARC-Type: request
WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Content-Type: application/http;msgtype=request
WARC-Date: 2018-11-03T17:20:02Z
WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
WARC-IP-Address: 54.230.195.16
WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Content-Length: 141
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5d8861ef-93c5-4d9c-87b8-4f427f963f7c>
WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
WARC-Concurrent-To: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
WARC-Date: 2018-11-03T17:20:02Z
We could hash the
WARC file
$ md5sum climate.nasa.gov.warc.gz
652853fe1bc8cb273cdf73aad8a489ca climate.nasa.gov.warc.gz
But this nasa.gov page contains:
•201 images
•19 Javascript files
•3 CSS files
At a large archive like IA they
could be in multiple WARC files;
worst case is 224 WARC files.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
We can detect changes in the root HTML
https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
But what if the change is in
an embedded resource?
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Clearly we need to render the entire page,
then compute the hash.
Unfortunately, that’s not easy.
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Load the archived page, get an eagle
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Hit “reload”, get a tiger
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Hit “reload” again, get a mountain
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
“Look on my Javascript, ye Mighty, and despair!”
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Actually, the fws.gov example was super easy;
most changes are much harder to trace.
Mohamed Aturban, unpublished, memento:
http://web.archive.org/web/20130724144801/http://www.cnn.com/
Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Temporal violations:
reconstructing pages that never existed on the live web
(examples below are transient; sometimes you get the 1st
image, sometimes the 2nd
image)
embedded in umich.edu memento, archived in perma.cc
2nd
image is compressed (12209 vs. 19448 bytes); 2nd
image modified in 2017-03, but replayed in a 2017-01 page
embedded in copybogger.com memento, archived in archive.org
2nd
image modified in 2017-12, but replayed in a 2017-11 page; blackout for privacy
Temporal violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
1 WARC file, 2 Wayback Machines, 3 Browsers
= 6 different replays
http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/
http://web.archive.org/web/20130106140348/http://www.harvard.edu/
see also. https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Archive URIMs With at least two hashes
----------------------------------------------------------------
webharvest.gov 712 712 (100%)
archive.is 1396 1364 (97.70%)
vefsafn.is 1589 739 (46.50%)
archive-it.org 1383 815 (58.92%)
stanford.edu 1222 831 (68.00%)
internetmemory.org 979 979 (100%)
nationalarchives.gov.uk 994 972 (97.78%)
archive.bibalex.org 199 177 (88.94%)
bac-lac.gc.ca 351 351 (100%)
proni.gov.uk 469 129 (27.50%)
www.webarchive.org.uk 349 329 (94.26%)
www.webcitation.org 1585 828 (52.23%)
veebiarhiiv.digar.ee 488 308 (63.11%)
webarchive.loc.gov 1594 526 (32.99%)
arquivo.pt 1569 1563 (99.61%)
web.archive.org 1566 1334 (85.18%)
perma-archives.org 182 180 (98.90%)
----------------------------------------------------------------
16627 12137 (72.99%)
Data from 35 downloads over an 11 month period (2017-11 – 2018-10), Mohamed Aturban (in preparation)
(apologies to Heraclitus)
You cannot replay twice the same archived page
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
A metaphor for replaying archived web pages
https://www.youtube.com/watch?v=ekO3Z3XWa0Q
https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
King of Swamp Castle: live web/ground truth
Guard: archival replay
https://www.youtube.com/watch?v=ekO3Z3XWa0Q
https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
$ echo "Make sure the prince doesn't leave this room until I come and get him." | md5
57facbb2734d36cb823f4230cc07b888
$ echo "Not to leave the room even if you come and get him." | md5
3ba0a2359d63f43cbe9e11fb5a179b8d
$ echo "Until you come and get him, we're not to enter the room." | md5
ade3539aaa8a6d8724193e9a37f3ca6d
$ echo "We don't need to do anything apart from just stop him entering the room." | md5
ea812f5b997aa42a8f293bd1ee536fd0
$ echo "Oh yes, we'll keep him in here, obviously. But if he had to leave, and we went with him..." | md5
55d184b77d99eed6367535ef3c05d7aa
$ echo "Oh, yes of course. I thought you meant him! You know it seemed a bit daft me having to guard him
when he's a guard." | md5
Symposium on Blockchain and Trusted Repositories, 2018-11-05,
@phonedude_mln, @WebSciDL
Archival replay & blockchain:
building a castle in a swamp
• Fixity checks only work when it’s clear what to hash
– Hash only the root HTML and modifications are possible via embedded
resources (false negatives)
– Recursively hash all embedded resources and you’ll rarely get the
same hash (false positives)
• There is increasing incentive to attack existing archives and
create networks of fake archives
– http://bit.ly/Weaponized-Web-Archives
• We are investigating archive-aware hashing functions
– Vagaries of replay won’t be “fixed”; we need to adapt our hashing
• Central vs. distributed?
– Ask yourself: “is this a winner-take-all market?” (hello, FANG)
– https://blog.dshr.org/2018/09/blockchain-solves-preservation.html

More Related Content

What's hot

What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?
Emily Nimsakont
 

What's hot (20)

Linked Data Patterns
Linked Data PatternsLinked Data Patterns
Linked Data Patterns
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Linked Open Data for Archives
Linked Open Data for ArchivesLinked Open Data for Archives
Linked Open Data for Archives
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
Linked Data: turning the web into a context graph
Linked Data: turning the web into a context graphLinked Data: turning the web into a context graph
Linked Data: turning the web into a context graph
 
21st Century Archival Appraisal
21st Century Archival Appraisal21st Century Archival Appraisal
21st Century Archival Appraisal
 
Detecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web ArchivesDetecting Off-Topic Pages in Web Archives
Detecting Off-Topic Pages in Web Archives
 
To the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly CommunicationTo the Rescue of the Orphans of Scholarly Communication
To the Rescue of the Orphans of Scholarly Communication
 
DHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository InteroperabilityDHUG 2018: Towards Web-Centric Repository Interoperability
DHUG 2018: Towards Web-Centric Repository Interoperability
 
Discovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDDiscovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCID
 
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsPromises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
 
Life after MARC: Experimenting with Cataloging Tools of the Future
Life after MARC: Experimenting with Cataloging Tools of the FutureLife after MARC: Experimenting with Cataloging Tools of the Future
Life after MARC: Experimenting with Cataloging Tools of the Future
 
Open Data and Data Journalism
Open Data and Data JournalismOpen Data and Data Journalism
Open Data and Data Journalism
 
What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?
 
Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?Linked Data and Libraries: What? Why? How?
Linked Data and Libraries: What? Why? How?
 
Creating Topical Collections: Web Archives vs. Live Web
Creating Topical Collections:Web Archives vs. Live WebCreating Topical Collections:Web Archives vs. Live Web
Creating Topical Collections: Web Archives vs. Live Web
 
GODORT SLDTF 2009 Meeting Outline
GODORT SLDTF 2009 Meeting OutlineGODORT SLDTF 2009 Meeting Outline
GODORT SLDTF 2009 Meeting Outline
 
Life after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the FutureLife after MARC: Cataloging Tools of the Future
Life after MARC: Cataloging Tools of the Future
 
Robust Linking to Web Resources
Robust Linking to Web ResourcesRobust Linking to Web Resources
Robust Linking to Web Resources
 

Similar to Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

Representing the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makersRepresenting the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makers
judell
 
Presenting Your Digital Research
Presenting Your Digital ResearchPresenting Your Digital Research
Presenting Your Digital Research
Shawn Day
 

Similar to Blockchain Can Not Be Used To Verify Replayed Archived Web Pages (20)

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Bridging the Gap Between Print and Digital Environment
Bridging the Gap Between Print and Digital EnvironmentBridging the Gap Between Print and Digital Environment
Bridging the Gap Between Print and Digital Environment
 
Representing the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makersRepresenting the world: How web users become web thinkers and web makers
Representing the world: How web users become web thinkers and web makers
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive ProfilingMementoMap Framework for Flexible and Adaptive Web Archive Profiling
MementoMap Framework for Flexible and Adaptive Web Archive Profiling
 
Informal presentation about RES
Informal presentation about RESInformal presentation about RES
Informal presentation about RES
 
Stream Reasoning : Where We Got So Far
Stream Reasoning: Where We Got So FarStream Reasoning: Where We Got So Far
Stream Reasoning : Where We Got So Far
 
HKU Data Curation MLIM7350 Class 10
HKU Data Curation MLIM7350 Class 10HKU Data Curation MLIM7350 Class 10
HKU Data Curation MLIM7350 Class 10
 
Lodlam presentation v1.0 final al20151104
Lodlam presentation v1.0 final al20151104Lodlam presentation v1.0 final al20151104
Lodlam presentation v1.0 final al20151104
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
Introduction to Omeka
Introduction to OmekaIntroduction to Omeka
Introduction to Omeka
 
Presenting Your Digital Research
Presenting Your Digital ResearchPresenting Your Digital Research
Presenting Your Digital Research
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 

More from Michael Nelson

More from Michael Nelson (20)

Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Uncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pagesUncertainty in replaying archived Twitter pages
Uncertainty in replaying archived Twitter pages
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Web Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed OriginalsWeb Archives at the Nexus of Good Fakes and Flawed Originals
Web Archives at the Nexus of Good Fakes and Flawed Originals
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Summarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniquesSummarizing archival collections using storytelling techniques
Summarizing archival collections using storytelling techniques
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived Pages
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web Archives
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet Archive
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over Time
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content Language
 

Recently uploaded

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

  • 1. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Blockchain Can Not Be Used To Verify Replayed Archived Web Pages Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group @WebSciDL, @phonedude_mln With: ODU: Michele C. Weigle, Mohamed Aturban Los Alamos National Laboratory: Herbert Van de Sompel, Martin Klein Supported in part by The Andrew Mellon Foundation and the National Science Foundation
  • 2. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps
  • 3. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL This is not what you think it is… https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps “…right now you can get timestamps for every book, movie, song, computer program, legal document, etc. in the thousands of collections in the archive. In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots…”
  • 4. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL TL;DR Web archiving is not file backup. Backup = prevent, detect, repair changes Web archiving = continuous changes to replicate the past Naïve fixity techniques are not applicable for web archiving. Since 3rd party audits are not feasible, as web archives proliferate verifying web archives will centralize / federate.
  • 5. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL A simplified workflow of web archiving $ wget WARC/1.0 WARC-Type: warcinfo Content-Type: application/warc-fields WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7- 4f03-9de2-e578d105d3cb> WARC-Filename: foo.warc.warc.gz WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E Content-Length: 257 software: Wget/1.15 (linux-gnu) format: WARC File Format 1.0 [much deletia] 1) live web site https://climate.nasa.gov/vital-signs/carbon-dioxide/ 2) Crawled by any of several archival crawlers 3) Result stored in a WARC File (like tar or zip, but for Web archives) 4) WARC files are indexed, served by replay software (there are several variations of Wayback Machine) 5) User chooses date of capture (Memento-Datetime) 6) Page replayed with banner, rewritten links, etc.
  • 6. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL (apologies to Peter Arnett) “In order to save the page, we had to completely change it” Yes, some archives (including most versions of Wayback) provide “raw” access, but modifications can still happen (how/why is beyond the scope of this presentation).
  • 7. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL I’ve got mad HTML skillz https://www.cs.odu.edu/~mln/travel.html
  • 8. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html Archival Metadata The banner tells the user the original URL, which archive the page resides in, when it was archived, how many copies, etc. Links are rewritten to point back into the archive, not the live web.
  • 9. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5 <body bgcolor=white> <pre> -January 31-February 1, 2019, NYC, ACM Publications Board Meeting $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc 585 2361 26471
  • 10. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Same page, archived at IA https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html $ curl -s https://www.cs.odu.edu/~mln/travel.html | head -5 <body bgcolor=white> <pre> -January 31-February 1, 2019, NYC, ACM Publications Board Meeting $ curl -s https://www.cs.odu.edu/~mln/travel.html | wc 585 2361 26471 $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | head -5 <script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script> <script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb- app40.us.archive.org’;v.server_ms=208;archive_analytics.send_pageview({});});</script> <script type="text/javascript" src="/static/js/ait-client-rewrite.js?v=1538596186.0" charset="utf- 8"></script> <script type="text/javascript"> WB_wombat_Init('https://web.archive.org/web', '20181104174441', 'www.cs.odu.edu'); $ curl -s https://web.archive.org/web/20181104174441/https://www.cs.odu.edu/~mln/travel.html | wc 618 2472 33787
  • 11. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Same page, archived at archive.today http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html
  • 12. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Same page, archived at archive.today http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | head -5 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html style="background- color:#EEEEEE" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article#" itemscope itemtype="http://schema.org/Article"><!--174.109.72.208--><!--curl/7.30.0--><head><meta http- equiv="Content-Type" content="text/html;charset=utf-8"/><meta name="robots" content="index,noarchive"/><meta name="viewport" content="device-width=300, initial-scale=1"/><meta property="twitter:card" content="summary"/><meta property="twitter:site" content="@archiveis"/><meta property="og:type" content="article"/><meta property="og:site_name" content="archive.is"/><meta property="og:url" content="http://archive.is/l6QdV" itemprop="url"/><meta property="og:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:title" content="https://www.cs.odu.edu/~mln/travel.html"/><meta property="twitter:description" content="archived 4 Nov 2018 17:46:33 UTC" itemprop="description"/><meta property="article:published_time" content="2018-11-04T17:46:33Z" itemprop="dateCreated"/><meta property="article:modified_time" content="2018-11-04T17:46:33Z" itemprop="dateModified"/><link rel="image_src" href="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="og:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png" itemprop="image"/><meta property="twitter:image" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:src" content="https://archive.is/l6QdV/d7e3acef18a0433590880dfcc26f8e1f5f18f91e/scr.png"/><meta property="twitter:image:width" content="1024"/><meta property="twitter:image:height" content="768"/><link rel="icon" href="//www.google.com/s2/favicons?domain=www.cs.odu.edu"/><link rel="canonical" href="https://archive.is/l6QdV"/><link rel="bookmark" href="http://archive.today/20181104174633/https://www.cs.odu.edu/~mln/travel.html"/><title></title>< /head><body style="margin:0;background-color:#EEEEEE"><center><div id="HEADER" style="font- family:sans-serif;background [much deletia – you get the point] $ curl -s http://archive.is/20181104174633/https://www.cs.odu.edu/~mln/travel.html | wc 730 3640 62392
  • 13. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL If we just had isolated, static pages (e.g., individual jpegs, pdfs, mp3s) then there’d be no problem. But HTML has: 1) links, 2) embedded resources (including iframes), and 3) Javascript, which can modify the HTML. And HTTP has no “bulk download”, so you can’t grab an entire site instantaneously.
  • 14. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL WARC/1.0 WARC-Type: warcinfo Content-Type: application/warc-fields WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Filename: climate.nasa.gov.warc.gz WARC-Block-Digest: sha1:WWSSYDYY7HTP4JTVOZANSIFPFHUJU64E Content-Length: 257 software: Wget/1.15 (linux-gnu) format: WARC File Format 1.0 conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf robots: classic wget-arguments: "--warc-file=climate.nasa.gov" "https://climate.nasa.gov/vital-signs/carbon-dioxide/" WARC/1.0 WARC-Type: request WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Content-Type: application/http;msgtype=request WARC-Date: 2018-11-03T17:20:02Z WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638> WARC-IP-Address: 54.230.195.16 WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Content-Length: 141 GET /vital-signs/carbon-dioxide/ HTTP/1.1 User-Agent: Wget/1.15 (linux-gnu) Accept: */* Host: climate.nasa.gov Connection: Keep-Alive WARC/1.0 WARC-Type: response WARC-Record-ID: <urn:uuid:5d8861ef-93c5-4d9c-87b8-4f427f963f7c> WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb> WARC-Concurrent-To: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638> WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ WARC-Date: 2018-11-03T17:20:02Z We could hash the WARC file $ md5sum climate.nasa.gov.warc.gz 652853fe1bc8cb273cdf73aad8a489ca climate.nasa.gov.warc.gz But this nasa.gov page contains: •201 images •19 Javascript files •3 CSS files At a large archive like IA they could be in multiple WARC files; worst case is 224 WARC files.
  • 15. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL We can detect changes in the root HTML https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  • 16. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL But what if the change is in an embedded resource?
  • 17. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Clearly we need to render the entire page, then compute the hash. Unfortunately, that’s not easy.
  • 18. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Load the archived page, get an eagle https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 19. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Hit “reload”, get a tiger https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 20. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Hit “reload” again, get a mountain https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 21. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL “Look on my Javascript, ye Mighty, and despair!”
  • 22. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Actually, the fws.gov example was super easy; most changes are much harder to trace. Mohamed Aturban, unpublished, memento: http://web.archive.org/web/20130724144801/http://www.cnn.com/ Animated GIF: https://blog.dshr.org/2017/11/keynote-at-pacific-neighborhood.html
  • 23. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Temporal violations: reconstructing pages that never existed on the live web (examples below are transient; sometimes you get the 1st image, sometimes the 2nd image) embedded in umich.edu memento, archived in perma.cc 2nd image is compressed (12209 vs. 19448 bytes); 2nd image modified in 2017-03, but replayed in a 2017-01 page embedded in copybogger.com memento, archived in archive.org 2nd image modified in 2017-12, but replayed in a 2017-11 page; blackout for privacy Temporal violations: https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
  • 24. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL 1 WARC file, 2 Wayback Machines, 3 Browsers = 6 different replays http://wayback.archive-it.org/all/20130106140348/http://www.harvard.edu/ http://web.archive.org/web/20130106140348/http://www.harvard.edu/ see also. https://ws-dl.blogspot.com/2016/12/2016-12-20-archiving-pages-with.html
  • 25. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Archive URIMs With at least two hashes ---------------------------------------------------------------- webharvest.gov 712 712 (100%) archive.is 1396 1364 (97.70%) vefsafn.is 1589 739 (46.50%) archive-it.org 1383 815 (58.92%) stanford.edu 1222 831 (68.00%) internetmemory.org 979 979 (100%) nationalarchives.gov.uk 994 972 (97.78%) archive.bibalex.org 199 177 (88.94%) bac-lac.gc.ca 351 351 (100%) proni.gov.uk 469 129 (27.50%) www.webarchive.org.uk 349 329 (94.26%) www.webcitation.org 1585 828 (52.23%) veebiarhiiv.digar.ee 488 308 (63.11%) webarchive.loc.gov 1594 526 (32.99%) arquivo.pt 1569 1563 (99.61%) web.archive.org 1566 1334 (85.18%) perma-archives.org 182 180 (98.90%) ---------------------------------------------------------------- 16627 12137 (72.99%) Data from 35 downloads over an 11 month period (2017-11 – 2018-10), Mohamed Aturban (in preparation) (apologies to Heraclitus) You cannot replay twice the same archived page
  • 26. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL A metaphor for replaying archived web pages https://www.youtube.com/watch?v=ekO3Z3XWa0Q https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail
  • 27. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL King of Swamp Castle: live web/ground truth Guard: archival replay https://www.youtube.com/watch?v=ekO3Z3XWa0Q https://en.wikipedia.org/wiki/Monty_Python_and_the_Holy_Grail $ echo "Make sure the prince doesn't leave this room until I come and get him." | md5 57facbb2734d36cb823f4230cc07b888 $ echo "Not to leave the room even if you come and get him." | md5 3ba0a2359d63f43cbe9e11fb5a179b8d $ echo "Until you come and get him, we're not to enter the room." | md5 ade3539aaa8a6d8724193e9a37f3ca6d $ echo "We don't need to do anything apart from just stop him entering the room." | md5 ea812f5b997aa42a8f293bd1ee536fd0 $ echo "Oh yes, we'll keep him in here, obviously. But if he had to leave, and we went with him..." | md5 55d184b77d99eed6367535ef3c05d7aa $ echo "Oh, yes of course. I thought you meant him! You know it seemed a bit daft me having to guard him when he's a guard." | md5
  • 28. Symposium on Blockchain and Trusted Repositories, 2018-11-05, @phonedude_mln, @WebSciDL Archival replay & blockchain: building a castle in a swamp • Fixity checks only work when it’s clear what to hash – Hash only the root HTML and modifications are possible via embedded resources (false negatives) – Recursively hash all embedded resources and you’ll rarely get the same hash (false positives) • There is increasing incentive to attack existing archives and create networks of fake archives – http://bit.ly/Weaponized-Web-Archives • We are investigating archive-aware hashing functions – Vagaries of replay won’t be “fixed”; we need to adapt our hashing • Central vs. distributed? – Ask yourself: “is this a winner-take-all market?” (hello, FANG) – https://blog.dshr.org/2018/09/blockchain-solves-preservation.html