It is hard to compute fixity
on archived web pages
Mohamed Aturban, Michael L. Nelson, Michele C. Weigle
Web Science and Digital Libraries Research Group
Old Dominion University, Norfolk, VA, 23529
WADL 2018, June 6, 2018, Fort Worth, TX, USA
Supported in part by The Andrew W. Mellon Foundation (AMF) grant 11600663
WADL 2018, 2018-06-06 @maturban1
Do archived pages change?
[Figure: timeline with three rows. Live Web: climate.nasa.gov/vital-signs/carbon-dioxide/ over time. Archive: mementos URI-M1, URI-M2, URI-M3, URI-M4, URI-M5. Replay: URI-M2 replayed at several later points in time. Time axis: t0 t2 t4 t5 t6 t9 t14 t16 t17 t18]
When replaying URI-M2 at different points in time, will we get the same content?
Our study shows that we are not always presented with the same archived content!
Cryptographic hashes to create
fixity information
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
My name is Mohamed Aturban, a
graduate student in the
Department of Computer Science
at Old Dominion University. I
am attending the 18th ACM/IEEE
Joint Conference on Digital
Libraries (JCDL), 2018.
SHA256
9801 1510 87e1 6d6b
ddb9 e6b0 09fd b723
abe5 1fea b548 0914
a130 6325 5ae4 6caa
My name is Mohamed Aturban, a
graduate student in the
Department of Computer Science
at Old Dominion University. I
am attending the 18th ACM/IEEE
Joint Conference on Digital
Libraries (JCDL), 2019.
SHA256
5d4d b590 605c 9023
000d 6622 6004 534f
e84a 5549 d535 f91e
cdf4 4952 5c1a 37cf
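The effect shown on this slide can be reproduced with a few lines of Python. This is a minimal sketch using the standard hashlib module; the single-line strings below are an assumption about the exact whitespace of the slide text, so the digests are not expected to match the ones printed above.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA256 digest of a UTF-8 encoded string as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative input: only the year differs between the two strings.
original = ("I am attending the 18th ACM/IEEE Joint Conference "
            "on Digital Libraries (JCDL), 2018.")
modified = original.replace("2018.", "2019.")

h1 = sha256_hex(original)
h2 = sha256_hex(modified)
print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```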
This is what climate.nasa.gov/vital-signs/carbon-dioxide/
looks like right now
Generate hashes on a memento
• Compute a hash value on the downloaded HTML content
% curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
(curl: download the page; shasum -a 256: compute SHA256 hash)
To verify fixity
• Compare the current hash and the previous hash
August 2017: HTML content is downloaded
SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
September 2017: The archived page has been tampered with by changing the value of CO2
October 2017: HTML content is downloaded
SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
Hashes are NOT identical → the page has changed!
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
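The comparison on this slide could be sketched as follows. The function names and the two page bodies are hypothetical stand-ins for the downloaded HTML; the only technique assumed is hashing the bytes with SHA256 and comparing against a previously recorded digest.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hash the downloaded bytes exactly as received."""
    return hashlib.sha256(content).hexdigest()


def verify_fixity(content: bytes, recorded_hash: str) -> bool:
    """Return True when the current hash equals the previously recorded one.

    Spaces are stripped so digests grouped in blocks of four also work.
    """
    return sha256_hex(content) == recorded_hash.replace(" ", "").lower()


# Hypothetical page bodies standing in for the earlier and later downloads.
earlier_html = b"<html><body>CO2: 406 ppm</body></html>"
later_html = b"<html><body>CO2: 306 ppm</body></html>"

recorded = sha256_hex(earlier_html)
print(verify_fixity(earlier_html, recorded))  # unchanged content
print(verify_fixity(later_html, recorded))    # altered content
```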
What if an image has changed?
Computing hashes on only HTML content will
NOT detect changes
Potential solution: include all resources in hash calculation
A composite memento (https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/) has:
• 201 images
• 19 JavaScript files
• 3 CSS files
• Main HTML file
→ A single aggregated hash value
Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
Turns out it is hard to get repeatable hashes on composite mementos
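One way to picture a single aggregated hash over a composite memento is sketched below. This is an illustration only, not the method of any tool cited on the slide: the `aggregate_hash` function, the sorted-URI ordering, and the sample resources are all assumptions made for the example.

```python
import hashlib
from typing import Dict


def aggregate_hash(resources: Dict[str, bytes]) -> str:
    """Combine per-resource SHA256 digests into one digest.

    Resources are processed in sorted URI order so the result does not
    depend on the order in which they were downloaded.
    """
    outer = hashlib.sha256()
    for uri in sorted(resources):
        inner = hashlib.sha256(resources[uri]).hexdigest()
        outer.update(f"{uri} {inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Hypothetical composite memento: main HTML plus embedded resources.
memento = {
    "index.html": b"<html>...</html>",
    "style.css": b"body { margin: 0 }",
    "logo.gif": b"GIF89a-original",
}
before = aggregate_hash(memento)
memento["logo.gif"] = b"GIF89a-changed"  # only an image changes
after = aggregate_hash(memento)
print(before != after)
```

Unlike a hash of the HTML alone, the aggregated value changes when any embedded resource changes, which is also why every source of instability in those resources affects it.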
Archives transform original content to appropriately replay mementos in a user’s browser
Transformation examples:
• Add banners
• Rewrite links to point to the archive, not to the live web
• Modify HTML code to convey metadata
• Apply some policies for security (e.g., block some content)
• Provide the content in different formats (e.g., ZIP and screenshots)
Archives add banners
• To convey information like the number of mementos and
inform users that what they are viewing is from the archive
• Banners change → different hashes
Replayed in 2016 (43 mementos) Replayed in 2017 (49 mementos)
http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
Archives rewrite links to embedded
resources
Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/
Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
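To see why link rewriting alone changes a hash, here is a small sketch. The `rewrite_link` helper is hypothetical and only mimics the URI pattern visible on this slide (prefix, 14-digit timestamp, `im_` modifier); it is not the archive's actual rewriting code.

```python
import hashlib

ORIGINAL_URI = "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif"


def rewrite_link(original_uri: str, timestamp: str,
                 prefix: str = "http://web.archive.org/web/") -> str:
    """Build a Wayback-style URI-M for an embedded image (im_ modifier)."""
    return f"{prefix}{timestamp}im_/{original_uri}"


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original_html = f'<img src="{ORIGINAL_URI}">'
rewritten_html = original_html.replace(
    ORIGINAL_URI, rewrite_link(ORIGINAL_URI, "19970725063110"))

print(rewritten_html)
# The visible page is the same image, but the bytes (and the hash) differ.
print(sha256_hex(original_html) == sha256_hex(rewritten_html))
```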
Live web resources linked from archives
• Resources from the live web are expected to change → different hashes
• Based on feedback from Lerner et al., IA solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Archived in 2008. This memento was replayed in 2012. The ad is from 2012.
A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
Caches may temporarily hide
changes in the playback
% date
Mon Oct 2 01:15:18 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:16:29 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:19:31 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 02:10:24 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
dda6a9bf091d412cbdc2226ce3eb1059
X-Page-Cache: MISS
X-Page-Cache: HIT
X-Page-Cache: MISS
X-Page-Cache: HIT
Dynamic content by JS → different hashes
A large number of mementos are unavailable
A resource selected randomly by JavaScript
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
// Excerpt from the archived page: a banner image is picked at random on each load
function random_imglink(){
  myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
  myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
  myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
  // Random index into myimages; index 0 is unused, so 0 is mapped to 1
  var ry=Math.floor(Math.random(1)*myimages.length)
  if (ry==0)
    ry=1
  document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
}
Changes in TimeMaps
A TimeMap = a list of available mementos = URI-M1, URI-M2, URI-M3, URI-M4, URI-M5
[Figure: the same timeline of Live Web, Archive (URI-M1 to URI-M5) and Replay (URI-M2 replayed three times), with unavailable mementos marked X. Time axis: t0 t2 t4 t5 t6 t9 t14 t16 t17 t18]
The requested memento is unavailable
Mementos with the same content are not available either
URI-M2 redirects (302 Redirect) to another memento (URI-M4), which has different content
Changes in TimeMaps → different HTTP entity → different hashes
• You can see the difference in the URI-M of the main HTML file
November 2017: Requesting URI-M1; HTML content is downloaded
Hash: d13a 247e 872e 11d5 64f4 b49a 24d8 275c a09f ee8d 48c0 0345 f458 5d4b 7ec3 e663
December 2017: Requesting URI-M1 → 302 Redirect → URI-M2 (URI-M1 was NOT available); HTML content is downloaded
Hash: 55b5 6d82 7f98 f81e 3fc6 9e03 c0c1 f739 7fa4 0bff 4e36 0303 9ddd 50a2 6ae2 8229
URI-M1 = web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
URI-M2 = web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/

Changes in TimeMaps → different image → different hashes
• You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images
Main HTML file: https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
December 2017: Requesting URI-M1
March 2018: Requesting URI-M1 → 302 Redirect → URI-M2 (URI-M1 was NOT available)
URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G

Changes in TimeMaps → different image that looks the same → different hashes
• You can't see the difference in the URI-M of the main HTML file nor the difference in the embedded images
Main HTML file: http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
December 12, 2017: Requesting URI-M1
December 25, 2017: Requesting URI-M1 → 302 Redirect → URI-M2 (URI-M1 was NOT available; different image)
URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
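The redirects in these examples can be modelled with a toy resolver. The closest-datetime rule below is an assumption about how Wayback-style replay picks a substitute when a memento drops out of the TimeMap; the slides only show that a 302 Redirect to a different URI-M occurs. The `resolve` function is hypothetical, and the timestamps are taken from the first example.

```python
from datetime import datetime
from typing import List, Optional

FMT = "%Y%m%d%H%M%S"  # 14-digit timestamp used in URI-Ms


def resolve(requested: str, available: List[str]) -> Optional[str]:
    """Return the requested timestamp if it is in the TimeMap, otherwise
    the closest available one (None when the TimeMap is empty)."""
    if requested in available:
        return requested
    if not available:
        return None
    target = datetime.strptime(requested, FMT)
    return min(available,
               key=lambda ts: abs(datetime.strptime(ts, FMT) - target))


timemap = ["20080828005922", "20090211151609"]
first = resolve("20080828005922", timemap)   # served directly

timemap.remove("20080828005922")             # URI-M1 is no longer available
second = resolve("20080828005922", timemap)  # another memento is served

print(first, second)
```

The same request therefore yields a different HTTP entity, and a different hash, without the requested URI-M itself changing.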
Transient error
• Incomplete HTTP entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
Download the image on December 7, 2017
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-12-07T10:04:18Z
…
Content-Length: 459640
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 07 Dec 2017 10:04:18 GMT
…
The first Content-Length (the WARC record block) should be bigger than the second one (the HTTP entity)

The complete HTTP entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-11-16T15:34:37Z
…
Content-Length: 643398
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 16 Nov 2017 15:34:36 GMT
…
This is what it should look like
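The check implied by these two records can be written as a one-line comparison. This is a sketch with a hypothetical function name; it relies on the fact that a WARC response record's Content-Length covers the HTTP status line, headers and body, while the HTTP Content-Length covers the body only.

```python
def looks_truncated(warc_content_length: int, http_content_length: int) -> bool:
    """Flag a WARC response record whose block is too small to hold the
    HTTP entity that its own HTTP headers announce."""
    return warc_content_length <= http_content_length


# Values from the two records above.
print(looks_truncated(459640, 642336))  # December 7, 2017 download
print(looks_truncated(643398, 642336))  # November 16, 2017 download
```

A record flagged this way could be re-downloaded before hashing instead of being treated as a real change in the memento.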
Requesting the raw version, received "200 OK" with a rewritten version that indicates "302 Redirect"
http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/
curl -I http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/
HTTP/1.1 200 OK
Date: Tue, 05 Jun 2018 17:34:19 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux)
Content-Security-Policy: default-src 'self' style-src 'self' 'unsafe-inline'
Memento-Datetime: Wed, 13 Mar 2013 21:04:47 GMT
…
A request for the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ redirects to the live web
curl -iL --silent webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
Location: https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 301 Moved Permanently
Location: https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 302 Found
Location: http://www.usda.gov/wps/portal/usdahome
HTTP/1.1 301 Moved Permanently
Location: https://www.usda.gov/wps/portal/usdahome
location: https://www.usda.gov/
Requesting the raw version, a third party service (Cloudflare) injects HTML code
curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:45 GMT
<a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:50 GMT
<a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
curl -silent http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:51 GMT
<a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
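The three payloads above differ byte for byte, yet they appear to carry the same text. The sketch below assumes the commonly described Cloudflare email-obfuscation scheme, in which the first byte is an XOR key applied to every following byte; that scheme is not stated in the slides, and `decode_cf_email` is a hypothetical helper. Only the leading bytes of each payload are used, because the full strings are elided in the source.

```python
def decode_cf_email(hex_string: str) -> str:
    """Decode an email-protection payload.

    Assumed scheme: byte 0 is an XOR key, and each remaining byte is one
    character XORed with that key.
    """
    data = bytes.fromhex(hex_string)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("latin-1")


# Leading bytes of the three payloads shown above.
samples = [
    "28175b5d4a424d4b5c15",
    "68571b1d0a020d0b1c55",
    "b986caccdbd3dcdacd84",
]
decoded = [decode_cf_email(s) for s in samples]
print(decoded)
print(len(set(samples)), "distinct payloads,",
      len(set(decoded)), "distinct plaintext")
```

Under this assumption the key changes on every request, so a hash over the raw bytes differs each time even though the underlying content does not.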
Requirements for generating
repeatable hashes
1. Generate a hash on a composite memento
2. Exclude archive-specific resources
3. Avoid resources from the live web
4. Avoid content served from cache
5. Changes in TimeMaps might affect the
computation of hashes
6. Avoid including dynamic content
Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping Archived Web Pages. Tech. Rep. arXiv:1712.03140 (2017)
https://arxiv.org/pdf/1712.03140.pdf
Our study indicates that 28% of
mementos produce different hashes
• 17,074 archived pages
• From 17 public web archives
• Downloaded 20 times
• Between November 16, 2017 and March 27, 2018
Preliminary work
The selected original pages (URI-Rs)
and mementos (URI-Ms)
Sources of URI-Rs:
• The HTTP Archive (httparchive.org)
• The Web Archives for Historical Research (uwaterloo.ca/web-archive-group/)
• Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
Selected Mementos (URI-Ms)
Five steps to generate a hash on a memento
1. Download a memento with Headless Chrome
2. Write it to a WARC file with Squidwarc (github.com/N0taN3rd/Squidwarc)
3. Extract all URI-Ms from the WARC file
4. Request the raw version of the URI-Ms
5. Compute the final hash using a Merkle tree
(1) Download a memento with Headless Chrome and (2) write it to a WARC file with Squidwarc
https://github.com/N0taN3rd/Squidwarc
web.archive.org/web/19961120150251/http://www.usnews.com:80/ → download with Headless Chrome → write with Squidwarc → WARC
(3) Extract all URI-Ms from the WARC file
WARC file → read WARC records with warcio
https://github.com/webrecorder/warcio
(4) Request the raw version of URI-M
[Figure: list of URI-Ms extracted from the WARC file, each marked ✓ or x. Legend: x Archive-specific resources; x Not available or redirect]
/web/19961120150251id_/http://www.usnews.com:80/
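For Wayback-style URI-Ms, the raw version shown here differs from the rewritten one only by the `id_` modifier after the 14-digit timestamp. The helper below is a hypothetical sketch of that transformation; it assumes this URI pattern and will not apply to archives that expose raw content through a different API.

```python
import re

# A 14-digit timestamp, optionally followed by a two-letter modifier
# such as im_ or id_, delimited by slashes.
_TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def to_raw_urim(urim: str) -> str:
    """Insert (or normalise to) the id_ modifier after the first timestamp."""
    return _TIMESTAMP.sub(lambda m: f"/{m.group(1)}id_/", urim, count=1)


print(to_raw_urim("/web/19961120150251/http://www.usnews.com:80/"))
print(to_raw_urim(
    "http://web.archive.org/web/19970725063110im_/"
    "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif"))
```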
APIs to request the raw version
of URI-Ms
(5) Generate the final hash by a Merkle tree
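A Merkle root over the resources of a composite memento might be computed as in the sketch below. The exact tree construction used in the study is not given on the slide, so the choices here (SHA256, pairing adjacent nodes, carrying an odd node up unchanged) are assumptions made for illustration, as are the sample resources.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> str:
    """Return the hex Merkle root of the given leaf contents.

    Leaves are hashed individually, then adjacent pairs are hashed
    together level by level; an odd node is carried up unchanged.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()


# Hypothetical raw resources of one memento, in a fixed order.
resources = [b"<html>...</html>", b"body { margin: 0 }", b"GIF89a-original"]
root_before = merkle_root(resources)
resources[2] = b"GIF89a-changed"
root_after = merkle_root(resources)
print(root_before != root_after)
```

Compared with hashing one concatenated blob, a tree keeps the per-resource hashes available, so a mismatch can be traced to the resource that changed.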
Within 5 months, 28% of mementos
produce different hashes because:
• Transient errors
• Dynamic URI-Ms
• Instability of available mementos
Preliminary work
Archived web pages with different hashes per archive

Archive                    Archived pages   Archived pages with different hashes (%)
archive.org                1,600            1,027 (64%)
webarchive.loc.gov         1,600            821 (51%)
vefsafn.is                 1,600            764 (48%)
arquivo.pt                 1,600            305 (19%)
webcitation.org            1,600            57 (4%)
archive.is                 1,600            0 (0%)
archive-it.org             1,407            489 (35%)
swap.stanford.edu          1,233            195 (16%)
nationalarchives.gov.uk    1,011            95 (9%)
europarchive.org           990              97 (10%)
webharvest.gov             733              178 (24%)
digar.ee                   518              81 (16%)
webarchive.proni.gov.uk    477              50 (10%)
webarchive.org.uk          362              275 (76%)
collectionscanada.gc.ca    359              13 (4%)
archive.bibalex.org        202              156 (77%)
perma-archives.org         182              147 (81%)
Total                      17,074           4,750 (28%)
Conclusions
• We downloaded 17,074 mementos 20 times between November 16, 2017 and March 27, 2018
• Within the 5 months, we found that 28% of mementos produce different hash values
• It is hard to get repeatable hashes on the playback of mementos because of transient errors, dynamic URI-Ms, and instability of indexes in archives
• We need an archive-aware hashing function to produce repeatable hashes

Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling...
 
Northeastern DB Class Introduction to Marklogic NoSQL april 2016
Northeastern DB Class Introduction to Marklogic NoSQL april 2016Northeastern DB Class Introduction to Marklogic NoSQL april 2016
Northeastern DB Class Introduction to Marklogic NoSQL april 2016
 
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web ArchiveLessons Learned From the Longitudinal Sampling of a Large Web Archive
Lessons Learned From the Longitudinal Sampling of a Large Web Archive
 
Fluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP ArchiveFluent 2018: Tracking Performance of the Web with HTTP Archive
Fluent 2018: Tracking Performance of the Web with HTTP Archive
 
Blockchains and databases a new era in distributed computing
Blockchains and databases a new era in distributed computingBlockchains and databases a new era in distributed computing
Blockchains and databases a new era in distributed computing
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
 
This Month in Data Science - April Edition
This Month in Data Science - April EditionThis Month in Data Science - April Edition
This Month in Data Science - April Edition
 
Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving in the Year eaee1902f186819154789ee22ca30035
 
Streaming Media West: Webrtc the future of low latency streaming
Streaming Media West: Webrtc the future of low latency streamingStreaming Media West: Webrtc the future of low latency streaming
Streaming Media West: Webrtc the future of low latency streaming
 
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence Weaponized Web Archives: Provenance Laundering of Short Order Evidence
Weaponized Web Archives: Provenance Laundering of Short Order Evidence
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Aggregation of Linked Data A case study in the cultural heritage domain
Aggregation of Linked Data A case study in the cultural heritage domainAggregation of Linked Data A case study in the cultural heritage domain
Aggregation of Linked Data A case study in the cultural heritage domain
 
Micro:bit Workshop -- July 2018
Micro:bit Workshop -- July 2018Micro:bit Workshop -- July 2018
Micro:bit Workshop -- July 2018
 

Recently uploaded

Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxpriyankatabhane
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxPayal Shrivastava
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyChayanika Das
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxpriyankatabhane
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptAmirRaziq1
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPRPirithiRaju
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsCreative-Biolabs
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Production technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongenaProduction technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongenajana861314
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxGiDMOh
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests GlycosidesNandakishor Bhaurao Deshmukh
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyHemantThakare8
 
Understanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfUnderstanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfHabibouKarbo
 
Role of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxRole of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxjana861314
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGiovaniTrinidad
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...jana861314
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasChayanika Das
 

Recently uploaded (20)

Introduction Classification Of Alkaloids
Introduction Classification Of AlkaloidsIntroduction Classification Of Alkaloids
Introduction Classification Of Alkaloids
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptx
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptx
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.ppt
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative Biolabs
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Production technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongenaProduction technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongena
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiology
 
Understanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfUnderstanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdf
 
Role of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxRole of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptx
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
 

It is hard to compute fixity on archived web pages

  • 1. It is hard to compute fixity on archived web pages Mohamed Aturban, Michael L. Nelson, Michele C. Weigle Web Science and Digital Libraries Research Group Old Dominion University, Norfolk, VA, 23529 WADL 2018, June 6, 2018, Fort Worth, TX, USA Supported in part by The Andrew W. Mellon Foundation (AMF) grant 11600663 WADL 2018, 2018-06-06 @maturban1
  • 2. Do archived pages change? [Timeline figure: the live web page climate.nasa.gov/vital-signs/carbon-dioxide/ with times t0, t9, t14]
  • 3. Do archived pages change? [Timeline figure: live web and archive timelines; the archive holds mementos URI-M1 to URI-M5; time axis t0 to t18]
  • 4. Do archived pages change? [Timeline figure: a replay timeline is added below the live web and archive timelines; URI-M2 is replayed once]
  • 5. Do archived pages change? When replaying URI-M2 at different points in time, will we get the same content? [Timeline figure: URI-M2 is replayed once]
  • 6. Do archived pages change? When replaying URI-M2 at different points in time, will we get the same content? [Timeline figure: URI-M2 is replayed twice]
  • 7. Do archived pages change? When replaying URI-M2 at different points in time, will we get the same content? [Timeline figure: URI-M2 is replayed twice]
  • 8. Do archived pages change? When replaying URI-M2 at different points in time, will we get the same content? [Timeline figure: URI-M2 is replayed three times]
  • 9. Do archived pages change? Our study shows that we are not always presented with the same archived content! [Timeline figure: URI-M2 is replayed three times]
  • 10. Cryptographic hashes to create fixity information • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output. Input 1: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018." SHA256: 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa. Input 2: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019." SHA256: 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
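
The effect described on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' tooling: the two short sentences are stand-ins for the slide's sample text, and `sha256_hex` and `differing_bits` are hypothetical helper names.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text (UTF-8 encoded) as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Stand-in inputs (an assumption): they differ in a single character.
a = sha256_hex("I am attending JCDL 2018.")
b = sha256_hex("I am attending JCDL 2019.")

print(a)
print(b)
print(differing_bits(a, b), "of 256 bits differ")
```

Because the digest depends on every input byte, including whitespace and line breaks, the values printed here will not match the digests on the slide unless the input is byte-for-byte identical to what the authors hashed.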
  • 11. This is what climate.nasa.gov/vital-signs/carbon-dioxide/ looks like right now [screenshot]
  • 12. Generate hashes on a memento • Compute a hash value on the downloaded HTML content % curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256 (curl downloads the page; shasum computes the SHA256 hash) Output: 17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
  • 13. To verify fixity • Compare the current hash and the previous hash. August 2017: HTML content is downloaded; SHA256 hash e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521. September 2017: the archived page has been tampered with by changing the value of CO2. October 2017: HTML content is downloaded; SHA256 hash fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790. Hashes are NOT identical → the page has changed! http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
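
The comparison on this slide amounts to recording a hash at one point in time and recomputing it later. A minimal sketch, assuming the content is already downloaded as bytes; the two HTML snippets are made-up stand-ins, and `fixity_unchanged` is a hypothetical name.

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the downloaded bytes as a hex string."""
    return hashlib.sha256(content).hexdigest()

def fixity_unchanged(previous_hash: str, current_content: bytes) -> bool:
    """True if the content downloaded now hashes to the previously recorded value."""
    return sha256_hex(current_content) == previous_hash

# Made-up stand-ins for the HTML downloaded on two different dates.
august = b"<html>CO2: 406 ppm</html>"
october = b"<html>CO2: 306 ppm</html>"

recorded = sha256_hex(august)                # fixity information stored in August
print(fixity_unchanged(recorded, august))    # True: same bytes, same hash
print(fixity_unchanged(recorded, october))   # False: the content has changed
```

A mismatch only says that the bytes differ. As the following slides argue, it does not by itself distinguish tampering from changes introduced by the archive's playback.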
  • 14. What if an image has changed? Computing hashes on only HTML content will NOT detect changes
  • 15. Potential solution: include all resources in hash calculation. A composite memento, https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/, has: • 201 images • 19 JavaScript files • 3 CSS files • Main HTML file → a single aggregated hash value. Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html It turns out it is hard to get repeatable hashes on composite mementos
  • 16. Archives transform original content to appropriately replay mementos in a user’s browser. Transformation examples: • Add banners • Rewrite links to point to the archive, not to the live web • Modify HTML code to convey metadata • Apply some policies for security (e.g., block some content) • Provide the content in different formats (e.g., ZIP and screenshots)
  • 17. Archives add banners • To convey information like the number of mementos and inform users that what they are viewing is from the archive • Banners change → different hashes. Replayed in 2016 (43 mementos); replayed in 2017 (49 mementos) http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
  • 18. Archives rewrite links to embedded resources. Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/ Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
  • 19. Live web resources linked from archives • Resources from the live web are expected to change → different hashes • Based on feedback from Lerner et al., IA solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html [Screenshot: archived in 2008; the ad is from 2012; this memento was replayed in 2012] A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
  • 20. Caches may temporarily hide changes in the playback % date Mon Oct 2 01:15:18 EDT 2017 % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5 477b6d923cbb7bf9675a0d2feb37afd3 % date Mon Oct 2 01:16:29 EDT 2017 % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5 477b6d923cbb7bf9675a0d2feb37afd3 % date Mon Oct 2 01:19:31 EDT 2017 % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5 477b6d923cbb7bf9675a0d2feb37afd3 % date Mon Oct 2 02:10:24 EDT 2017 % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5 dda6a9bf091d412cbdc2226ce3eb1059 X-Page-Cache: MISS X-Page-Cache: HIT X-Page-Cache: MISS X-Page-Cache: HIT
  • 21. Dynamic content by JS → different hashes
  • 22. Dynamic content by JS → different hashes
  • 23. Dynamic content by JS → different hashes
  • 24. Dynamic content by JS → different hashes
  • 25. Dynamic content by JS → different hashes. A large number of mementos are unavailable
  • 26. A resource selected randomly by JavaScript https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 29. A resource selected randomly by JavaScript https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/ function random_imglink(){ myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg"; myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg"; myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg"; var ry=Math.floor(Math.random(1)*myimages.length) if (ry==0) ry=1 document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>') }
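
To see why such a script defeats repeatable hashing, its selection logic can be mimicked in Python. This is a minimal sketch under stated assumptions: `myimages` is taken to be a JavaScript array whose index 0 is unused (so `myimages.length` is 4; the declaration is not shown on the slide), the page content is a made-up stand-in, and `pick_banner` and `replay_hash` are hypothetical names.

```python
import hashlib
import random

# Banner file names from the script on the slide; index 0 is assumed to be unused.
MYIMAGES = [None, "bannerbluemnt.jpg", "bannereagle.jpg", "bannertiger.jpg"]

def pick_banner(rng: random.Random) -> str:
    """Mimic: ry = Math.floor(Math.random() * myimages.length); if (ry == 0) ry = 1."""
    ry = int(rng.random() * len(MYIMAGES))
    if ry == 0:
        ry = 1
    return MYIMAGES[ry]

def replay_hash(rng: random.Random) -> str:
    """Hash a stand-in for one replay: fixed HTML plus whichever banner was chosen."""
    page = "<html>fws.gov</html>" + pick_banner(rng)
    return hashlib.sha256(page.encode("utf-8")).hexdigest()

rng = random.Random(2018)  # seeded only so the demonstration is reproducible
distinct = {replay_hash(rng) for _ in range(200)}
print(len(distinct), "distinct hashes over 200 simulated replays")
```

Under these assumptions the first banner is chosen about half the time, because both `ry == 0` and `ry == 1` map to it. Any hash that covers the selected image therefore varies between replays even though nothing in the archive has changed.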
  • 30. Changes in TimeMaps. A TimeMap = a list of available mementos = URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 [Timeline figure: live web, archive, and replay timelines; URI-M2 is replayed three times]
  • 31. The requested memento is unavailable [Timeline figure: URI-M2 is marked X]
  • 32. Mementos with the same content are also unavailable [Timeline figure: three mementos are marked X]
  • 33. URI-M2 redirects to another memento (URI-M4), which has different content [Timeline figure: three mementos are marked X; 302 Redirect]
  • 34. Changes in TimeMaps → different HTTP entity → different hashes. November 2017: requesting URI-M1 returns URI-M1; HTML content is downloaded; hash d13a 247e 872e 11d5 64f4 b49a 24d8 275c a09f ee8d 48c0 0345 f458 5d4b 7ec3 e663. December 2017: URI-M1 was NOT available; requesting URI-M1 gives a 302 redirect to URI-M2; HTML content is downloaded; hash 55b5 6d82 7f98 f81e 3fc6 9e03 c0c1 f739 7fa4 0bff 4e36 0303 9ddd 50a2 6ae2 8229. • You can see the difference in the URI-M of the main HTML file: web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/ web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
  • 35. Changes in TimeMaps → different image → different hashes. December 2017: requesting URI-M1 returns URI-M1. March 2018: URI-M1 was NOT available; requesting URI-M1 gives a 302 redirect to URI-M2. URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G • You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images. Main HTML file on both dates: https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
  • 36. Changes in TimeMaps → different image that looks the same → different hashes. December 12, 2017: requesting URI-M1 returns URI-M1. December 25, 2017: URI-M1 was NOT available; requesting URI-M1 gives a 302 redirect to URI-M2, a different image. URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/ URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/ • You can see neither the difference in the URI-M of the main HTML file nor the difference in the embedded images. Main HTML file on both dates: http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
  • 37. Transient error • Incomplete HTTP entity http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg Download the image on December 7, 2017: WARC/1.0 WARC-Type: response WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC-Date: 2017-12-07T10:04:18Z … Content-Length: 459640 HTTP/1.0 200 Content-Type: image/jpeg Content-Length: 642336 Date: Thu, 07 Dec 2017 10:04:18 GMT … The first Content-Length should be bigger than the second one
  • 38. The complete HTTP entity http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC/1.0 WARC-Type: response WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC-Date: 2017-11-16T15:34:37Z … Content-Length: 643398 HTTP/1.0 200 Content-Type: image/jpeg Content-Length: 642336 Date: Thu, 16 Nov 2017 15:34:36 GMT … This is what it should look like
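
The consistency check implied by these two records can be written down directly. A minimal sketch, assuming both Content-Length values have already been parsed out of the WARC record; `record_is_truncated` is a hypothetical name, and the numbers are the ones shown on the two slides.

```python
def record_is_truncated(warc_content_length: int, http_content_length: int) -> bool:
    """Flag a WARC response record whose block is shorter than the HTTP entity it declares.

    The record block holds the HTTP status line and headers plus the entity body,
    so its length must be at least the Content-Length reported by the HTTP response.
    """
    return warc_content_length < http_content_length

# Download of December 7, 2017: the block is shorter than the declared entity.
print(record_is_truncated(459640, 642336))  # True

# Download of November 16, 2017: the block covers headers plus the full entity.
print(record_is_truncated(643398, 642336))  # False
```

A download flagged this way could be retried or excluded before hashing, so that a transient error is not mistaken for a change in the archived content.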
  • 39. Requesting the raw version, received "200 OK" with a rewritten version that indicates "302 Redirect" http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/ curl -I http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/ HTTP/1.1 200 OK Date: Tue, 05 Jun 2018 17:34:19 GMT Server: Apache/2.4.6 (Red Hat Enterprise Linux) Content-Security-Policy: default-src 'self' style-src 'self' 'unsafe-inline' Memento-Datetime: Wed, 13 Mar 2013 21:04:47 GMT …
  • 40. Requesting the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ redirects to the live web
  • 41. Requesting the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ redirects to the live web curl -iL --silent webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ | egrep -i "(HTTP/1.1|^location:)" HTTP/1.1 301 Moved Permanently Location: https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ HTTP/1.1 301 Moved Permanently Location: https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ HTTP/1.1 302 Found Location: http://www.usda.gov/wps/portal/usdahome HTTP/1.1 301 Moved Permanently Location: https://www.usda.gov/wps/portal/usdahome location: https://www.usda.gov/
  • 42. Requesting the raw version, a third-party service (Cloudflare) injects HTML code curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)" Date: Tue, 15 May 2018 21:00:45 GMT <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" … curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)" Date: Tue, 15 May 2018 21:00:50 GMT <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" … curl -silent http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)" Date: Tue, 15 May 2018 21:00:51 GMT <a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
  • 43. Requirements for generating repeatable hashes 1. Generate a hash on a composite memento 2. Exclude archive-specific resources 3. Avoid resources from the live web 4. Avoid content served from cache 5. Changes in TimeMaps might affect the computation of hashes 6. Avoid including dynamic content. Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping Archived Web Pages. Tech. Rep. arXiv:1712.03140 (2017) https://arxiv.org/pdf/1712.03140.pdf
  • 44. Our study indicates that 28% of mementos produce different hashes (preliminary work) • 17,074 archived pages • From 17 public web archives • Downloaded 20 times • Between November 16, 2017 and March 27, 2018
  • 45. The selected original pages (URI-Rs) and mementos (URI-Ms). Sources of URI-Rs: • The HTTP Archive (httparchive.org) • The Web Archives for Historical Research (uwaterloo.ca/web-archive-group/) • Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
  • 46. Selected Mementos (URI-Ms) [figure]
  • 47. Selected Mementos (URI-Ms) [figure]
  • 48. Five steps to generate a hash on a memento 1. Download a memento with Headless Chrome 2. Write it to a WARC file with Squidwarc github.com/N0taN3rd/Squidwarc 3. Extract all URI-Ms from the WARC file 4. Request the raw version of URI-Ms 5. Compute the final hash using a Merkle tree
  • 49. (1) Download a memento with Headless Chrome and (2) write it to a WARC file with Squidwarc https://github.com/N0taN3rd/Squidwarc [Diagram: web.archive.org/web/19961120150251/http://www.usnews.com:80/ → download with Headless Chrome → write with Squidwarc → WARC]
  • 50. (3) Extract all URI-Ms from the WARC file [Diagram: WARC file → read WARC records with WARCIO] https://github.com/webrecorder/warcio
  • 51. (4) Request the raw version of URI-M [Figure: list of URI-Ms, most marked x and a few marked ✓; legend: x = archive-specific resources; x = not available or redirect]
  • 52. (4) Request the raw version of URI-M [Figure: list of URI-Ms, most marked x and a few marked ✓; legend: x = archive-specific resources; x = not available or redirect] Example: /web/19961120150251id_/http://www.usnews.com:80/
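
For Wayback-style URI-Ms such as the example on this slide, the raw version is requested by placing `id_` after the 14-digit timestamp. A minimal sketch, assuming that URI pattern; `raw_uri_m` is a hypothetical helper, and archives that expose the raw content differently are not covered.

```python
import re

# A 14-digit timestamp path segment, optionally followed by a two-letter
# replay modifier such as "im_" or "id_".
_TIMESTAMP = re.compile(r"/(\d{14})([a-z]{2}_)?/")

def raw_uri_m(uri_m: str) -> str:
    """Rewrite a Wayback-style URI-M so that it requests the raw (unmodified) content."""
    return _TIMESTAMP.sub(lambda m: f"/{m.group(1)}id_/", uri_m, count=1)

print(raw_uri_m("http://web.archive.org/web/19961120150251/http://www.usnews.com:80/"))
print(raw_uri_m("http://web.archive.org/web/19970725063110im_/"
                "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif"))
```

As the earlier slides show, a raw request does not guarantee raw content: some archives still rewrite the response, redirect to the live web, or sit behind a third-party service that injects markup.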
  • 53. APIs to request the raw version of URI-Ms [figure]
  • 54. (5) Generate the final hash by a Merkle tree [figure]
  • 55. Within 5 months, 28% of mementos produce different hashes (preliminary work) because of: • Transient errors • Dynamic URI-Ms • Instability of available mementos
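
A Merkle tree reduces the hashes of all resources in a composite memento to one root value. The slide does not give the construction details, so this is a minimal sketch under assumptions labeled here and in the code: SHA-256 is the hash, leaf hashes are sorted so the root does not depend on download order, and an unpaired node is promoted to the next level unchanged. `merkle_root` is a hypothetical name.

```python
import hashlib
from typing import Iterable, List

def _h(data: bytes) -> bytes:
    """SHA-256 digest as raw bytes."""
    return hashlib.sha256(data).digest()

def merkle_root(resources: Iterable[bytes]) -> str:
    """Return the Merkle root (hex) over the raw bytes of each resource."""
    # Assumption: sort leaf hashes so the result is independent of resource order.
    level: List[bytes] = sorted(_h(r) for r in resources)
    if not level:
        raise ValueError("at least one resource is required")
    while len(level) > 1:
        next_level: List[bytes] = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # Assumption: an unpaired node is carried up to the next level as-is.
            next_level.append(_h(pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = next_level
    return level[0].hex()

# Made-up stand-ins for the main HTML file and three embedded resources.
page = [b"<html>main</html>", b"body{}", b"console.log(1)", b"\x89PNG..."]
print(merkle_root(page))
print(merkle_root(page[:-1] + [b"\x89PNG-changed"]))  # a different root
```

Changing any single resource changes the root, which addresses the earlier concern that hashing only the HTML misses a changed image.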
  • 56. Archived web pages with different hashes per archive
    Archive | Archived pages | Archived pages with different hashes (%)
    archive.org | 1,600 | 1,027 (64%)
    webarchive.loc.gov | 1,600 | 821 (51%)
    vefsafn.is | 1,600 | 764 (48%)
    arquivo.pt | 1,600 | 305 (19%)
    webcitation.org | 1,600 | 57 (4%)
    archive.is | 1,600 | 0 (0%)
    archive-it.org | 1,407 | 489 (35%)
    swap.stanford.edu | 1,233 | 195 (16%)
    nationalarchives.gov.uk | 1,011 | 95 (9%)
    europarchive.org | 990 | 97 (10%)
    webharvest.gov | 733 | 178 (24%)
    digar.ee | 518 | 81 (16%)
    webarchive.proni.gov.uk | 477 | 50 (10%)
    webarchive.org.uk | 362 | 275 (76%)
    collectionscanada.gc.ca | 359 | 13 (4%)
    archive.bibalex.org | 202 | 156 (77%)
    perma-archives.org | 182 | 147 (81%)
    Total | 17,074 | 4,750 (28%)
  • 57. Conclusions • We downloaded 17,074 mementos 20 times between November 16, 2017 and March 27, 2018 • Within the 5 months, we found that 28% of mementos produce different hash values • It is hard to get repeatable hashes on the playback of mementos because of transient errors, dynamic URI-Ms, and instability of indexes in archives • We need an archive-aware hashing function to produce repeatable hashes