Establishing and Verifying Fixity of Archived Web Pages

Mohamed Aturban
Old Dominion University
Advisors:
Dr. Michele C. Weigle and Dr. Michael L. Nelson
JCDL 2018 Doctoral Consortium
June 3, 2018
JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
This is what climate.nasa.gov/vital-signs/carbon-dioxide/
looks like right now
The Internet Archive allows us to view
previous versions (mementos) of that page
http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-dioxide/
https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/
An archived page (memento) from July 2016
The page in other web archives
for a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
Typical archive URI construction:
archive.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide
The number of mementos available in each archive (archive labels not recoverable from the slide): 2,782; 48; 3; 12; 0; 0; 3
What if we checked these archives?
What if they all agree?
Would you trust the results?
climate.nasa.gov/vital-signs/carbon-dioxide/
michaelsevilwayback/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
The web page was archived by Michael’s Evil Wayback in July 2017
michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
Replaying the same memento in October 2017, we got a different CO2 value
michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
Which one is the real memento?
July 2017 October 2017
• How can we ensure that a memento has remained unaltered since the time it was captured by the archive?
michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/ (same URI-M for both screenshots)
It is important to verify fixity of
archived resources
• Web archives will be the only evidence of what was on the live web
• For example, the Data Refuge project is an attempt to preserve federal climate and environmental data
- But in the future, how can we verify that the archived copies in this specific archive have remained unchanged?
Do archived pages change?
[Timeline figure, built up over several slides. Live Web row: climate.nasa.gov/vital-signs/carbon-dioxide/. Archive row: mementos URI-M1, URI-M2, URI-M3, URI-M4, URI-M5. Replay Time row: URI-M2 replayed three times. Time axis: t0 t2 t4 t5 t6 t9 t14 t16 t17 t18]
When replaying URI-M2 at different points in time, will we get the same content?
Our study shows that we are not always presented with the same archived content!
Cryptographic hashes to create
fixity information
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
Input 1: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018."
SHA256: 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
Input 2: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019."
SHA256: 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
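The avalanche behavior described above can be reproduced with a minimal Python sketch. It assumes the slide text is a single line with single spaces, so the digests it prints may not match the slide values if the original input used different whitespace or line breaks. The helper names `sha256_hex` and `differing_bits` are illustrative only.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the UTF-8 encoding of text, as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Assumed single-line rendering of the slide text (whitespace is a guess).
BASE = ("My name is Mohamed Aturban, a graduate student in the Department of "
        "Computer Science at Old Dominion University. I am attending the 18th "
        "ACM/IEEE Joint Conference on Digital Libraries (JCDL), ")

digest_2018 = sha256_hex(BASE + "2018.")
digest_2019 = sha256_hex(BASE + "2019.")

print(digest_2018)
print(digest_2019)
print(differing_bits(digest_2018, digest_2019), "of 256 bits differ")
```

Changing a single character typically flips roughly half of the 256 output bits, which is why a cryptographic hash can flag even a one-digit edit to an archived page.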
SimHash:
A small change in the input → a small change in the output
Input 1: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018."
SimHash: ed646a9efbc77705
Input 2: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019."
SimHash: ed646a9efbc77305
https://github.com/leonsim/simhash
• We cannot use SimHash on archived pages because small changes matter
SimHash:
Large changes in the input → large changes in the output
Input 1: "My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018."
SimHash: ed646a9efbc77705
Input 2: "My name is Sawood Alam, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019."
SimHash: ed666bdefbc77205
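To make the SimHash behavior concrete, here is a toy implementation in Python. It is a sketch written for illustration, not the leonsim/simhash library referenced on the slide: the tokenization (whitespace split) and the feature hash (first 8 bytes of MD5) are assumptions, so its fingerprints will not equal the values shown above.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch


def simhash(text: str) -> int:
    """Toy SimHash: every whitespace-separated token votes on every bit."""
    votes = [0] * BITS
    for token in text.split():
        # Assumed feature hash: first 8 bytes of MD5 as a 64-bit integer.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        feature = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (feature >> i) & 1 else -1
    # A fingerprint bit is set when the tokens voted for it on balance.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


TEMPLATE = ("My name is {name}, a graduate student in the Department of "
            "Computer Science at Old Dominion University. I am attending the "
            "{edition} ACM/IEEE Joint Conference on Digital Libraries (JCDL), {year}.")

original = simhash(TEMPLATE.format(name="Mohamed Aturban", edition="18th", year="2018"))
small_edit = simhash(TEMPLATE.format(name="Mohamed Aturban", edition="18th", year="2019"))
large_edit = simhash(TEMPLATE.format(name="Sawood Alam", edition="19th", year="2019"))

print(f"{original:016x} vs {small_edit:016x}: {hamming(original, small_edit)} bits differ")
print(f"{original:016x} vs {large_edit:016x}: {hamming(original, large_edit)} bits differ")
```

Because each token only nudges the per-bit vote, a one-token edit tends to leave most fingerprint bits untouched. That tolerance is useful for near-duplicate detection, and it is exactly why a similarity hash is a poor fit for fixity checking.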
Is there an existing hashing function suitable for mementos?
Web archive → ? (an archive-aware hashing function) → Hash value
Generate hashes on a web page
• Compute a hash value on the downloaded HTML content
% curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
(curl downloads the page; shasum computes the SHA256 hash)
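The same download-and-hash step can be sketched in Python with only the standard library. The function names `sha256_of_bytes` and `sha256_of_uri` are illustrative, and the 30-second timeout is an arbitrary assumption.

```python
import hashlib
import urllib.request


def sha256_of_bytes(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def sha256_of_uri(uri: str, timeout: float = 30.0) -> str:
    """Download uri and hash the response body, like `curl -s uri | shasum -a 256`."""
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        return sha256_of_bytes(response.read())
```

Calling `sha256_of_uri("https://climate.nasa.gov/vital-signs/carbon-dioxide/")` would hash whatever the server returns at that moment, so the result is not expected to equal the value shown on the slide.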
Timeline:
August 2017: HTML content is downloaded; SHA256 hash = e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
September 2017: The archived page has been tampered with by changing the value of CO2
October 2017: HTML content is downloaded; SHA256 hash = fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
To verify fixity
• Compare the current hash and the previous hash
Hashes are NOT identical → the page has changed!
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
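The comparison step can be sketched as follows. `FixityRecord` and `verify_fixity` are hypothetical names, and the page bodies are placeholder bytes rather than real archived content.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FixityRecord:
    """A previously recorded observation of a memento (hypothetical structure)."""
    uri_m: str
    sha256: str
    observed: str


def verify_fixity(record: FixityRecord, current_body: bytes) -> bool:
    """True when the freshly downloaded body still matches the recorded hash."""
    return hashlib.sha256(current_body).hexdigest() == record.sha256


# Placeholder bodies standing in for two downloads of the same memento.
earlier_body = b"<html><body>CO2 reading: original value</body></html>"
later_body = b"<html><body>CO2 reading: altered value</body></html>"

record = FixityRecord(
    uri_m="michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/",
    sha256=hashlib.sha256(earlier_body).hexdigest(),
    observed="August 2017",
)

print(verify_fixity(record, earlier_body))
print(verify_fixity(record, later_body))
```

A mismatch only says that the bytes differ. Deciding whether the difference is tampering or a harmless playback artifact is a separate problem.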
What if an image has changed?
Computing hashes on only the HTML content will NOT detect such changes
Potential solution: include all
resources in hash calculation
A composite memento: https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/
It has:
• 201 images
• 19 JavaScript files
• 3 CSS files
• Main HTML file
→ A single aggregated hash value
Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping
It turns out that it is hard to get repeatable hashes on composite mementos
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
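One possible way to fold all resources into a single aggregated value is sketched below. This is an assumption for illustration, not the scheme used by the tools cited above: each resource is hashed on its own, the (URI, hash) pairs are sorted by URI, and the sorted listing is hashed again. The resource URIs and payloads are placeholders.

```python
import hashlib
from typing import Mapping


def composite_hash(resources: Mapping[str, bytes]) -> str:
    """Aggregate SHA-256 over a mapping of resource URI -> payload bytes."""
    outer = hashlib.sha256()
    # Sorting makes the result independent of download order.
    for uri in sorted(resources):
        inner = hashlib.sha256(resources[uri]).hexdigest()
        outer.update(f"{uri} {inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Placeholder composite memento: main HTML plus a few embedded resources.
memento = {
    "index.html": b"<html>...</html>",
    "style.css": b"body { margin: 0 }",
    "logo.png": b"\x89PNG placeholder bytes",
}

tampered = dict(memento, **{"logo.png": b"\x89PNG different bytes"})

print(composite_hash(memento))
print(composite_hash(tampered))
```

With this construction a change to any embedded resource changes the aggregate, which also means any unstable resource makes the aggregate unrepeatable.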
What is a composite memento?
“A composite memento is a root resource
such as an HTML web page and all of the
embedded resources (images, CSS, etc.)
required for a complete presentation”
http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
WADL 2018, 2018-06-06 @maturban1
Research Questions
R.Q.1 How can we construct an archive-aware hashing function for generating repeatable fixity information on archived web pages?
R.Q.2 How can we develop an approach that uses this information to verify fixity and detect changes in archived resources over time?
Dissertation plan
Literature review
Identify types of changes and define requirements for generating repeatable hashes
Build a probability model of rendering the same archived pages
PhD Defense
Study different existing data models for serializing fixity information and select one
Implement different services to:
- Generate fixity information
- Publish fixity information on the web
- Verify archived resources
Evaluate the framework
Finish writing the dissertation
Define the framework’s structure of verifying fixity
Analyze the same archived pages downloaded at different times
(Slide labels: Preliminary work, Future work)
Related Work
TRAC (2007): Establishing trusted archives
- TRAC is not for playback
Lerner et al. (2017): Vulnerabilities
- Discovered four vulnerabilities in the Internet Archive’s Wayback Machine
J. Cushman et al. (2017): More potential threats
- Demonstrate potential threats in web archives
Rosenthal et al. (2005): Threats
- Described several threats against digital preservation systems
Juan Benet (2017): Multihash
- Self-identifying hashes for IPFS
OpenTimestamps, OriginStamp, Gipp (2015, 2016), and Chainpoint: Trusted timestamps in Blockchain
- Not suitable for composite mementos
T. Kuhn et al. (2014): Trusty URI
- A URI that contains a hash value of the content it identifies
P. Maniatis et al. (2005): Distributed copies of archived resources
- The scope and content are narrowly defined
Archives transform original content to appropriately replay mementos in a user’s browser
Transformation examples:
• Add banners
• Rewrite links to point to the archive, not to the live web
• Modify HTML code to convey metadata
• Apply some policies for security (e.g., block some content)
• Provide the content in different formats (e.g., ZIP and screenshots)
Archives add banners
• To convey information like the number of mementos and inform users that what they are viewing is from the archive
• Banners change → different hashes
Replayed in 2016 (43 mementos) vs. replayed in 2017 (49 mementos)
http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
Archives rewrite links to embedded
resources
Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/
Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
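The rewriting shown above follows a simple pattern that can be sketched in Python. The prefix constant, the function names, and the regular expression are assumptions based only on the two URIs on this slide. Real archives use additional modifiers and URL forms that this sketch does not cover.

```python
import re

ARCHIVE_PREFIX = "http://web.archive.org/web"  # taken from the slide's example

# Assumed shape: prefix / 14-digit timestamp + optional two-letter modifier / original URI
_REWRITTEN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})([a-z]{2}_)?/(.+)$")


def rewrite_link(original_uri: str, timestamp: str, modifier: str = "") -> str:
    """Build an archive URI that points at the archived copy of original_uri."""
    return f"{ARCHIVE_PREFIX}/{timestamp}{modifier}/{original_uri}"


def original_link(rewritten_uri: str) -> str:
    """Recover the original URI from a rewritten one; return input unchanged if no match."""
    match = _REWRITTEN.match(rewritten_uri)
    return match.group(3) if match else rewritten_uri


rewritten = rewrite_link(
    "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif", "19970725063110", "im_"
)
print(rewritten)
print(original_link(rewritten))
```

Undoing the rewrite before hashing is one conceivable way to keep archive-specific link changes out of the fixity value, under the assumption that the rewriting is fully reversible.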
Live web resources linked from archives
• Resources from the live web are expected to change → different hashes
• Based on feedback from Lerner et al., the Internet Archive (IA) solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Example: a memento archived in 2008 and replayed in 2012 shows an ad from 2012
A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
Caches will temporarily hide changes in the playback → different hashes
% date
Mon Oct 2 01:15:18 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:16:29 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:19:31 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 02:10:24 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
dda6a9bf091d412cbdc2226ce3eb1059
X-Page-Cache header values annotated on the slide: MISS, HIT, MISS, HIT
Dynamic content by JS → different hashes
A large number of mementos are unavailable
A resource selected randomly by JavaScript
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
function random_imglink(){
  myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
  myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
  myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
  var ry=Math.floor(Math.random(1)*myimages.length)
  if (ry==0)
    ry=1
  document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
}
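The selection logic in the archived script can be traced with a small Python port. It assumes that `myimages` is a JavaScript array populated only at indices 1 to 3, so its `length` is 4, and that `Math.random` ignores its argument and returns a value in [0, 1). Neither declaration appears in the excerpt, so both are assumptions.

```python
import math
from collections import Counter


def pick_index(random_value: float, array_length: int = 4) -> int:
    """Port of the index selection: floor(random * length), with 0 mapped to 1."""
    ry = math.floor(random_value * array_length)
    if ry == 0:
        ry = 1
    return ry


# Sweep evenly spaced stand-ins for Math.random() to see how often each banner wins.
counts = Counter(pick_index(i / 1000) for i in range(1000))
for index in sorted(counts):
    print(f"myimages[{index}] chosen {counts[index]} times out of 1000")
```

Under these assumptions every replay picks one of three banners at random, so two downloads of the same memento can embed different images and yield different composite hashes.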
Changes in TimeMaps
A TimeMap = a list of available mementos = URI-M1, URI-M2, URI-M3, URI-M4, URI-M5
The requested memento is unavailable
[Timeline figure: the requested memento, URI-M2, is marked X]
Mementos with the same content are not available either
[Timeline figure: further mementos are marked X]
URI-M2 redirects (302 Redirect) to another memento (URI-M4), which has different content
Changes in TimeMaps → different hashes
November 2017: requesting URI-M1; HTML content is downloaded; hash = d13a 247e 872e 11d5 64f4 b49a 24d8 275c a09f ee8d 48c0 0345 f458 5d4b 7ec3 e663
December 2017: requesting URI-M1, which was NOT available; 302 Redirect to URI-M2; HTML content is downloaded; hash = 55b5 6d82 7f98 f81e 3fc6 9e03 c0c1 f739 7fa4 0bff 4e36 0303 9ddd 50a2 6ae2 8229
URI-M1 = web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
URI-M2 = web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
Transient error
• Incomplete HTTP entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
Download the image on December 7, 2017
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-12-07T10:04:18Z
…
Content-Length: 459640
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 07 Dec 2017 10:04:18 GMT
…
The first Content-Length (of the WARC record) should be bigger than the second one (of the HTTP entity)
The complete HTTP entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-11-16T15:34:37Z
…
Content-Length: 643398
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 16 Nov 2017 15:34:36 GMT
…
This is what it should look like
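The two Content-Length values in each record support a simple sanity check, sketched here in Python. The function names are hypothetical. The check relies on the fact that a WARC response record's block contains the HTTP headers plus the entity body, so the record length cannot be smaller than the entity length.

```python
def entity_is_complete(declared_length: int, body: bytes) -> bool:
    """True when the received body has exactly the length the HTTP header declared."""
    return len(body) == declared_length


def warc_record_is_plausible(warc_content_length: int, http_content_length: int) -> bool:
    """True when the WARC record is at least as long as the HTTP entity it wraps."""
    return warc_content_length >= http_content_length


# Values copied from the two records shown on the slides.
print(warc_record_is_plausible(459640, 642336))  # download of December 7, 2017
print(warc_record_is_plausible(643398, 642336))  # download of November 16, 2017
```

Running a check like this before hashing would let a tool discard truncated downloads instead of recording them as content changes.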
Requirements for an archive-aware
hashing function
1. Generate a hash on a composite memento
2. Exclude archive-specific resources
3. Avoid resources from the live web
4. Avoid content served from cache
5. Changes in TimeMaps might affect the
computation of hashes
6. Avoid including dynamic content
Preliminary work
Our study indicates that 28% of mementos produce different hashes
• 17,074 archived pages
• From 17 public web archives
• Downloaded 20 times
• Between November 16, 2017 and March 27, 2018
Mementos with different hashes per archive
Archive                     Mementos   Mementos with different hashes (%)
archive.org                    1,600   1,027 (64%)
webarchive.loc.gov             1,600     821 (51%)
vefsafn.is                     1,600     764 (48%)
arquivo.pt                     1,600     305 (19%)
webcitation.org                1,600      57 (4%)
archive.is                     1,600       0 (0%)
archive-it.org                 1,407     489 (35%)
swap.stanford.edu              1,233     195 (16%)
nationalarchives.gov.uk        1,011      95 (9%)
europarchive.org                 990      97 (10%)
webharvest.gov                   733     178 (24%)
digar.ee                         518      81 (16%)
webarchive.proni.gov.uk          477      50 (10%)
webarchive.org.uk                362     275 (76%)
collectionscanada.gc.ca          359      13 (4%)
archive.bibalex.org              202     156 (77%)
perma-archives.org               182     147 (81%)
Total                         17,074   4,750 (28%)
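The totals and percentages in the table can be rechecked from the per-archive counts with a few lines of Python. The counts are copied from the table; the variable names are illustrative.

```python
# (archive, mementos, mementos with different hashes), copied from the table.
ROWS = [
    ("archive.org", 1600, 1027),
    ("webarchive.loc.gov", 1600, 821),
    ("vefsafn.is", 1600, 764),
    ("arquivo.pt", 1600, 305),
    ("webcitation.org", 1600, 57),
    ("archive.is", 1600, 0),
    ("archive-it.org", 1407, 489),
    ("swap.stanford.edu", 1233, 195),
    ("nationalarchives.gov.uk", 1011, 95),
    ("europarchive.org", 990, 97),
    ("webharvest.gov", 733, 178),
    ("digar.ee", 518, 81),
    ("webarchive.proni.gov.uk", 477, 50),
    ("webarchive.org.uk", 362, 275),
    ("collectionscanada.gc.ca", 359, 13),
    ("archive.bibalex.org", 202, 156),
    ("perma-archives.org", 182, 147),
]

total_mementos = sum(mementos for _, mementos, _ in ROWS)
total_changed = sum(changed for _, _, changed in ROWS)
percentages = [round(100 * changed / mementos) for _, mementos, changed in ROWS]

for (name, mementos, changed), pct in zip(ROWS, percentages):
    print(f"{name:<26}{mementos:>7,}{changed:>8,} ({pct}%)")
print(f"{'Total':<26}{total_mementos:>7,}{total_changed:>8,} "
      f"({round(100 * total_changed / total_mementos)}%)")
```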
An Approach for Verifying Fixity of
Archived Web Resources
Use web archives to monitor web archives
Step 1: Push to multiple archives
[Figure: climate.nasa.gov pushed to multiple web archives]
Step 2: Compute fixity,
publish fixity “manifest” at a well-known location
• Compute fixity on things that should not change, like JPEGs and certain original HTTP response headers.
• This example assumes the existence of a well-known server, manifest.org.
• Actual URIs can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
Wondering about veracity of an archived page?
Check manifest.org and recompute fixity.
We can’t know that archive.org did not alter contents on ingest (20180321), but we can verify that it has not changed since our observation (20180322)
What if manifest.org is down, or possibly hacked?
Step 3: Push manifest to multiple archives
• Now the 20180322 version of the manifest of archive.org’s memento of climate.nasa.gov is in four different archives.
• The URIs are more complex, but the bottom line is that an attacker would have to hack a majority of 5 domains (manifest.org + 4 archives)
Wondering about veracity of an archived
page? Check all copies of manifest.org and
take a majority vote
• Caveat 1: If I can hack the nasa.gov page at archive.org, I can probably hack the fixity info there too, so we really have 4 copies, not 5.
• Caveat 2: archive.org and archive-it.org are not independent, so we really have 3 copies, not 5.
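The majority vote over manifest copies can be sketched in Python as below. The function name and the placeholder hash strings are hypothetical, and the sketch assumes the caller has already reduced the list to independent copies, as the caveats above require.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_hash(observed: Sequence[str]) -> Optional[str]:
    """Return the hash reported by a strict majority of copies, or None if there is none."""
    if not observed:
        return None
    value, count = Counter(observed).most_common(1)[0]
    return value if count > len(observed) / 2 else None


# Placeholder fixity values read from five copies of the manifest; one was altered.
copies = ["hash-A", "hash-A", "hash-B", "hash-A", "hash-A"]
print(majority_hash(copies))

# With only two disagreeing copies there is no majority to trust.
print(majority_hash(["hash-A", "hash-B"]))
```

Requiring a strict majority means an attacker has to alter more than half of the independent copies before the vote flips.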
Probability model of rendering
the same archived pages
- Download a number of mementos multiple
times at different points in time
- Compute a hash on each memento after
each download
- Try to find an answer to the question: what
is the probability of getting the same hash
value?
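One simple empirical estimate of that probability is sketched below. It is an assumption about how such a model could start, not the model proposed in the dissertation: for each memento it measures how often a repeat download reproduces the hash of the first download, and across mementos it measures the share whose hash never changed. The observation data are placeholders.

```python
from typing import Mapping, Sequence


def repeat_rate(hashes: Sequence[str]) -> float:
    """Fraction of repeat downloads whose hash equals the first download's hash."""
    if len(hashes) < 2:
        raise ValueError("need at least two downloads")
    first, repeats = hashes[0], hashes[1:]
    return sum(h == first for h in repeats) / len(repeats)


def fraction_stable(observations: Mapping[str, Sequence[str]]) -> float:
    """Fraction of mementos for which every download produced the same hash."""
    stable = sum(len(set(hashes)) == 1 for hashes in observations.values())
    return stable / len(observations)


# Placeholder observations: memento URI -> hashes from successive downloads.
observations = {
    "URI-M1": ["a", "a", "a", "a"],
    "URI-M2": ["b", "b", "c", "b"],
}

for uri_m, hashes in observations.items():
    print(uri_m, repeat_rate(hashes))
print("stable fraction:", fraction_stable(observations))
```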
Study different existing data models for serializing fixity information and select one
- Open Annotation Data Model (OA)
- Open Archives Initiative: Object Reuse and Exchange (OAI-ORE)
- Linked Data Platform (LDP)
- Open Annotation Protocol
- WAT/WET
- BagIt
When can fixity information be
generated?
1. By the archive on ingest
2. At any time after the archived content is made available
[Timeline: t0 (option 1), t7 (option 2)]
• A web page is captured by the archive at t0
• By option 1, we can trust the archived page since t0
• By option 2, we can trust the archived page since t7, but we cannot detect changes before t7
Is our proposed
framework scalable?
• Time required to generate fixity information per memento per archive
• Time required to verify fixity of mementos
• The size of the fixity information “manifest”
• Scalable? Or only for important mementos (e.g., the archived NASA page)
Only independent copies of manifest
should be counted
• Copies in the Internet Archive and Archive-It are not independent
• Copies of copies are not independent
Dissertation plan
Read literature
Identify types of changes and define requirements for generating repeatable hashes
Build a probability model of rendering the same archived pages
PhD Defense
Study different existing data models for serializing fixity information and select one
Implement different services to:
- Generate fixity information
- Publish fixity information on the web
- Verify archived resources
Evaluate the framework
Finish writing the dissertation
Define the framework’s structure of verifying fixity
Analyze the same archived pages downloaded at different times
[ Current State ]
Preliminary work
Dates shown on the slide: July 2018, November 2018, August 2018, February 2019, May 2019
Extra slides
Creating trusted archives
Trustworthy Repositories Audit & Certification: Criteria and Checklist, Version 1.0, The Center for Research Libraries and OCLC Online Computer Library Center, Inc. (2007). (Sections B2.9 and B4.4)
• Web archives must create preservation metadata that can be used to verify fixity
• Preserved content should be stored separately from fixity information (hard for someone to alter both)
Threats
• Rosenthal et al. described several threats against digital preservation systems:
- Media failure, software failure, failure of network services, natural disaster, internal attack, organizational failure, hardware failure, …
• Lerner et al. discovered four vulnerabilities in the Internet Archive’s Wayback Machine (i.e., Archive-Escapes, Same-Origin Escapes, Archive-Escapes + Same-Origin Escapes, and Anachronism-Injection) that attackers can leverage to modify a user’s view in a browser
• Cushman and Kreymer created a shared repository in May 2017 to demonstrate potential threats in web archives (e.g., controlling a user’s account due to Cross-Site Request Forgery (CSRF) or archived web resources reaching out to the live web)
• D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, and S. Morabito. Transparent Format Migration of Preserved Web Content. D-Lib Magazine, 11(1), 2005.
• A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
• J. Cushman and I. Kreymer. Thinking like a hacker: Security Considerations for High-Fidelity Web Archives. http://labs.rhizome.org/presentations/security.html, May 2017.
Hashes in URIs
• Kuhn et al. define a Trusty URI as a URI that contains a cryptographic hash value of the content it identifies. Trusty URIs can be generated on only two types of content: RDF graphs and byte-level content (i.e., no modules have been introduced for HTML documents).
• Juan Benet introduced Multihash mainly to create self-identifying hashes for IPFS content
• T. Kuhn and M. Dumontier. Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. In European Semantic Web Conference, pages 395–410. Springer, 2014.
• T. Kuhn and M. Dumontier. Making digital artifacts on the web verifiable and reliable. IEEE Transactions on Knowledge and Data Engineering, 27(9):2390–2400, 2015.
• Multihash. https://github.com/multiformats/multihash.
Trusted timestamps in blockchain-based networks
• OpenTimestamps, OriginStamp, and Chainpoint generate trusted timestamps using the Bitcoin blockchain.
• The common steps for timestamping:
1. Receiving a file, a hash, or plain text from a user
2. Generating a hash value of the received content
3. Converting the hash to a Bitcoin address
4. Issuing a Bitcoin transaction using the Bitcoin address
• The timestamp associated with the transaction is used as a trusted timestamp
• https://originstamp.org
• https://chainpoint.org/
• https://opentimestamps.org/
• A. Wright and P. De Filippi. Decentralized blockchain technology and the rise of lex cryptographia. In SSRN Electronic Journal, 2015.
Distributed copies of archived resources
• Lots of Copies Keep Stuff Safe (LOCKSS)
- Built so that each participating library has its own copy of scholarly papers
- LOCKSS regularly compares these copies and detects corrupted ones based on voting on the cryptographic hash of the content
- Replaces any corrupted copy with the right content
• The Internet Archive is planning to build a new archive in Canada to duplicate all current archived collections.
• P. Maniatis, M. Roussopoulos, T. J. Giuli, D. S. Rosenthal, and M. Baker. The LOCKSS peer-to-peer digital preservation system. ACM Transactions on Computer Systems (TOCS), 23(1):2–50, 2005.
• B. Kahle. Help Us Keep the Archive Free, Accessible, and Reader Private. https://blog.archive.org/2016/11/29/help-us-keep-the-archive-free-accessible-and-private/, November 2016.
Changes in TimeMaps → different image → different hashes
• You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images
https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
December 2017: requesting URI-M1
March 2018: requesting URI-M1, which was NOT available; 302 Redirect to URI-M2
URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
Changes in TimeMaps → different image that looks the same → different hashes
• You can't see the difference in the URI-M of the main HTML file or
in the embedded images
http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
[Figure: the page replayed on December 12, 2017 (requesting URI-M1) and on December 25, 2017 (requesting URI-M1 → 302 redirect → URI-M2; URI-M1 was NOT available, different image)]
URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
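In both redirect examples, URI-M1 and URI-M2 point at the same original resource and differ only in the 14-digit capture timestamp. A small sketch of how a client might flag that situation is shown below. The regular expression assumes the common Wayback-style layout `/<14 digits><optional two-letter modifier>/<original URI>`, and `parse_uri_m` and `redirected_to_other_capture` are hypothetical helper names, not part of any archive's API.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumed layout: .../<YYYYMMDDhhmmss><modifier such as id_ or im_>/<original URI>
_URI_M = re.compile(r"/(\d{14})(?:[a-z]{2}_?)?/(.+)$")


def parse_uri_m(uri_m: str) -> Optional[Tuple[datetime, str]]:
    """Split a URI-M into (capture datetime in UTC, original URI), or None."""
    match = _URI_M.search(uri_m)
    if match is None:
        return None
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group(2)


def redirected_to_other_capture(requested: str, served: str) -> bool:
    """True when the served URI-M is a different capture of the same original URI."""
    a, b = parse_uri_m(requested), parse_uri_m(served)
    if a is None or b is None:
        raise ValueError("not a recognizable URI-M")
    return a[1] == b[1] and a[0] != b[0]


# The two URI-Ms from the perma-archives.org example above.
PATH = ("http://umich.edu/includes/image/type/gallery/id/113/name/"
        "ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/")
URI_M1 = "perma-archives.org/warc/20170101182814id_/" + PATH
URI_M2 = "perma-archives.org/warc/20170619145458id_/" + PATH

if __name__ == "__main__":
    print(redirected_to_other_capture(URI_M1, URI_M2))  # True
```

A fixity check that records only the requested URI-M would miss this substitution, so recording the URI-M that was actually served alongside the hash seems prudent.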
82
Requesting the raw version returned “200 OK”, but the body was a
rewritten version that indicates “302 Redirect”
http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/
curl -I http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/
HTTP/1.1 200 OK
Date: Tue, 05 Jun 2018 17:34:19 GMT
Server: Apache/2.4.6 (Red Hat Enterprise Linux)
Content-Security-Policy: default-src 'self' style-src 'self' 'unsafe-inline'
Memento-Datetime: Wed, 13 Mar 2013 21:04:47 GMT
…
83
http://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
Requesting the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ redirects to the live web
84
http://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
Requesting the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ redirects to the live web
curl -iL --silent webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
Location: https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 301 Moved Permanently
Location: https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 302 Found
Location: http://www.usda.gov/wps/portal/usdahome
HTTP/1.1 301 Moved Permanently
Location: https://www.usda.gov/wps/portal/usdahome
location: https://www.usda.gov/
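The redirect chain above can be checked mechanically: walk the `Location` targets and report the first one whose host is outside the archive. The sketch below parses saved `curl -iL` header output rather than making network requests. `redirect_chain` and `first_hop_outside` are hypothetical helpers, and treating any host other than the archive's domain or its subdomains as "live web" is an assumption of the example.

```python
from typing import List, Optional
from urllib.parse import urlsplit


def redirect_chain(header_text: str) -> List[str]:
    """Collect Location targets, in order, from `curl -iL` header output."""
    chain = []
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            chain.append(value.strip())
    return chain


def first_hop_outside(chain: List[str], archive_domain: str) -> Optional[str]:
    """Return the first redirect target whose host is not the archive (or a subdomain)."""
    for target in chain:
        host = urlsplit(target).hostname or ""
        if host != archive_domain and not host.endswith("." + archive_domain):
            return target
    return None


# Header lines copied from the webharvest.gov example above.
SAMPLE = """\
HTTP/1.1 301 Moved Permanently
Location: https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 301 Moved Permanently
Location: https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
HTTP/1.1 302 Found
Location: http://www.usda.gov/wps/portal/usdahome
HTTP/1.1 301 Moved Permanently
Location: https://www.usda.gov/wps/portal/usdahome
location: https://www.usda.gov/
"""

if __name__ == "__main__":
    print(first_hop_outside(redirect_chain(SAMPLE), "webharvest.gov"))
```

On the sample, the third hop is the first one that leaves the archive, so any hash computed on the final response would describe the live page rather than the memento.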
When requesting the raw version, a third-party
service (Cloudflare) injects HTML code
curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:45 GMT
<a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:50 GMT
<a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:51 GMT
<a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
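The three responses above were fetched within seconds of each other, yet the injected `email-protection` fragment differs every time. A short sketch can show why. It assumes the commonly described obfuscation scheme in which the first byte of the hex payload is an XOR key for the remaining bytes; that scheme is an assumption here, not something the slides state, and `decode_payload` is a hypothetical helper. Only the leading 20 hex characters of each payload from the output above are used.

```python
import hashlib


def decode_payload(hex_payload: str) -> str:
    """Decode an email-protection payload, ASSUMING byte 0 is an XOR key."""
    raw = bytes.fromhex(hex_payload)
    key = raw[0]
    return bytes(b ^ key for b in raw[1:]).decode("utf-8", errors="replace")


# Leading 20 hex characters of the three payloads shown above.
PREFIXES = [
    "28175b5d4a424d4b5c15",  # response at 21:00:45
    "68571b1d0a020d0b1c55",  # response at 21:00:50
    "b986caccdbd3dcdacd84",  # response at 21:00:51
]

if __name__ == "__main__":
    for prefix in PREFIXES:
        digest = hashlib.sha256(prefix.encode("ascii")).hexdigest()[:16]
        # Each line prints the same decoded text next to a different digest.
        print(decode_payload(prefix), digest)
```

Under that assumption, all three prefixes decode to the same text (`?subject=`), which suggests the responses carry the same underlying link and differ only in a per-response key. That difference alone is enough to change a hash computed over the raw bytes.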
Establishing and Verifying Fixity of Archived Web Pages

  • 1. Establishing and Verifying Fixity of Archived Web Pages Mohamed Aturban Old Dominion University Advisors: Dr. Michele C. Weigle and Dr. Michael L. Nelson JCDL 2018 Doctoral Consortium June 3, 2018 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 2. 2 This is what climate.nasa.gov/vital-signs/carbon-dioxide/ looks like right now
  • 3. 3 The Internet Archive allows us to view previous versions (mementos) of that page
  • 6. 6 The page in other web archives for a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml Typical archive URI construction: archive.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide (2,782) (48) (3) (12) (0) (0) (3) The number of mementos available in the archive
  • 7. 7 What if we checked these archives? What if they all agree? Would you trust the results? climate.nasa.gov/vital-signs/carbon-dioxide/ climate.nasa.gov/vital-signs/carbon-dioxide/ climate.nasa.gov/vital-signs/carbon-dioxide/ climate.nasa.gov/vital-signs/carbon-dioxide/ michaelsevilwayback/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/ JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 8. 8 The web page is archived by Michael’s Evil Wayback in July 2017 Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/ JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 9. 9 Replaying the same memento in October 2017, we got a different CO2 Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/ JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 10. 10 Which one is the real memento? July 2017 October 2017 • How to ensure that a memento has remained unaltered since the time it was captured by the archive? JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1 Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/ Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  • 11. It is important to verify fixity of archived resources • Web archives will be the only evidence of what was in the live web • For example, The Data Refuge project is an attempt to preserve federal climate and environmental data - But in the future, how to verify archived copies in this specific archive have remained unchanged JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 12. 12 Do archived pages change? Time climate.nasa.gov/vital-signs/carbon-dioxide/ Live Web t0 t9 t14 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 13. 13 Do archived pages change? TimeLive Web TimeArchive URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 t0 t2 t4 t6 t9 t14 t16 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 14. 14 Do archived pages change? TimeLive Web TimeArchive Replay Time URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 15. 15 Do archived pages change? TimeLive Web TimeArchive Replay Time URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1 When replaying URI-M2 at different points in time, will we get the same content?
  • 16. 16 Do archived pages change? TimeLive Web TimeArchive Replay Time URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1 When replaying URI-M2 at different points in time, will we get the same content?
  • 17. 17 Do archived pages change? TimeLive Web TimeArchive Replay Time URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1 When replaying URI-M2 at different points in time, will we get the same content?
  • 18. 18 Do archived pages change? TimeLive Web TimeArchive Replay Time When replaying URI-M2 at different points in time, will we get the same content? URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 URI-M2 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 19. 19 Do archived pages change? TimeLive Web TimeArchive Replay Time Our study shows that we are not always presented with the same archived content! URI-M1 URI-M2 URI-M3 URI-M4 URI-M5 URI-M2 URI-M2 URI-M2 t0 t2 t4 t5 t6 t9 t14 t16 t17 t18 JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 20. 20 Cryptographic hashes to create fixity information • Common hash algorithms (e.g., MD5, SHA256): A small change in the input → a large change in the output My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018. SHA256 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019. SHA256 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
  • 21. SimHash: A small change in the input → a small change in the output My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018. SimHash ed646a9efbc77705 My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019. SimHash ed646a9efbc77305 https://github.com/leonsim/simhash • We cannot use SimHash on archived pages because small changes matter
  • 22. 22 SimHash: large changes in the input → large changes in the output My name is Mohamed Aturban, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018. SimHash ed646a9efbc77705 My name is Sawood Alam, a graduate student in the Department of Computer Science at Old Dominion University. I am attending the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019. SimHash ed666bdefbc77205
  • 23. 23 Is there an existing hashing function suitable for mementos? JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1 Web archive Hash value ? an archive-aware hashing function
  • 24. 24 Generate hashes on a web page • Compute a hash value on the downloaded HTML content % curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256 17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d Compute SHA256 hash Download the page JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 25. 25 To verify fixity • Compare the current hash and the previous hash August 2017: HTML content is downloaded, SHA256 hash e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521 September 2017: The archived page has been tampered with by changing the value of CO2 October 2017: HTML content is downloaded, SHA256 hash fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790 Hashes are NOT identical → the page has changed! http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  • 26. 26 What if an image has changed? Computing hashes on only HTML content will NOT detect changes JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 27. 27 Potential solution: include all resources in hash calculation https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/ • 201 images • 19 JavaScript files • 3 CSS files • Main HTML file A single aggregated hash value www.gwern.net/Timestamping (Existing tools for generating a hash value on a composite archived page ) has A composite memento https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html Turns out it is hard to get repeatable hashes on composite mementos
  • 28. 28 What is a composite memento? “A composite memento is a root resource such as an HTML web page and all of the embedded resources (images, CSS, etc.) required for a complete presentation” http://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html WADL 2018, 2018-06-06 @maturban1
  • 29. 29 Research Questions R.Q.1 How can we construct an archive- aware hashing function for generating repeatable fixity information on archived web pages? R.Q.2 How to develop an approach to use this information to verify fixity and detect changes in archived resources over time? JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 30. 30 Dissertation plan Literature review Identify types of changes and define requirements for generating repeatable hashes Build a probability model of rendering the same archived pages PhD Defense Study different existing data models for serializing fixity information and select one Implement different services to: - Generate fixity information - Publish fixity information on the web - Verify archived resources Evaluate the framework Finish writing the dissertation Define the framework’s structure of verifying fixity Analyze the same archived pages downloaded at different times PreliminaryworkFuturework
  • 31. 31 Related Work TRAC (2007) Establishing trusted archives - TRAC not for playback Lerner et al. (2017) Vulnerabilities - Discovered four vulnerabilities in the Internet Archive’s Wayback Machine J. Cushman et al. (2017) More potential threats - Demonstrate potential threats in web archives Rosenthal et al. (2005) Threats - Described several threats against digital preservation systems Juan Benet (2017) Multihash - Self identifying hashes for IPFS OpenTimestamps, OriginStamp, Gipp (2015, 2016), and Chainpoint Trusted timestamps in Blockchain - Not suitable for composite mementos T. Kuhn et al. (2014) Trusty URI - A URI that contains a hash value of the content it identifies P. Maniatis et al. (2005) Distributed copies of archived resources - The scope and content are narrowly defined
  • 32. 32 Dissertation plan Literature review Identify types of changes and define requirements for generating repeatable hashes Build a probability model of rendering the same archived pages PhD Defense Study different existing data models for serializing fixity information and select one Implement different services to: - Generate fixity information - Publish fixity information on the web - Verify archived resources Evaluate the framework Finish writing the dissertation Define the framework’s structure of verifying fixity Analyze the same archived pages downloaded at different times PreliminaryworkFuturework
  • 33. 33 Archives transform original content to appropriately replay mementos in a user’s browser • Add banners • Rewrite links to point to the archive, not to the live web • Modify HTML code to convey metadata • Apply some policies for security (e.g., block some content) • Provide the content in different format (e.g., ZIP and screenshots) Transformation examples: JCDL 2018 Doctoral Consortium, 2018-06-03 @maturban1
  • 34. 34 Archives add banners • To convey information like the number of mementos and inform users that what they are viewing is from the archive • Banners change → different hashes Replayed in 2016 (43 mementos) Replayed in 2017 (49 mementos) http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
  • 35. Archives rewrite links to embedded resources
      - Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/
      - Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
      - Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
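To compare a rewritten link with the link in the original page, the archive prefix has to be removed first. The following is a minimal sketch, assuming Wayback-style URI-Ms of the form prefix/14-digit datetime/original URI with an optional two-letter modifier such as `im_`; the function name `split_uri_m` is hypothetical, and other archives may use different URI patterns.

```python
import re

# Wayback-style URI-M: <archive prefix>/<14-digit datetime><optional modifier>/<original URI>
URI_M = re.compile(
    r"^(?P<prefix>.*?/)(?P<datetime>\d{14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)

def split_uri_m(uri_m):
    """Split a URI-M into (datetime, modifier, original URI)."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    return match.group("datetime"), match.group("modifier") or "", match.group("original")

rewritten = ("http://web.archive.org/web/19970725063110im_/"
             "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif")
print(split_uri_m(rewritten))
```

The recovered original URI matches the link found in the page before rewriting, while the datetime shows that the embedded image was captured in 1997 even though the page itself is from 1996.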
  • 36. Live web resources linked from archives
      - Resources from the live web are expected to change → different hashes
      - Based on feedback from Lerner et al., the Internet Archive (IA) solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
      - Example (http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html): the page was archived in 2008, this memento was replayed in 2012, and the ad is from 2012
      - A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
  • 37. Caches will temporarily hide changes in the playback → different hashes
      % date
      Mon Oct 2 01:15:18 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 01:16:29 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 01:19:31 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 02:10:24 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      dda6a9bf091d412cbdc2226ce3eb1059
      X-Page-Cache response header values annotated on the slide: MISS, HIT, MISS, HIT
  • 38. Dynamic content by JS → different hashes
  • 41. Dynamic content by JS → different hashes. A large number of mementos are unavailable
  • 42. 42 A resource selected randomly by JavaScript https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 45. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
      function random_imglink(){
        myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
        myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
        myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
        var ry=Math.floor(Math.random(1)*myimages.length)
        if (ry==0) ry=1
        document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
      }
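A short worked example shows how often two replays of this page would pick the same banner. It is a sketch under one assumption that the slide does not show: `myimages` is an array filled only at indices 1 to 3, so `myimages.length` is 4 and `Math.floor(...)` yields 0, 1, 2, or 3 with equal probability. The helper names are hypothetical.

```python
from fractions import Fraction

def banner_distribution(array_length=4):
    """Exact distribution of the chosen index for floor(random * length), with 0 remapped to 1."""
    dist = {}
    for ry in range(array_length):        # each floor() result is equally likely
        chosen = 1 if ry == 0 else ry     # mirrors: if (ry==0) ry=1
        dist[chosen] = dist.get(chosen, Fraction(0)) + Fraction(1, array_length)
    return dist

def same_banner_probability(dist):
    """Probability that two independent replays select the same banner."""
    return sum(p * p for p in dist.values())

dist = banner_distribution()
print(dist)
print(same_banner_probability(dist))
```

Under that assumption the first banner is chosen half of the time and the other two a quarter of the time each, so two replays agree with probability 3/8. A hash that includes the selected image would therefore differ more often than not.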
  • 46. Changes in TimeMaps
      - A TimeMap is a list of the available mementos, here URI-M1, URI-M2, URI-M3, URI-M4, and URI-M5
      - [Figure: three timelines (live web, archive, replay) with time points t0 to t18; the archive holds URI-M1 to URI-M5, and URI-M2 is requested three times at replay time]
  • 47. The requested memento is unavailable
      - [Figure: the same timelines, with the requested URI-M2 marked as unavailable (X)]
  • 48. Mementos with the same content are not available either
      - [Figure: the same timelines, with three mementos marked as unavailable (X)]
  • 49. URI-M2 redirects to another memento (URI-M4), which has different content
      - [Figure: the same timelines, with three mementos marked as unavailable (X) and a 302 redirect to URI-M4]
  • 50. Changes in TimeMaps → different hashes
      - URI-M1 = web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
      - URI-M2 = web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
      - November 2017: requesting URI-M1; the HTML content is downloaded; hash: d13a 247e 872e 11d5 64f4 b49a 24d8 275c a09f ee8d 48c0 0345 f458 5d4b 7ec3 e663
      - December 2017: requesting URI-M1, but URI-M1 was NOT available; a 302 redirect leads to URI-M2; the HTML content is downloaded; hash: 55b5 6d82 7f98 f81e 3fc6 9e03 c0c1 f739 7fa4 0bff 4e36 0303 9ddd 50a2 6ae2 8229
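A verifier can flag this case before comparing hashes by checking whether the URI-M that was finally served carries the same datetime as the one requested. The following is a minimal sketch, assuming Wayback-style URI-Ms with a 14-digit datetime; both function names are hypothetical.

```python
import re

def memento_datetime(uri_m):
    """Return the 14-digit datetime embedded in a Wayback-style URI-M, or None."""
    match = re.search(r"/(\d{14})(?:[a-z]{2}_)?/", uri_m)
    return match.group(1) if match else None

def redirected_to_other_memento(requested_uri_m, final_uri_m):
    """True when the served URI-M carries a different datetime than the requested one."""
    return memento_datetime(requested_uri_m) != memento_datetime(final_uri_m)

uri_m1 = "web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/"
uri_m2 = "web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/"
print(redirected_to_other_memento(uri_m1, uri_m1))  # November 2017 case
print(redirected_to_other_memento(uri_m1, uri_m2))  # December 2017 case
```

When the check returns True, a hash mismatch may only mean that the TimeMap changed, so the result should not be read as evidence that URI-M1 itself was altered.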
  • 51. Transient error: incomplete HTTP entity
      - http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      - Image downloaded on December 7, 2017:
          WARC/1.0
          WARC-Type: response
          WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
          WARC-Date: 2017-12-07T10:04:18Z
          …
          Content-Length: 459640

          HTTP/1.0 200
          Content-Type: image/jpeg
          Content-Length: 642336
          Date: Thu, 07 Dec 2017 10:04:18 GMT
          …
      - The first Content-Length (WARC record) should be bigger than the second one (HTTP entity)
  • 52. The complete HTTP entity
      - http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      - This is what it should look like:
          WARC/1.0
          WARC-Type: response
          WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
          WARC-Date: 2017-11-16T15:34:37Z
          …
          Content-Length: 643398

          HTTP/1.0 200
          Content-Type: image/jpeg
          Content-Length: 642336
          Date: Thu, 16 Nov 2017 15:34:36 GMT
          …
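The two records can be told apart with a simple length check. In a WARC response record, the record-level Content-Length covers the HTTP headers plus the entity body, so it cannot be smaller than the HTTP Content-Length of a complete download. A minimal sketch with a hypothetical function name:

```python
def entity_is_complete(warc_content_length, http_content_length):
    """A complete record block (HTTP headers + body) is at least as long as the HTTP entity."""
    return warc_content_length >= http_content_length

print(entity_is_complete(459640, 642336))  # December 7, 2017 download
print(entity_is_complete(643398, 642336))  # November 16, 2017 download
print(643398 - 642336)                     # bytes taken by the HTTP headers in the complete record
```

A verifier could skip hashing, or retry the download, whenever this check fails, so that a transient transfer error is not reported as a fixity violation.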
  • 53. Requirements for an archive-aware hashing function
      1. Generate a hash on a composite memento
      2. Exclude archive-specific resources
      3. Avoid resources from the live web
      4. Avoid content served from cache
      5. Changes in TimeMaps might affect the computation of hashes
      6. Avoid including dynamic content
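Requirements 1 to 3 can be illustrated with a small sketch. It is not the framework's hashing function: the function name, the `excluded_paths` parameter, and the example URIs on archive.example.org are all hypothetical, and requirements 4 to 6 (caches, TimeMap changes, dynamic content) are not addressed.

```python
import hashlib
from urllib.parse import urlsplit

def composite_hash(resources, archive_host, excluded_paths=()):
    """Hash a composite memento given as (uri, payload bytes) pairs.

    Skips resources served from outside the archive (live web) and
    archive-specific resources whose URI contains an excluded path.
    """
    digest = hashlib.sha256()
    for uri, payload in sorted(resources):               # sorting makes fetch order irrelevant
        if urlsplit(uri).hostname != archive_host:       # requirement 3: live-web resource
            continue
        if any(path in uri for path in excluded_paths):  # requirement 2: archive-specific
            continue
        digest.update(hashlib.sha256(payload).digest())  # requirement 1: fold in every part
    return digest.hexdigest()

page = [
    ("http://archive.example.org/web/20160708040004/http://example.com/", b"<html>...</html>"),
    ("http://archive.example.org/web/20160708040004im_/http://example.com/logo.gif", b"GIF89a"),
]
extras = [
    ("http://archive.example.org/static/banner.js", b"banner v2"),  # archive banner
    ("http://ads.example.net/ad.js", b"ad from the live web"),      # leaked live-web resource
]
print(composite_hash(page, "archive.example.org"))
print(composite_hash(page + extras, "archive.example.org", excluded_paths=("/static/",)))
```

Both calls print the same digest, because the banner script and the live-web ad are left out of the computation.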
  • 55. Preliminary work: our study indicates that 28% of mementos produce different hashes
      - 17,074 archived pages
      - From 17 public web archives
      - Downloaded 20 times
      - Between November 16, 2017 and March 27, 2018
  • 56. Mementos with different hashes per archive
      Archive                     Mementos   Mementos with different hashes (%)
      archive.org                    1,600   1,027 (64%)
      webarchive.loc.gov             1,600     821 (51%)
      vefsafn.is                     1,600     764 (48%)
      arquivo.pt                     1,600     305 (19%)
      webcitation.org                1,600      57 (4%)
      archive.is                     1,600       0 (0%)
      archive-it.org                 1,407     489 (35%)
      swap.stanford.edu              1,233     195 (16%)
      nationalarchives.gov.uk        1,011      95 (9%)
      europarchive.org                 990      97 (10%)
      webharvest.gov                   733     178 (24%)
      digar.ee                         518      81 (16%)
      webarchive.proni.gov.uk          477      50 (10%)
      webarchive.org.uk                362     275 (76%)
      collectionscanada.gc.ca          359      13 (4%)
      archive.bibalex.org              202     156 (77%)
      perma-archives.org               182     147 (81%)
      Total                         17,074   4,750 (28%)
  • 58. An Approach for Verifying Fixity of Archived Web Resources: use web archives to monitor web archives
  • 59. Step 1: Push to multiple archives
      - [Figure: climate.nasa.gov is pushed to multiple web archives]
  • 60. Step 2: Compute fixity and publish the fixity “manifest” at a well-known location
      - Compute fixity on things that should not change, like JPEGs and certain original HTTP response headers
      - This example assumes the existence of a well-known server, manifest.org
      - Actual URIs can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
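What such a manifest might contain can be sketched as follows. The slides do not define the manifest format or say which response headers are stable, so the field names, the `stable_headers` default, and the function name are all assumptions made for illustration.

```python
import hashlib
import json

def build_manifest(uri_m, payload, headers, stable_headers=("Content-Type", "Memento-Datetime")):
    """Return a JSON manifest covering the payload and a chosen subset of response headers."""
    # Keep only headers assumed not to change between replays (hypothetical choice).
    kept = {name: headers[name] for name in stable_headers if name in headers}
    header_bytes = json.dumps(kept, sort_keys=True).encode("utf-8")
    manifest = {
        "uri_m": uri_m,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "headers_sha256": hashlib.sha256(header_bytes).hexdigest(),
        "headers": kept,
    }
    return json.dumps(manifest, sort_keys=True)

uri_m = "http://archive.example.org/web/20180321000000id_/http://example.com/image.jpg"
first = build_manifest(uri_m, b"\xff\xd8jpeg-bytes",
                       {"Content-Type": "image/jpeg", "Date": "Thu, 22 Mar 2018 10:00:00 GMT"})
second = build_manifest(uri_m, b"\xff\xd8jpeg-bytes",
                        {"Content-Type": "image/jpeg", "Date": "Fri, 23 Mar 2018 11:00:00 GMT"})
print(first == second)
```

Because the `Date` header is left out, two downloads of unchanged content yield identical manifests, while any change to the payload changes `payload_sha256`.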
  • 61. Wondering about the veracity of an archived page? Check manifest.org and recompute fixity.
      - We can’t know that archive.org did not alter the contents on ingest (20180321), but we can verify that the memento has not changed since our observation (20180322)
      - What if manifest.org is down, or possibly hacked?
  • 62. Step 3: Push the manifest to multiple archives
      - Now the 20180322 version of the manifest of archive.org’s memento of climate.nasa.gov is in four different archives
      - The URIs are more complex, but the bottom line is that an attacker would have to hack a majority of 5 domains (manifest.org + 4 archives)
  • 63. Wondering about the veracity of an archived page? Check all copies of the manifest and take a majority vote
      - Caveat 1: If I can hack the nasa.gov page at archive.org, I can probably hack the fixity info there too, so we really have 4 copies, not 5
      - Caveat 2: archive.org and archive-it.org are not independent, so we really have 3 copies, not 5
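The majority vote, including the independence caveat, could look like the sketch below. The function name, the `same_operator` parameter, and the example hash values are hypothetical; the rule that dependent archives share one vote is one possible reading of Caveat 2.

```python
from collections import Counter

def majority_hash(copies, same_operator=()):
    """Vote over (archive, hash) manifest copies, counting each group of dependent archives once.

    same_operator is an iterable of sets of archives that are not independent.
    Returns the winning hash, or None when no hash has a strict majority.
    """
    group_of = {archive: min(group) for group in same_operator for archive in group}
    votes = {}
    for archive, digest in copies:
        # Only the first copy seen from each dependent group counts as a vote.
        votes.setdefault(group_of.get(archive, archive), digest)
    if not votes:
        return None
    digest, count = Counter(votes.values()).most_common(1)[0]
    return digest if count * 2 > len(votes) else None

copies = [
    ("archive.org", "bad"),
    ("archive-it.org", "bad"),
    ("archive.is", "good"),
    ("webcitation.org", "good"),
]
print(majority_hash(copies))
print(majority_hash(copies, same_operator=[{"archive.org", "archive-it.org"}]))
```

Counted naively, the four copies are tied and no hash wins. Once archive.org and archive-it.org share a single vote, the two independent archives form a majority.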
  • 65. Probability model of rendering the same archived pages
      - Download a number of mementos multiple times at different points in time
      - Compute a hash on each memento after each download
      - Try to answer the question: what is the probability of getting the same hash value?
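One simple way to turn such downloads into a number is an empirical estimate. The slides do not define the model, so the estimator below (the share of repeat downloads whose hash equals the hash of the first download) and its function name are assumptions; the hash values are placeholders.

```python
def same_hash_probability(hashes_per_memento):
    """Share of repeat downloads whose hash equals the first download's hash."""
    matches = repeats = 0
    for hashes in hashes_per_memento.values():
        first, later = hashes[0], hashes[1:]
        matches += sum(1 for h in later if h == first)
        repeats += len(later)
    return matches / repeats if repeats else None

observations = {
    "URI-M1": ["a1", "a1", "a1", "b7"],  # one of three repeats differs
    "URI-M2": ["c3", "c3", "c3", "c3"],  # stable across all downloads
}
print(same_hash_probability(observations))
```

Grouping the observations by archive before calling the function would give a per-archive estimate.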
  • 67. Study different existing data models for serializing fixity information and select one
      - Open Annotation Data Model (OA)
      - Open Archives Initiative: Object Reuse and Exchange (OAI-ORE)
      - Linked Data Platform (LDP)
      - Open Annotation Protocol
      - WAT/WET
      - BagIt
  • 70. When can fixity information be generated?
      1. By the archive on ingest
      2. At any time after the archived content is made available
      - A web page is captured by the archive at t0
      - With option 1 (fixity generated at t0), we can trust the archived page since t0
      - With option 2 (fixity generated later, at t7), we can trust the archived page since t7, but we cannot detect changes made before t7
  • 71. Is our proposed framework scalable?
      - Time required to generate fixity information per memento per archive
      - Time required to verify fixity of mementos
      - The size of the fixity information (“manifest”)
      - Is it scalable to all mementos, or practical only for important ones (e.g., the archived NASA page)?
  • 72. Only independent copies of the manifest should be counted
      - Copies in the Internet Archive and Archive-It are not independent
      - Copies of copies are not independent
  • 73. Dissertation plan (timeline)
      - Read literature
      - Identify types of changes and define requirements for generating repeatable hashes
      - Analyze the same archived pages downloaded at different times
      - Define the framework’s structure of verifying fixity
      - Build a probability model of rendering the same archived pages
      - Study different existing data models for serializing fixity information and select one
      - Implement different services to generate fixity information, publish fixity information on the web, and verify archived resources
      - Evaluate the framework
      - Finish writing the dissertation
      - PhD defense
      The slide marks the preliminary work as the current state and shows these dates for the remaining steps: July 2018, August 2018, November 2018, February 2019, and May 2019.
  • 75. Related Work: Creating trusted archives
      - Trustworthy Repositories Audit & Certification: Criteria and Checklist, Version 1.0, The Center for Research Libraries and OCLC Online Computer Library Center, Inc. (2007), Sections B2.9 and B4.4
      - Web archives must create preservation metadata that can be used to verify fixity
      - Preserved content should be stored separately from fixity information (so that it is hard for someone to alter both)
  • 76. Related Work: Threats
      - Rosenthal et al. described several threats against digital preservation systems: media failure, software failure, failure of network services, natural disaster, internal attack, organizational failure, hardware failure, …
      - Lerner et al. discovered four vulnerabilities in the Internet Archive’s Wayback Machine (i.e., Archive-Escapes, Same-Origin Escapes, Archive-Escapes + Same-Origin Escapes, and Anachronism-Injection) that attackers can leverage to modify a user’s view in a browser
      - Cushman and Kreymer created a shared repository in May 2017 to demonstrate potential threats in web archives (e.g., controlling a user’s account due to Cross-Site Request Forgery (CSRF) or archived web resources reaching out to the live web)
      References:
      - D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, and S. Morabito. Transparent Format Migration of Preserved Web Content. D-Lib Magazine, 11(1), 2005.
      - A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
      - J. Cushman and I. Kreymer. Thinking like a hacker: Security Considerations for High-Fidelity Web Archives. http://labs.rhizome.org/presentations/security.html, May 2017.
  • 77. Related Work: Hashes in URIs
      - Kuhn et al. define a Trusty URI as a URI that contains a cryptographic hash value of the content it identifies. Trusty URIs can be generated for only two types of content: RDF graphs and byte-level content (i.e., no modules have been introduced for HTML documents).
      - Juan Benet introduced Multihash, mainly to create self-identifying hashes for IPFS content
      References:
      - T. Kuhn and M. Dumontier. Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. In European Semantic Web Conference, pages 395–410. Springer, 2014.
      - T. Kuhn and M. Dumontier. Making digital artifacts on the web verifiable and reliable. IEEE Transactions on Knowledge and Data Engineering, 27(9):2390–2400, 2015.
      - Multihash. https://github.com/multiformats/multihash.
  • 78. Related Work: Trusted timestamps in blockchain-based networks
      - OpenTimestamps, OriginStamp, and Chainpoint generate trusted timestamps using the Bitcoin blockchain
      - The common steps for timestamping:
          1. Receive a file, a hash, or plain text from a user
          2. Generate a hash value of the received content
          3. Convert the hash to a Bitcoin address
          4. Issue a Bitcoin transaction using the Bitcoin address
      - The timestamp associated with the transaction is used as a trusted timestamp
      References:
      - https://originstamp.org
      - https://chainpoint.org/
      - https://opentimestamps.org/
      - A. Wright and P. De Filippi. Decentralized blockchain technology and the rise of lex cryptographia. In SSRN Electronic Journal, 2015.
  • 79. Related Work: Distributed copies of archived resources
      - Lots of Copies Keep Stuff Safe (LOCKSS):
          - Built so that each participating library has its own copy of scholarly papers
          - LOCKSS regularly compares these copies and detects corrupted ones by voting on the cryptographic hash of the content
          - Any corrupted copy is replaced with the right content
      - The Internet Archive is planning to build a new archive in Canada to duplicate all current archived collections
      References:
      - P. Maniatis, M. Roussopoulos, T. J. Giuli, D. S. Rosenthal, and M. Baker. The LOCKSS peer-to-peer digital preservation system. ACM Transactions on Computer Systems (TOCS), 23(1):2–50, 2005.
      - B. Kahle. Help Us Keep the Archive Free, Accessible, and Reader Private. https://blog.archive.org/2016/11/29/help-us-keep-the-archive-free-accessible-and-private/, November 2016.
  • 80. Changes in TimeMaps → different image → different hashes
      - December 2017: requesting URI-M1
      - March 2018: requesting URI-M1, but URI-M1 was NOT available; a 302 redirect leads to URI-M2
      - URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      - URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      - You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images
      - Main HTML file (same URI-M on both dates): https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
  • 81. Changes in TimeMaps → different image that looks the same → different hashes
      - December 12, 2017: requesting URI-M1
      - December 25, 2017: requesting URI-M1, but URI-M1 was NOT available; a 302 redirect leads to URI-M2, which is a different image
      - URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      - URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      - You can't see the difference in the URI-M of the main HTML file, nor the difference in the embedded images
      - Main HTML file (same URI-M on both dates): http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
  • 82. Requesting the raw version, received “200 OK” with a rewritten version that indicates “302 Redirect”
      curl -I http://wayback.vefsafn.is/wayback/20130313210447id_/http://vkontakte.ru/
      HTTP/1.1 200 OK
      Date: Tue, 05 Jun 2018 17:34:19 GMT
      Server: Apache/2.4.6 (Red Hat Enterprise Linux)
      Content-Security-Policy: default-src 'self' style-src 'self' 'unsafe-inline'
      Memento-Datetime: Wed, 13 Mar 2013 21:04:47 GMT
      …
  • 84. Requesting the raw version of webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/, it redirects to the live web
      curl -iL --silent webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/ | egrep -i "(HTTP/1.1|^location:)"
      HTTP/1.1 301 Moved Permanently
      Location: https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
      HTTP/1.1 301 Moved Permanently
      Location: https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/
      HTTP/1.1 302 Found
      Location: http://www.usda.gov/wps/portal/usdahome
      HTTP/1.1 301 Moved Permanently
      Location: https://www.usda.gov/wps/portal/usdahome
      location: https://www.usda.gov/
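A verifier can detect this escape by walking the Location values of the redirect chain and stopping at the first host outside the archive. This is a minimal sketch with a hypothetical function name; it treats subdomains of the archive domain (such as www.webharvest.gov) as part of the archive.

```python
from urllib.parse import urlsplit

def first_hop_outside_archive(locations, archive_domain):
    """Return the first Location whose host is neither the archive domain nor a subdomain of it."""
    for location in locations:
        host = urlsplit(location).hostname or ""
        if host != archive_domain and not host.endswith("." + archive_domain):
            return location
    return None

chain = [
    "https://webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/",
    "https://www.webharvest.gov/congress110th/20081124195939id_/http://www.usda.gov/",
    "http://www.usda.gov/wps/portal/usdahome",
    "https://www.usda.gov/wps/portal/usdahome",
    "https://www.usda.gov/",
]
print(first_hop_outside_archive(chain, "webharvest.gov"))
```

For the chain above, the third hop is the first one that leaves the archive, so any content downloaded after it comes from the live web and should not be hashed as archived content.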
  • 85. Requesting the raw version, a third-party service (Cloudflare) injects HTML code
      curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:45 GMT
      <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
      curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:50 GMT
      <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
      curl -silent http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:51 GMT
      <a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
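The three fragments differ on every request, yet they appear to encode the same text. The sketch below assumes that the first byte of each fragment is a per-response key and that every following byte is XORed with it; this scheme is inferred from the strings themselves and is not stated on the slide. The function name is hypothetical, and only the first 20 hex digits of each fragment are used.

```python
def decode_cloudflare_fragment(hex_string):
    """XOR every byte after the first with the first byte (assumed per-response key)."""
    data = bytes.fromhex(hex_string)
    key, body = data[0], data[1:]
    return bytes(b ^ key for b in body).decode("utf-8", errors="replace")

# First 20 hex digits of the fragment from each of the three responses
prefixes = ["28175b5d4a424d4b5c15", "68571b1d0a020d0b1c55", "b986caccdbd3dcdacd84"]
print([decode_cloudflare_fragment(p) for p in prefixes])
```

All three prefixes decode to the same text, which suggests that decoding the injected fragment before hashing (or excluding it from the hash) could make the hash repeatable even though the served bytes differ.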