Archive Assisted Archival
Fixity Verification Framework
JCDL 2019
Urbana-Champaign, Illinois
June 2-6, 2019
Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Department of Computer Science
Norfolk, Virginia 23529 USA
This is what
https://climate.nasa.gov/vital-signs/carbon-dioxide/
looks like right now
Archive Assisted Archival Fixity Verification Framework · JCDL 2019 · June 4, 2019 · Urbana-Champaign, Illinois · @WebSciDL
The Internet Archive allows us to view
previous versions (mementos) of that page
http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-dioxide/
https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/
An archived page (memento) from July 2018
The page is in other web archives
For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
Typical archive URI construction:
wayback.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide
Archives holding the page:
webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/
archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/
wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
Mementos available (figure labels): 4,172, 62, 3, 13, 3, 39
The web page is archived by
Michael’s Evil Wayback in July 2017
Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
Replaying the same memento in October 2017,
we got a different CO2 value
Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
Which one is the real memento?
July 2017 vs. October 2017 (both replayed from Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/)
• How can we ensure that a memento has remained unaltered since the time it was captured by the archive?
Cryptographic hashes to create
fixity information
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
SHA256(HTML) = 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
SHA256(HTML) = 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
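The avalanche behaviour described on this slide can be reproduced in a few lines. This is a minimal Python sketch, assuming two made-up HTML snippets that differ in a single digit of the CO2 value; the snippets and the helper name `differing_hex_digits` are illustrative and do not come from the slides.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical page bodies: only one digit of the CO2 value differs.
original = b"<html><body>CO2: 407 ppm</body></html>"
altered = b"<html><body>CO2: 408 ppm</body></html>"

h1, h2 = sha256_hex(original), sha256_hex(altered)
print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of 64 hex digits differ")
```

Running it prints two digests that look unrelated even though the inputs differ by one byte, which is the property fixity checking relies on.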
Compute a hash value on the
downloaded HTML
Download the page and compute its SHA256 hash:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
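The same download-and-hash step can be sketched in Python with only the standard library. The function name `sha256_of_uri` and the chunk size are assumptions made for illustration; the digest of the live page will not match the value on the slide, because the page has changed since the talk.

```python
import hashlib
import urllib.request


def sha256_of_uri(uri: str, chunk_size: int = 65536) -> str:
    """Stream the resource at uri and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(uri) as response:
        # Read in chunks so large pages are never held in memory at once.
        for chunk in iter(lambda: response.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (requires network access):
# print(sha256_of_uri("https://climate.nasa.gov/vital-signs/carbon-dioxide/"))
```

Like the `curl | shasum` pipeline, this hashes the raw response body only; embedded images, scripts, and stylesheets are not covered.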
Compare the current hash with the previously calculated hash to verify the fixity
Timeline:
August 2017: HTML content is downloaded; SHA256 hash = e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
September 2017: The archived page has been tampered with by changing the value of CO2
October 2017: HTML content is downloaded; SHA256 hash = fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
Hashes are NOT identical → the page has changed!
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
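The comparison on this slide amounts to recomputing the hash and checking it against the stored one. Below is a minimal sketch, assuming the fixity value is stored as a `sha256:`-prefixed string; the page bodies are placeholders, not the real archived HTML.

```python
import hashlib


def compute_fixity(html: bytes) -> str:
    """Return the fixity string for a page body."""
    return "sha256:" + hashlib.sha256(html).hexdigest()


def fixity_unchanged(stored: str, html: bytes) -> bool:
    """True when the freshly computed hash equals the stored one."""
    return compute_fixity(html) == stored


# Placeholder bodies standing in for the August and October downloads.
august_body = b"<html><body>CO2: 407 ppm</body></html>"
october_body = b"<html><body>CO2: 300 ppm</body></html>"

stored = compute_fixity(august_body)           # recorded at first download
print(fixity_unchanged(stored, august_body))   # same content
print(fixity_unchanged(stored, october_body))  # tampered content
```

The check only says that the content differs from what was recorded; it cannot say which of the two versions is the authentic one, which is why the stored hash itself has to be protected.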
Two approaches for verifying the
fixity of archived web pages
• The Atomic approach
  • Generate a manifest file (a JSON file containing the fixity information) for each memento
  • Publish the manifest at a well-known web location
  • Disseminate the published manifest to several archives
• The Block approach
  • Batch together fixity information of multiple mementos in a single binary-searchable file (or block)
  • Publish the block at a well-known web location
  • Disseminate the published block to several archives
(Use web archives to monitor web archives)
Atomic approach (step 1):
Push a web page into multiple archives
Web page being pushed: https://2019.jcdl.org/
https://archive.is/20181224085310/https://2019.jcdl.org/
https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
http://www.webcitation.org/74tsy6pU0
This results in creating multiple mementos of the web page
Atomic approach (steps 2 & 3):
For each memento, compute fixity “manifest”
and publish it on the web at a well-known
Archival Fixity server
manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
• In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
• Actual URIs to manifests can be a bit more complex using “Trusty URIs”:
http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
The manifest’s generic URI always redirects to the most recent time-specific manifest version (trusty URI)
$ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
HTTP/2 200
(The URI passed to curl is the manifest’s generic URI; the location value is the manifest’s trusty URI.)
The structure of generic URIs is easy to remember:
<Archival-Fixity-Server>/<URI to memento>
So they can be used to look up manifests in both the Archival Fixity server and archives
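To make the two URI shapes concrete, the following Python sketch builds a generic manifest URI and splits a trusty one into its parts. The layout `<server>/manifest/<14-digit datetime>/<64 hex digits>/<URI to memento>` is inferred from the redirect shown on this slide and is an assumption, as are the names `generic_manifest_uri` and `parse_trusty_manifest_uri`.

```python
import re
from typing import NamedTuple

# Layout inferred from the slide's "location:" header (an assumption).
TRUSTY_RE = re.compile(
    r"^(?P<server>https?://[^/]+)/manifest/"
    r"(?P<created>\d{14})/(?P<digest>[0-9a-f]{64})/(?P<uri_m>.+)$"
)


class TrustyManifestURI(NamedTuple):
    server: str
    created: str  # 14-digit datetime of the manifest version
    digest: str   # 64 hex digits embedded in the trusty URI
    uri_m: str    # URI of the memento the manifest describes


def generic_manifest_uri(server: str, uri_m: str) -> str:
    """Build the easy-to-remember generic URI for a memento's manifest."""
    return f"{server}/manifest/{uri_m}"


def parse_trusty_manifest_uri(uri: str) -> TrustyManifestURI:
    """Split a trusty manifest URI into its components."""
    match = TRUSTY_RE.match(uri)
    if match is None:
        raise ValueError(f"not a trusty manifest URI: {uri}")
    return TrustyManifestURI(**match.groupdict())


trusty = (
    "https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/"
    "8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/"
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"
)
parts = parse_trusty_manifest_uri(trusty)
print(parts.created, parts.uri_m)
print(generic_manifest_uri(parts.server, parts.uri_m))
```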
{
  "@context": "http://manifest.ws-dl.cs.odu.edu/",
  "created": "Sun, 23 Dec 2018 11:43:55 GMT",
  "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "uri-r": "https://2019.jcdl.org/",
  "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
  "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
  "http-headers": {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
    "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
    "Preference-Applied": "original-links, original-content"
  },
  "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
  "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
}
A manifest file example
• The manifest records how the hashes are computed (the "hash-constructor" field)
• Hashes are computed on only the base HTML file
• Fixity is also computed on things that should not change, such as certain original HTTP response headers
• The "@id" value is the manifest’s Trusty URI
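The "hash-constructor" pipeline in the manifest can be approximated in Python. This sketch assumes, from reading the shell command, that the hashed input is the memento body followed directly by the selected header values joined with single spaces and no trailing newline. The function name `manifest_hash` and the placeholder body are illustrative; the real body is needed to reproduce the hash shown in the manifest.

```python
import hashlib
from typing import Sequence


def manifest_hash(body: bytes, header_values: Sequence[str]) -> str:
    """Hash body + space-joined header values with MD5 and SHA-256.

    Mirrors the output shape of the manifest's "hash" field:
    "md5:<hex> sha256:<hex>".
    """
    payload = body + " ".join(header_values).encode("utf-8")
    md5 = hashlib.md5(payload).hexdigest()
    sha256 = hashlib.sha256(payload).hexdigest()
    return f"md5:{md5} sha256:{sha256}"


# Header values copied from the manifest example; the body is a placeholder.
headers = [
    "text/html; charset=UTF-8",
    "Wed, 19 Dec 2018 10:20:36 GMT",
    '<https://2019.jcdl.org/wp-json/>; rel="https://api.w.org/"',
]
print(manifest_hash(b"<html>placeholder body</html>", headers))
```

Including the original response headers in the hashed input means that a change to either the HTML or those headers produces a different fixity value.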
Atomic approach (step 4):
Push manifests into multiple archives
• In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives, including IA
• An attacker would have to hack a majority of the 4 domains (archives)
Manifest being pushed: manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
http://www.webcitation.org/74tvdsyxe
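One way to read the "majority" argument is as a vote over the hash values recovered from each archived copy of the manifest. The slides do not spell out a voting rule, so the strict-majority rule below is an assumption; the archive names are taken from the slide and the hash strings are placeholders.

```python
from collections import Counter
from typing import Mapping, Optional


def majority_hash(archived: Mapping[str, str]) -> Optional[str]:
    """Return the hash reported by a strict majority of archives, else None."""
    if not archived:
        return None
    value, votes = Counter(archived.values()).most_common(1)[0]
    return value if votes > len(archived) / 2 else None


# Placeholder hashes: one archive disagrees with the other three.
copies = {
    "archive.is": "sha256:aaaa",
    "web.archive.org": "sha256:aaaa",
    "perma-archives.org": "sha256:aaaa",
    "webcitation.org": "sha256:ffff",
}
print(majority_hash(copies))

# A 2-2 split has no strict majority, so no value can be trusted.
split = dict(copies, **{"perma-archives.org": "sha256:ffff"})
print(majority_hash(split))
```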
Block approach (step 1):
Batch together fixity information of
multiple mementos in a single file (block)
• Adding additional metadata (e.g., created_at, fields, …)
• The hash of the previous block must be added
!context ["http://oduwsdl.github.io/contexts/fixity"]
!fields {keys: ["surt"]}
!id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
!meta {created_at: "20190111181327"}
!meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
!meta {type: "FixityBlock"}
org,archive,web)/web/19961022175434/http://search.com
org,archive,web)/web/19961219082428/http://sho.com
org,archive,web)/web/19961223174001/http://reference.com
…
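The two ideas on this slide, chaining each block to the hash of the previous one and keeping records sorted so they can be binary-searched, can be sketched as follows. This is an in-memory illustration only: the record values, the first `prev_block` value, and the second `created_at` value are placeholders, and the real block format carries more metadata than shown here.

```python
import bisect
import hashlib
from typing import List, Optional, Tuple


def block_digest(block_text: str) -> str:
    """Hash a whole block so the next block can reference it."""
    return "sha256:" + hashlib.sha256(block_text.encode("utf-8")).hexdigest()


def build_block(records: List[Tuple[str, str]], prev_block: str,
                created_at: str) -> str:
    """Serialize metadata lines followed by records sorted by key."""
    header = [
        f'!meta {{created_at: "{created_at}"}}',
        f'!meta {{prev_block: "{prev_block}"}}',
        '!meta {type: "FixityBlock"}',
    ]
    body = [f"{key} {value}" for key, value in sorted(records)]
    return "\n".join(header + body) + "\n"


def lookup(block_text: str, key: str) -> Optional[str]:
    """Binary-search the sorted record lines for key; return its value."""
    lines = [ln for ln in block_text.splitlines() if not ln.startswith("!")]
    keys = [ln.split(" ", 1)[0] for ln in lines]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return lines[i].split(" ", 1)[1]
    return None


# Keys are taken from the slide; the values are placeholder hashes.
first = build_block(
    [("org,archive,web)/web/19961219082428/http://sho.com", "sha256:bbbb"),
     ("org,archive,web)/web/19961022175434/http://search.com", "sha256:aaaa")],
    prev_block="sha256:0000", created_at="20190111181327")
second = build_block(
    [("org,archive,web)/web/19961223174001/http://reference.com", "sha256:cccc")],
    prev_block=block_digest(first), created_at="20190112000000")

print(lookup(first, "org,archive,web)/web/19961219082428/http://sho.com"))
print(block_digest(first) in second)
```

Because each block embeds the digest of its predecessor, altering an earlier block changes its digest and breaks the reference held by every later block.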
Block approach (step 2):
Publish the block file at the Archival Fixity server
The blocks entrypoint, manifest.ws-dl.cs.odu.edu/blocks, always redirects to the latest published block
Block approach (step 3):
Push the blocks entrypoint into
multiple archives
https://manifest.ws-dl.cs.odu.edu/blocks
https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
https://perma.cc/8YG3-X7KN
• This will result in archiving the latest published block as well
Three steps to verify the fixity
of a memento
1. Discover a manifest/block
  • In the Atomic approach, this includes discovering the archived manifests
2. Compute current fixity information of the memento
3. Compare current fixity information with the discovered manifests/block
An example of discovering the latest manifest in the Archival Fixity server for the memento web.archive.org/web/20171115140705/http://rln.fm/:
$ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
Evaluation
• A data set of 1,000 mementos from the Internet Archive
• For each memento, we generated and disseminated 3 manifests
to 4 archives
• The average size of a manifest file is 1,157 bytes
• The manifest size is 2.79% of the size of the actual downloaded HTML
Increasing the number of records per block
reduces the block generation time
The Block approach creates fewer resources
in archives than the Atomic approach
• Given a collection of N = 1,000 mementos
• K = 4 web archives
• The selected block size B = 100 records per block
• The total number of resources created in the archives:
  • Atomic: N ∗ K = 4,000
  • Block: K ∗ (N/B) = 40
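The two counts can be checked with a short calculation. In this sketch the function names are illustrative, and rounding N/B up with `math.ceil` is an assumption for collections that do not divide evenly into blocks; the slide only covers the evenly divisible case.

```python
import math


def atomic_resources(n: int, k: int) -> int:
    """One archived manifest per memento per archive."""
    return n * k


def block_resources(n: int, k: int, b: int) -> int:
    """One archived block per B mementos per archive (last block may be partial)."""
    return k * math.ceil(n / b)


print(atomic_resources(1000, 4))      # 4000
print(block_resources(1000, 4, 100))  # 40
```

With these numbers the Block approach creates 100 times fewer archived resources, and the ratio grows with the block size B.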
Dissemination/download time
varies from one archive to another
It takes 1.25x, 4x, and 36x longer to disseminate a
manifest to perma.cc, archive.org, and
webcitation.org, respectively, than to archive.is
It takes 3.5x longer to disseminate a
manifest to archive.org than to perma.cc
Average time for discovering published
fixity information
The Block approach performs 4.46x faster than the
Atomic approach in verifying the fixity of mementos
• The fixity verification time includes:
  • Discovering manifests
  • Computing current fixity information
  • Downloading the archived manifests
  • Comparing results
• On average, the verification time of a memento is 6.65 seconds with the Atomic approach and 1.49 seconds with the Block approach
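The 4.46x figure in the slide title follows directly from the two reported averages, as this small check shows (the variable names are illustrative):

```python
atomic_seconds = 6.65  # average verification time, Atomic approach
block_seconds = 1.49   # average verification time, Block approach

speedup = atomic_seconds / block_seconds
print(round(speedup, 2))  # 4.46
```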
Conclusions
• The proposed approaches do not require any changes in the
infrastructure of web archives
• The Block approach creates fewer resources in archives and
reduces fixity verification time (4.46x faster than the Atomic
approach)
• The Atomic approach has the ability to verify the fixity of
archived pages even without using the Archival Fixity server
• Varying/increasing the block size could be one potential solution
to improve the Block approach performance and reduce the
number of resources created in archives
• Caching archived manifests/blocks in the Archival Fixity server
should improve the performance of both approaches
