Archive Assisted Archival Fixity Verification Framework
JCDL 2019, Urbana-Champaign, Illinois, June 2-6, 2019
Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle


The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from which the archived resources are requested. In this research, we propose two approaches, namely Atomic and Block, to establish and check fixity of archived resources.
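The fixity check described above comes down to comparing a hash recorded at capture time with a hash of what the archive returns later. The following is a minimal sketch of that comparison, not the authors' implementation: the function names and the sample HTML strings are hypothetical, and it hashes only the page body, whereas the framework's manifests also cover selected HTTP response headers.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a lowercase hex string."""
    return hashlib.sha256(content).hexdigest()


def fixity_unchanged(recorded_hash: str, current_content: bytes) -> bool:
    """True if the content replayed now still matches the recorded hash."""
    return sha256_hex(current_content) == recorded_hash.lower()


# Hypothetical page bodies; a real check would download the memento instead.
captured = b"<html><body>CO2 reading at capture time</body></html>"
replayed = b"<html><body>CO2 reading after tampering</body></html>"

# Fixity information computed once, when the memento was captured.
recorded = sha256_hex(captured)

print(fixity_unchanged(recorded, captured))  # same bytes, hashes match
print(fixity_unchanged(recorded, replayed))  # altered bytes, hashes differ
```

In the proposed approaches the recorded hash would come from a manifest or block retrieved from a different archive than the one serving the memento, so the party being checked is not the one supplying the reference value.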


  1. Archive Assisted Archival Fixity Verification Framework. JCDL 2019, Urbana-Champaign, Illinois, June 2-6, 2019. Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. Old Dominion University, Department of Computer Science, Norfolk, Virginia 23529 USA.
  2. This is what https://climate.nasa.gov/vital-signs/carbon-dioxide/ looks like right now.
     Slide footer: Archive Assisted Archival Fixity Verification Framework · JCDL 2019 · June 4, 2019 · Urbana-Champaign, Illinois · @WebSciDL
  3. The Internet Archive allows us to view previous versions (mementos) of that page.
  4. http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  5. An archived page (memento) from July 2018: https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  6. The page is in other web archives.
     For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
     Typical archive URI construction: wayback.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide
     • webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/
     • wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     Mementos available across these archives: 4,172, 62, 3, 13, 3, and 39.
  7. The web page was archived by Michael’s Evil Wayback in July 2017: Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  8. Replaying the same memento in October 2017, we got a different CO2 value: Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  9. Which one is the real memento? July 2017 or October 2017?
     • How can we ensure that a memento has remained unaltered since the time it was captured by the archive?
     Both replays come from the same URI: Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  10. Cryptographic hashes to create fixity information
      • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output
      SHA256(HTML) = 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
      SHA256(HTML) = 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
  11. Compute a hash value on the downloaded HTML
      $ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
      17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
      The curl command downloads the page; shasum computes the SHA256 hash.
  12. Timeline:
      • August 2017: HTML content is downloaded. SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
      • September 2017: The archived page has been tampered with by changing the value of CO2
      • October 2017: HTML content is downloaded. SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
      To verify the fixity, compare the current hash with the previously calculated hash. Hashes are NOT identical → the page has changed!
      http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  13. Two approaches for verifying the fixity of archived web pages (use web archives to monitor web archives)
      • The Atomic approach
        • Generate a manifest file (a JSON file containing the fixity information) for each memento
        • Publish the manifest at a well-known web location
        • Disseminate the published manifest to several archives
      • The Block approach
        • Batch together fixity information of multiple mementos in a single binary-searchable file (or block)
        • Publish the block at a well-known web location
        • Disseminate the published block to several archives
  14. Atomic approach (step 1): Push a web page into multiple archives
      Web page: https://2019.jcdl.org/
      This results in creating multiple mementos of the web page:
      • https://archive.is/20181224085310/https://2019.jcdl.org/
      • https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
      • http://www.webcitation.org/74tsy6pU0
  15. Atomic approach (steps 2 & 3): For each memento, compute fixity “manifest” and publish it on the web at a well-known Archival Fixity server
      • manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
      • In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
      • Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
  16. The manifest’s generic URI always redirects to the most recent time-specific manifest version (trusty URI)
      Generic URIs of the four manifests:
      • manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
      Following the redirect from the manifest’s generic URI to the manifest’s trusty URI:
      $ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
      HTTP/2 302
      location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      HTTP/2 200
      The structure of generic URIs is easy to remember: <Archival-Fixity-Server>/<URI to memento>. So they can be used to look up manifests in both the Archival Fixity server and archives.
  17. A manifest file example
      • Including how hashes are computed
      • Hashes are computed on only the base HTML file
      • Compute fixity on things that should not change, like certain original HTTP response headers
      • The "@id" value is the manifest’s trusty URI
      {
        "@context": "http://manifest.ws-dl.cs.odu.edu/",
        "created": "Sun, 23 Dec 2018 11:43:55 GMT",
        "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
        "uri-r": "https://2019.jcdl.org/",
        "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
        "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
        "http-headers": {
          "Content-Type": "text/html; charset=UTF-8",
          "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
          "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
          "Preference-Applied": "original-links, original-content"
        },
        "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
        "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
      }
  18. Atomic approach (step 4): Push manifests into multiple archives
      • In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives including IA
      • An attacker would have to hack a majority of 4 domains (archives)
      Manifest pushed: manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      Archived copies of the manifest:
      • https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • http://www.webcitation.org/74tvdsyxe
  19. Block approach (step 1): Batch together fixity information of multiple mementos in a single file (block)
      • Adding additional metadata (e.g., created_at, fields, …)
      • The hash of the previous block must be added
      !context ["http://oduwsdl.github.io/contexts/fixity"]
      !fields {keys: ["surt"]}
      !id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
      !meta {created_at: "20190111181327"}
      !meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
      !meta {type: "FixityBlock"}
      org,archive,web)/web/19961022175434/http://search.com
      org,archive,web)/web/19961219082428/http://sho.com
      org,archive,web)/web/19961223174001/http://reference.com
      …
  20. Block approach (step 2): Publish the block file at the Archival Fixity server. The blocks entrypoint, manifest.ws-dl.cs.odu.edu/blocks, always redirects to the latest published block.
  21. Block approach (step 3): Push the blocks entrypoint into multiple archives
      Blocks entrypoint: https://manifest.ws-dl.cs.odu.edu/blocks
      • Will result in archiving the latest published block as well
      • https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
      • https://perma.cc/8YG3-X7KN
  22. Three steps to verify the fixity of a memento
      1. Discover a manifest/block
         • In the Atomic approach, this includes discovering the archived manifests
      2. Compute current fixity information of the memento
      3. Compare current fixity information with the discovered manifests/block
      An example of discovering the latest manifest in the Archival Fixity server for the memento web.archive.org/web/20171115140705/http://rln.fm/:
      $ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
      HTTP/2 302
      location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
  23. Evaluation
      • A data set of 1,000 mementos from the Internet Archive
      • For each memento, we generated and disseminated 3 manifests to 4 archives
      • The average size of a manifest file is 1,157 bytes
      • The manifest size represents 2.79% of the size of the downloaded HTML
  24. Increasing the number of records per block reduces the block generation time
  25. The Block approach creates fewer resources in archives than the Atomic approach
      • Given a collection of N = 1,000 mementos
      • K = 4 web archives
      • The selected block size B = 100 records per block
      • The total number of resources created in the archives:
        • Atomic (N ∗ K) = 4,000
        • Block (K ∗ (N/B)) = 40
  26. Dissemination/download time varies from one archive to another
  27. It takes 1.25x, 4x, and 36x longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is
  28. It takes 3.5x longer to disseminate a manifest to archive.org than to perma.cc
  29. Average time for discovering published fixity information
  30. The Block approach performs 4.46x faster than the Atomic approach in verifying the fixity of mementos
      • The fixity verification time includes:
        • Discovering manifests
        • Computing current fixity information
        • Downloading the archived manifests
        • Comparing results
      • On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach
  31. Conclusions
      • The proposed approaches do not require any changes in the infrastructure of web archives
      • The Block approach creates fewer resources in archives and reduces fixity verification time (4.46x faster than the Atomic approach)
      • The Atomic approach has the ability to verify the fixity of archived pages even without using the Archival Fixity server
      • Varying/increasing the block size could be one potential solution to improve the Block approach performance and reduce the number of resources created in archives
      • Caching archived manifests/blocks in the Archival Fixity server should improve the performance of both approaches
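The resource counts on slide 25, and the conclusion that a larger block size reduces the number of resources created in archives, can be reproduced with a few lines of arithmetic. This is a rough sketch: the function names are hypothetical, and rounding N/B up to whole blocks is an assumption for collections that do not divide evenly (the slides only show the exact case N/B).

```python
import math


def atomic_resources(n_mementos: int, k_archives: int) -> int:
    """Atomic approach: one manifest per memento, pushed to every archive (N * K)."""
    return n_mementos * k_archives


def block_resources(n_mementos: int, k_archives: int, block_size: int) -> int:
    """Block approach: one block per B records, pushed to every archive (K * N/B).

    The ceiling is an assumption so a partially filled last block still counts.
    """
    return k_archives * math.ceil(n_mementos / block_size)


# Values from the evaluation: N = 1,000 mementos, K = 4 archives, B = 100 records.
print(atomic_resources(1000, 4))      # 4000
print(block_resources(1000, 4, 100))  # 40

# Larger blocks mean fewer archived resources for the same collection.
for b in (50, 100, 500):
    print(b, block_resources(1000, 4, b))
```

The sketch covers only the number of archived resources; the measured dissemination and verification times depend on each archive's behavior and are not modeled here.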
