Archive Assisted Archival Fixity Verification Framework

The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure an archived resource has remained unaltered since the time it was captured. Some web archives do not allow users to access fixity information and, more importantly, even if fixity information is available, it is provided by the same archive from which the archived resources are requested. In this research, we propose two approaches, namely Atomic and Block, to establish and check fixity of archived resources.

  1. Archive Assisted Archival Fixity Verification Framework. JCDL 2019, Urbana-Champaign, Illinois, June 2-6, 2019. Mohamed Aturban, Sawood Alam, Michael L. Nelson, and Michele C. Weigle. Old Dominion University, Department of Computer Science, Norfolk, Virginia 23529 USA
  2. This is what https://climate.nasa.gov/vital-signs/carbon-dioxide/ looks like right now. Archive Assisted Archival Fixity Verification Framework · JCDL 2019 · June 4, 2019 · Urbana-Champaign, Illinois · @WebSciDL
  3. The Internet Archive allows us to view previous versions (mementos) of that page
  4. http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  5. An archived page (memento) from July 2018: https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  6. The page is in other web archives
     • Typical archive URI construction: wayback.example.org/SomeString/climate.nasa.gov/vital-signs/carbon-dioxide
     • webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/
     • wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
     • Mementos available (counts as listed on the slide): 4,172; 62; 3; 13; 3; 39
     • For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
  7. The web page is archived by Michael’s Evil Wayback in July 2017: Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  8. Replaying the same memento in October 2017, we got a different CO2 value: Michaelsevilwayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  9. Which one is the real memento: the July 2017 replay or the October 2017 replay of Michael_evil_wayback/web/20170717185130/https://climate.nasa.gov/vital-signs/carbon-dioxide/?
     • How can we ensure that a memento has remained unaltered since the time it was captured by the archive?
  10. Cryptographic hashes to create fixity information
      • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output
      • SHA256(HTML): 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
      • SHA256(HTML): 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
  11. Compute a hash value on the downloaded HTML
      • Download the page (curl) and compute the SHA256 hash (shasum):
        $ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
        17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
  12. To verify the fixity, compare the current hash with the previously calculated hash
      • August 2017: HTML content is downloaded. SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
      • September 2017: The archived page has been tampered with by changing the value of CO2
      • October 2017: HTML content is downloaded. SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
      • Hashes are NOT identical → the page has changed!
      • http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  13. Two approaches for verifying the fixity of archived web pages (use web archives to monitor web archives)
      • The Atomic approach
        • Generate a manifest file (a JSON file containing the fixity information) for each memento
        • Publish the manifest at a well-known web location
        • Disseminate the published manifest to several archives
      • The Block approach
        • Batch together fixity information of multiple mementos in a single binary-searchable file (or block)
        • Publish the block at a well-known web location
        • Disseminate the published block to several archives
  14. Atomic approach (step 1): Push a web page into multiple archives
      • Web page: https://2019.jcdl.org/
      • This results in creating multiple mementos of the web page:
        • https://archive.is/20181224085310/https://2019.jcdl.org/
        • https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
        • https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
        • http://www.webcitation.org/74tsy6pU0
  15. Atomic approach (steps 2 & 3): For each memento, compute a fixity “manifest” and publish it on the web at a well-known Archival Fixity server
      • manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
      • manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
      • In this example, https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
      • Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
  16. The manifest’s generic URI always redirects to the most recent time-specific manifest version (trusty URI)
      [Diagram: the four generic manifest URIs from slide 15]
        $ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
        HTTP/2 302
        location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
        HTTP/2 200
      • The URI passed to curl is the manifest’s generic URI; the location header carries the manifest’s trusty URI
      • The structure of generic URIs is easy to remember: <Archival-Fixity-Server>/<URI to memento>
      • So they can be used to look up manifests in both the Archival Fixity server and archives
  17. A manifest file example
      {
        "@context": "http://manifest.ws-dl.cs.odu.edu/",
        "created": "Sun, 23 Dec 2018 11:43:55 GMT",
        "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
        "uri-r": "https://2019.jcdl.org/",
        "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
        "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
        "http-headers": {
          "Content-Type": "text/html; charset=UTF-8",
          "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
          "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
          "Preference-Applied": "original-links, original-content"
        },
        "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
        "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
      }
      • The "@id" value is the manifest’s Trusty URI
      • The manifest includes how the hashes are computed
      • Hashes are computed on only the base HTML file
      • Fixity is also computed on things that should not change, such as certain original HTTP response headers
  18. Atomic approach (step 4): Push manifests into multiple archives
      • In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives, including IA
      • An attacker would have to hack a majority of the 4 domains (archives)
      • Manifest being pushed: manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
      • Archived copies of the manifest:
        • https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
        • https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
        • https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
        • http://www.webcitation.org/74tvdsyxe
  19. Block approach (step 1): Batch together fixity information of multiple mementos in a single file (block)
      • Additional metadata is added (e.g., created_at, fields, …)
      • The hash of the previous block must be added
        !context ["http://oduwsdl.github.io/contexts/fixity"]
        !fields {keys: ["surt"]}
        !id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
        !meta {created_at: "20190111181327"}
        !meta {prev_block: "sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
        !meta {type: "FixityBlock"}
        org,archive,web)/web/19961022175434/http://search.com
        org,archive,web)/web/19961219082428/http://sho.com
        org,archive,web)/web/19961223174001/http://reference.com
        …
  20. Block approach (step 2): Publish the block file at the Archival Fixity server
      • The blocks entrypoint, manifest.ws-dl.cs.odu.edu/blocks, always redirects to the latest published block
  21. Block approach (step 3): Push the blocks entrypoint into multiple archives
      • Blocks entrypoint: https://manifest.ws-dl.cs.odu.edu/blocks
      • Archived copies:
        • https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
        • https://perma.cc/8YG3-X7KN
      • This will result in archiving the latest published block as well
  22. Three steps to verify the fixity of a memento
      1. Discover a manifest/block (in the Atomic approach, this includes discovering the archived manifests)
      2. Compute current fixity information of the memento
      3. Compare current fixity information with the discovered manifests/block
      • An example of discovering the latest manifest in the Archival Fixity server for the memento web.archive.org/web/20171115140705/http://rln.fm/:
        $ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
        HTTP/2 302
        location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
  23. Evaluation
      • A data set of 1,000 mementos from the Internet Archive
      • For each memento, we generated and disseminated 3 manifests to 4 archives
      • The average size of a manifest file is 1,157 bytes
      • The manifest size is 2.79% of the size of the downloaded HTML
  24. Increasing the number of records per block reduces the block generation time
  25. The Block approach creates fewer resources in archives than the Atomic approach
      • Given a collection of N = 1,000 mementos
      • K = 4 web archives
      • The selected block size B = 100 records per block
      • The total number of resources created in the archives:
        • Atomic (N ∗ K) = 4,000
        • Block (K ∗ (N/B)) = 40
  26. Dissemination/download time varies from one archive to another
  27. It takes 1.25x, 4x, and 36x longer to disseminate a manifest to perma.cc, archive.org, and webcitation.org, respectively, than to archive.is
  28. It takes 3.5x longer to disseminate a manifest to archive.org than to perma.cc
  29. Average time for discovering published fixity information
  30. The Block approach performs 4.46x faster than the Atomic approach in verifying the fixity of mementos
      • The fixity verification time includes:
        • Discovering manifests
        • Computing current fixity information
        • Downloading the archived manifests
        • Comparing results
      • On average, the verification time of a memento is 6.65 seconds with the Atomic approach and 1.49 seconds with the Block approach
  31. Conclusions
      • The proposed approaches do not require any changes in the infrastructure of web archives
      • The Block approach creates fewer resources in archives and reduces fixity verification time (4.46x faster than the Atomic approach)
      • The Atomic approach can verify the fixity of archived pages even without using the Archival Fixity server
      • Varying/increasing the block size could be one potential solution to improve the Block approach performance and reduce the number of resources created in archives
      • Caching archived manifests/blocks in the Archival Fixity server should improve the performance of both approaches
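
The hash comparison described on slides 10 to 12 can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function names `sha256_hex` and `fixity_unchanged` and the two sample HTML strings are hypothetical, and the inputs are in-memory bytes rather than a live download.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(content).hexdigest()


def fixity_unchanged(recorded_hash: str, current_content: bytes) -> bool:
    """Compare a previously recorded hash with the hash of the current bytes."""
    return sha256_hex(current_content) == recorded_hash


# Hypothetical page bodies: the replayed copy differs by a single number.
captured = b"<html><body>CO2: 406 ppm</body></html>"
replayed = b"<html><body>CO2: 400 ppm</body></html>"

# Fixity information recorded at capture time.
recorded = sha256_hex(captured)

print(fixity_unchanged(recorded, captured))  # same bytes, same hash
print(fixity_unchanged(recorded, replayed))  # altered bytes, different hash
```

Because a small change in the input produces a very different digest, comparing the two hex strings is enough to detect that the replayed content differs from what was captured.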
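
The arithmetic on slides 25 and 30 can be checked with a short sketch. The function names are hypothetical, and rounding N/B up to a whole number of blocks is an assumption added here for collections that do not divide evenly; the slide itself uses N/B with N = 1,000 and B = 100.

```python
import math


def atomic_resources(n_mementos: int, k_archives: int) -> int:
    """Atomic approach: one manifest per memento in each archive (N * K)."""
    return n_mementos * k_archives


def block_resources(n_mementos: int, k_archives: int, block_size: int) -> int:
    """Block approach: K * (N / B), rounding up to whole blocks (assumption)."""
    return k_archives * math.ceil(n_mementos / block_size)


def speedup(atomic_seconds: float, block_seconds: float) -> float:
    """Ratio of average verification times, Atomic over Block."""
    return atomic_seconds / block_seconds


print(atomic_resources(1000, 4))        # N * K
print(block_resources(1000, 4, 100))    # K * (N / B)
print(round(speedup(6.65, 1.49), 2))    # ratio of the reported averages
```

With the values from the slides, this gives 4,000 resources for Atomic, 40 for Block, and a ratio of about 4.46 between the reported average verification times.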
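
Slide 18 notes that an attacker would have to compromise a majority of the archives holding copies of a manifest. One way to read that is as a majority vote over the hashes found in the archived manifest copies. The sketch below assumes a strict-majority rule and uses made-up hash strings; the slides do not specify the exact decision rule, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_hash(archived_hashes: List[str]) -> Optional[str]:
    """Return the hash reported by a strict majority of copies, else None."""
    if not archived_hashes:
        return None
    value, count = Counter(archived_hashes).most_common(1)[0]
    return value if count > len(archived_hashes) / 2 else None


def verify_against_archives(current_hash: str, archived_hashes: List[str]) -> bool:
    """True only if a majority of archived copies agree with the current hash."""
    agreed = majority_hash(archived_hashes)
    return agreed is not None and agreed == current_hash


# Made-up values: three archived manifest copies agree, one was altered.
copies = ["sha256:aaaa", "sha256:aaaa", "sha256:aaaa", "sha256:ffff"]
print(verify_against_archives("sha256:aaaa", copies))
print(verify_against_archives("sha256:ffff", copies))
```

Under this rule a 2-2 split among four copies yields no agreed hash, so verification fails rather than trusting either side.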
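
Slide 13 describes a block as a binary-searchable file, and slide 19 shows records keyed by SURT. A lookup over such a block might be sketched as follows. The layout assumed here (header lines starting with `!`, data lines sorted by key, key separated from a payload by one space) and the payload strings are illustrative assumptions, since the slide elides what follows each key; `find_record` is a hypothetical name.

```python
import bisect
from typing import List, Optional


def find_record(block_lines: List[str], surt_key: str) -> Optional[str]:
    """Binary-search the sorted data lines of a block for a SURT key."""
    # Skip empty lines and metadata lines such as "!meta {...}".
    records = [ln for ln in block_lines if ln and not ln.startswith("!")]
    # Assumed layout: "<key> <payload>"; the key is the text before the first space.
    keys = [rec.split(" ", 1)[0] for rec in records]
    index = bisect.bisect_left(keys, surt_key)
    if index < len(keys) and keys[index] == surt_key:
        return records[index]
    return None


# Keys taken from slide 19; the payloads after each key are made up.
block = [
    '!meta {type: "FixityBlock"}',
    "org,archive,web)/web/19961022175434/http://search.com payload-1",
    "org,archive,web)/web/19961219082428/http://sho.com payload-2",
    "org,archive,web)/web/19961223174001/http://reference.com payload-3",
]

print(find_record(block, "org,archive,web)/web/19961219082428/http://sho.com"))
print(find_record(block, "org,archive,web)/web/20190101000000/http://example.com"))
```

The sketch loads all lines into memory for brevity; a real block file could be searched by seeking within the file instead, which is what makes a sorted layout attractive for large blocks.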
