A Framework for Verifying the Fixity of Archived Web Resources

My PhD defense presentation slides, July 23, 2020.

A Framework for Verifying the Fixity of Archived Web Resources

  1. 1. PhD Dissertation Defense for: Mohamed Aturban. Advisor: Michele C. Weigle. Committee Members: Michele C. Weigle, Michael L. Nelson, Jian Wu, Sampath Jayarathna, and M'Hammed Abdous. A Framework for Verifying the Fixity of Archived Web Resources. Department of Computer Science, Norfolk, Virginia 23529 USA. July 23, 2020
  2. 2. Outline: Introduction and Motivation; Research Questions; Background; Sampling of Related Work; Changes in the Playback of Mementos; The Fixity Information Dissemination Framework; Contributions, Future Work, and Conclusions. @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  3. 3. This is what www.cnn.com looks like today
  4. 4. The Internet Archive (IA) allows us to view previous versions (mementos) of that page • IA is the world’s largest public web archive • It holds hundreds of billions of archived web pages https://web.archive.org/web/20130401000000*/http://www.cnn.com/
  5. 5. The CNN archived page from May 30, 2013 • Replaying this memento in 2018 • There was a thunderstorm in Atlanta, GA on May 30, 2013
  6. 6. When reloading (#1) the memento in the browser, the weather icon changed to “cloudy”
  7. 7. When reloading (#2) the memento in the browser, the weather icon changed to “partly sunny”
  8. 8. When reloading (#3) the memento in the browser, the weather icon changed to “partly sunny”
  9. 9. Replaying the same memento multiple times does not always produce the same results! • The changes in the playback of this memento are caused by JavaScript (JS) being executed on the client side (e.g., in the browser) • In this example, each time the JS is executed, it randomly loads one of the weather icons
  10. 10. Textbooks vs. archived pages: a book in a library vs. replayed mementos
  11. 11. This is what climate.nasa.gov/vital-signs/carbon-dioxide/ looks like today
  12. 12. This is what it looked like in July 2018: https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/ A memento created by the Internet Archive in July 2018. It is replayed now (2019).
  13. 13. The page in other web archives (number of mementos in each): • web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide (4,870) • archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/ (13) • wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (91) • perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (4) • arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide (3) • webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide (5). Typical archive URI construction: archive.example.org/archive-collection/climate.nasa.gov/vital-signs/carbon-dioxide. For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
  14. 14. What if we checked these archives? What if they all agree? Would you trust the results? • breitbart.com/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • infowars.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • MichaelsEvilWayback.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • InternetResearchAgency.ru/climate.nasa.gov/vital-signs/carbon-dioxide/
  15. 15. The web page was archived in July 2017 by Michael’sEvilWayback. Which one is the real memento: the one replayed in August 2017 or the one replayed in October 2017?
  16. 16. It is important to verify the fixity of archived resources. Evidentiary purposes in court cases: • Marten Transport v. Plattform Advertising (https://www.bloomberglaw.com/public/desktop/document/Marten_Transp_Ltd_v_PlattForm_Adver_Inc_No_142464JWL_2016_BL_1371?1462657373) • Telewizja Polska USA, Inc. v. Echostar Satellite Corp. (https://casetext.com/case/telewizja-polska-usa-4) • St. Luke’s Cataract & Laser Institute v. James C. Sanderson (https://caselaw.findlaw.com/us-11th-circuit/1351498.html). Preserving fake news and important news articles: • H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, 2017 (https://web.stanford.edu/~gentzkow/research/fakenews.pdf) • M. Rosenberg, “Trump Adviser Has Pushed Clinton Conspiracy Theories,” The New York Times, 2016 (https://www.nytimes.com/2016/12/05/us/politics/-michael-flynn-trump-fake-news-clinton.html). Providing information about certain incidents or crimes: • A. Bright, “Web evidence points to pro-Russia rebels in downing of MH17,” The Christian Science Monitor, 2014 (https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17) • https://www.newyorker.com/magazine/2015/01/26/cobweb. Preserving federal and government data: • The Data Refuge project is an attempt to preserve federal climate and environmental data (https://www.datarefuge.org) • The End of Term Web Archive preserves U.S. Government websites around every new presidential election (http://eotarchive.cdlib.org)
  17. 17. A disclaimer from the Internet Archive stating that the archive is not responsible for the reliability of the archived resources: https://archive.org/about/terms.php
  18. 18. Web pages change on the live web [Timeline: Live Web in May 2016, April 2017, April 2018]
  19. 19. Archives make copies of web pages [Timeline: Live Web and Archive in May 2016, April 2017, April 2018]
  20. 20. Do archived pages change? [Timeline: Live Web, Archive, and Replay in May 2016, April 2017, April 2018]
  21. 21. Do archived pages change? When replaying the archived page at different points in time, will we get the same content?
  27. 27. Do archived pages change? Our study shows that we are not always presented with the same archived content!
  31. 31. Research questions • RQ1: Can we identify and quantify the types of changes on the playback of mementos that prevent generating repeatable fixity information? • RQ2: Given the types of changes identified in the playback of mementos, what steps/guidelines should we follow in order to generate repeatable fixity information (defining an archive-aware fixity-based approach)? • RQ3: How can we store and retrieve fixity information independently from the web archives from which the associated mementos are preserved?
  33. 33. Generating cryptographic hash values (fixity information) • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output • SHA256 of the first input: 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa • SHA256 of the slightly changed input: 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
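The "small change in, large change out" behavior can be seen with a short sketch. The two input strings below are made up for illustration (they are not the inputs behind the hashes on the slide), and `differing_bits` is a hypothetical helper.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Two made-up inputs that differ in a single character.
a = sha256_hex("CO2 level: 410 ppm")
b = sha256_hex("CO2 level: 411 ppm")

print(a)
print(b)
print(differing_bits(a, b), "of 256 bits differ")
```

For a cryptographic hash, roughly half of the output bits are expected to flip, which is why a cryptographic hash is useful for detecting any change but says nothing about how large the change was.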
  34. 34. SimHash: a small change in the input → a small change in the output • Input 1: 'Klein et al. conducted a study on over one million references from scientific articles and found that 20% articles suffers from Reference Rot, referring to links to web resources that no longer exist or that have significantly modified content.' → SimHash 668c8cccd966a785 • Input 2: 'Klein et al. conducted a study on over one million references from scientific articles and found that 30% articles suffers from Reference Rot, referring to links to web resources that no longer exist or that have significantly modified content.' → SimHash 668c8cced966a785 • We can use SimHash to compare text-based files and pHash to compare images • https://github.com/leonsim/simhash • http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html • M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin, “Scholarly context not found: One in five articles suffers from reference rot,” PloS one, vol. 9, no. 12, 2014. e115253.
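To make the SimHash idea concrete, here is a minimal from-scratch sketch. It is not the leonsim/simhash library referenced on the slide: the tokenization (lowercased words) and the per-token hash (truncated MD5) are assumptions, so it will not reproduce the values 668c8cccd966a785 and 668c8cced966a785.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: every word votes +1/-1 on each bit of the fingerprint."""
    votes = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 when the positive votes win.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


template = (
    "Klein et al. conducted a study on over one million references from "
    "scientific articles and found that {}% articles suffers from Reference "
    "Rot, referring to links to web resources that no longer exist or that "
    "have significantly modified content."
)
text_20 = template.format(20)
text_30 = template.format(30)

print(f"{simhash(text_20):016x}")
print(f"{simhash(text_30):016x}")
print("Hamming distance:", hamming(simhash(text_20), simhash(text_30)))
```

Because only one token differs between the two inputs, each bit's vote total moves by at most two, so only bits that were nearly tied can flip. That is the property that makes the Hamming distance between fingerprints usable as a similarity measure.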
  35. 35. An example of a binary hash tree (or Merkle tree) https://brilliant.org/wiki/merkle-tree/ • A leaf node = the hash of a block of data • A non-leaf node = the hash of its children
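The two rules above (a leaf is the hash of a block, a non-leaf is the hash of its children) can be sketched as follows. Duplicating the last node on an odd-sized level is an assumption made here for illustration; it is one common convention, and the slides do not say which one is used.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list) -> str:
    """Return the hex root of a binary hash tree built over byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    # Leaf nodes: the hash of each block of data.
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pair an unpaired last node with itself.
            level.append(level[-1])
        # Non-leaf nodes: the hash of the two concatenated children.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


blocks = [b"base HTML", b"style.css", b"app.js", b"logo.png"]
print(merkle_root(blocks))
```

Changing any single block changes its leaf, every node on the path up to the root, and therefore the root itself, so one value can stand in for all the resources of a page.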
  36. 36. Generate hashes on a web page • Compute a hash value on the downloaded HTML content: $ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256 → 17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d • curl downloads the page; shasum computes the SHA256 hash
  37. 37. Verifying the fixity of a web page • Compare the current hash with the previously calculated hash (the fixity information) • August 2017: HTML content is downloaded; SHA256 hash = e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521 • September 2017: the archived page is tampered with by changing the value of CO2 • October 2017: HTML content is downloaded; SHA256 hash = fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790 • Hashes are NOT identical → the page has changed!
  38. 38. Verifying the fixity of a web page • Compare the current hash with the previously calculated hash • Hashes are NOT identical → the page has changed! • Users of web archives do not have the ability to easily verify the fixity of mementos • Most web archives do not allow accessing fixity information • Even if fixity information is available, it is not from an independent archive or service
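The compare-with-a-previous-hash check can be sketched in a few lines. The function names and the two HTML snippets are hypothetical. `download` mirrors the `curl` step and is defined but not called here, so the example runs without network access.

```python
import hashlib
from urllib.request import urlopen


def sha256_of(content: bytes) -> str:
    """Hex SHA-256 of a downloaded entity body."""
    return hashlib.sha256(content).hexdigest()


def download(uri: str) -> bytes:
    """Fetch a page body (the equivalent of `curl -s <uri>`)."""
    with urlopen(uri, timeout=30) as response:
        return response.read()


def fixity_unchanged(content: bytes, recorded_hash: str) -> bool:
    """True when the current content still matches the recorded hash."""
    # Accept hashes written in space-separated groups, as on the slides.
    return sha256_of(content) == recorded_hash.replace(" ", "").lower()


# Made-up page bodies standing in for two replays of the same memento.
august = b"<html><body>CO2: 408 ppm</body></html>"
october = b"<html><body>CO2: 208 ppm</body></html>"

recorded = sha256_of(august)                 # fixity information stored in August
print(fixity_unchanged(august, recorded))    # True
print(fixity_unchanged(october, recorded))   # False
```

The check is only as trustworthy as the place where `recorded` is kept, which is the motivation for storing fixity information outside the archive being verified.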
  39. 39. What if an image has changed? • Computing hashes on only HTML content will NOT detect changes
  40. 40. Potential solution: include all resources in the hash calculation • A composite memento consists of 201 images, 19 JavaScript files, 3 CSS files, and the base HTML file, which are combined into a single aggregated hash value • It turns out that it is hard to get repeatable hashes on composite mementos • www.gwern.net/Timestamping (existing tools for generating a hash value on a composite archived page) • https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/ • https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html • http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  41. 41. Archives add banners • To convey information like the number of mementos and to inform users that what they are viewing is from the archive • Banners change → different hashes • Example: http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk replayed in 2016 (43 mementos) and replayed in 2017 (49 mementos)
  42. 42. Archives transform original content to appropriately replay mementos in a user’s browser • Add banners • Rewrite links to point to the archive, not to the live web • Add HTML tags to convey metadata • Archives use one of the Wayback Machine’s implementations to replay mementos • https://archive.org/web/ • https://github.com/iipc/openwayback/wiki • https://github.com/ikreymer/pywb
  43. 43. Rewriting original content by archives’ replay tools • The example page contains an image and a CSS file • The page is captured by the Internet Archive: https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
  44. 44. Original content vs. rewritten content (https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html) • Add banners
  45. 45. Original content vs. rewritten content (https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html) • Add banners • Rewrite links. Example: https://www.odu.edu/images/logo-university.png becomes https://web.archive.org/web/20190725212938im_/https://www.odu.edu/images/logo-university.png
  46. 46. Original content vs. rewritten content (https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html) • Add banners • Rewrite links • Add HTML tags
  47. 47. Raw mementos • Many archives allow accessing original, or raw, archived content • E.g., using the option id_ after the timestamp: https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html
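Turning a rewritten URI-M into its raw counterpart is a small string transformation. The sketch below assumes Wayback-style URI-Ms in which a 14-digit timestamp, optionally followed by a two-letter modifier such as `im_`, precedes the original URI. `raw_urim` is a hypothetical helper, and archives with other URI layouts would need different handling.

```python
import re

# prefix / 14-digit timestamp / optional modifier (e.g. "im_") / original URI
_URIM = re.compile(r"^(?P<prefix>.*?/)(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<urir>.+)$")


def raw_urim(urim: str) -> str:
    """Rewrite a Wayback-style URI-M so that it requests the raw memento."""
    match = _URIM.match(urim)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {urim}")
    return f"{match['prefix']}{match['ts']}id_/{match['urir']}"


print(raw_urim(
    "https://web.archive.org/web/20190725212938/"
    "https://maturban.github.io/playground/index.html"
))
print(raw_urim(
    "https://web.archive.org/web/20190725212938im_/"
    "https://www.odu.edu/images/logo-university.png"
))
```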
  48. 48. We need an archive-aware hashing function suitable for mementos. Can we get a repeatable hash value despite JavaScript, archive transformations, and security threats (e.g., Michael’sEvilWayback)?
  50. 50. TimeMap • Defined by the Memento framework (an Internet RFC) • A TimeMap for an Original Resource is “a resource from which a list of URIs of Mementos of the Original Resource is available.” • Example: the TimeMap of the resource climate.nasa.gov/vital-signs/carbon-dioxide/ has three mementos (May 2016, April 2017, April 2018)
  51. 51. Memento aggregators • Aggregate TimeMaps of an Original Resource from multiple archives into a single TimeMap • LANL Memento aggregator: http://mementoweb.org/depot/ • MemGator: https://github.com/oduwsdl/MemGator
  52. 52. Downloading the TimeMap of climate.nasa.gov/vital-signs/carbon-dioxide/ using MemGator: http://timetravel.mementoweb.org/timemap/link/climate.nasa.gov/vital-signs/carbon-dioxide/ • Mementos per archive: web.archive.org (4,870), archive.is (13), wayback.archive-it.org (91), perma-archives.org (4), arquivo.pt (3), webarchive.loc.gov (5)
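A per-archive count like the one above can be derived from a link-format TimeMap. The sketch below works on a small inline sample rather than a real aggregator response: the `rel` values, datetimes, and the arquivo.pt entry are invented for illustration, and the parser assumes `rel` is the first attribute after each URI, which the format does not require in general.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical TimeMap excerpt in application/link-format.
SAMPLE_TIMEMAP = """\
<https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="original",
<https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="first memento"; datetime="Mon, 17 Jul 2017 18:46:43 GMT",
<https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="memento"; datetime="Thu, 26 Jul 2018 02:55:37 GMT",
<https://arquivo.pt/wayback/20180801000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="last memento"; datetime="Wed, 01 Aug 2018 00:00:00 GMT"
"""

_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*)"')


def memento_uris(timemap: str) -> list:
    """Return the URI-Ms listed in a link-format TimeMap."""
    return [uri for uri, rel in _LINK.findall(timemap) if "memento" in rel.split()]


def mementos_per_archive(timemap: str) -> Counter:
    """Count mementos by archive host name."""
    return Counter(urlparse(uri).netloc for uri in memento_uris(timemap))


print(mementos_per_archive(SAMPLE_TIMEMAP))
```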
  54. 54. Sampling of Related Work, grouped on the slide under four headings (Security; Standards and other systems; Trusted timestamping; Identity derived from content): • TRAC (2007), establishing trusted archives: not for playback • Lerner et al. (2017), vulnerabilities: discovered four vulnerabilities in the Internet Archive’s Wayback Machine • J. Cushman et al. (2017), more potential threats: demonstrate potential threats in web archives • Rosenthal et al. (2005), threats: described several threats against digital preservation systems • Juan Benet (2017), Multihash: self-identifying hashes for IPFS • OriginStamp, Gipp (2015, 2016), trusted timestamps in Blockchain: not suitable for composite mementos • T. Kuhn et al. (2014), Trusty URI: a URI that contains a hash value of the content it identifies • P. Maniatis et al. (2005), distributed copies of archived resources (LOCKSS): the scope and content are narrowly defined • opentimestamps.org (2017), OpenTimestamps: not suitable for composite mementos • chainpoint.org (2017), Chainpoint: not suitable for composite mementos • Collomosse et al. (2018), ARCHANGEL: for mementos, but not suitable for composite mementos • Hamano et al. (2005), Git, distributed version control: uses hash values to create commit identifiers • Web archives such as webcitation.org and archive.is use hash values in URIs to identify mementos • Brunelle (2010), live web leakage in archives: describes how live web leakage changes the representation of mementos • Rosenthal et al. (2005), requirements for establishing trusted digital preservation systems: not for playback • OAIS (2012), Reference Model for an Open Archival Information System (OAIS): not for playback
  56. 56. RQ1: Can we identify and quantify the types of changes on the playback of mementos that prevent generating repeatable fixity information? Identifying and quantifying changes on the playback of mementos in five steps: (1) Collect a dataset of mementos (2) Download rewritten/raw composite mementos (3) Generate aggregated hash values (4) Identify changes (5) Present results. The mementos were downloaded 39 times. M. Aturban, M. L. Nelson, and M. C. Weigle, “It is hard to compute fixity on archived web pages,” in Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held in conjunction with the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018.
  57. 57. Collecting 16,627 mementos. M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019. Sources of URI-Rs: • The HTTP Archive: httparchive.org • The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/ • Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079) • The Moz Top 500 Websites: moz.com/top500
  58. 58. Downloading mementos using Headless Chrome • Squidwarc (https://github.com/N0taN3rd/Squidwarc) loads a URI-M, e.g., http://web.archive.org/web/19961120150251/http://www.usnews.com:80/, and writes the result to rewritten.warc
  59. 59. Extract all URI-Ms from rewritten.warc by reading WARC records using the tool warcio (https://github.com/webrecorder/warcio)
  60. 60. Requesting the raw mementos (using id_) of the extracted URI-Ms and writing them to raw.warc • ✓ = 200 OK (or archival 4xx/5xx) • x = archive-specific resources or 3xx redirects
  61. 61. Generate the aggregated hash with Merkle trees
  62. 62. Identifying types of changes on the playback of mementos • Set: One or more resources in the set comprising a composite memento has changed • Status code: The HTTP status code of one or more resources comprising a composite memento has changed • HTTP headers: One or more HTTP response headers that we do not expect to change has changed • Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed • URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime
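One way to read the five change types is as a comparison between two downloads of the same composite memento. The sketch below is an illustration under assumed data structures, not the code used in the study: the `Capture` record, its field names, and the example URI-Ms are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """What one download observed for a single embedded resource."""
    status: int
    entity_hash: str
    headers: tuple = ()      # (name, value) pairs that are expected to stay fixed
    redirect_to: str = ""    # URI-M a 3xx response points to, if any


def change_types(before: dict, after: dict) -> set:
    """Compare two downloads, each a mapping of requested URI-M -> Capture."""
    types = set()
    if before.keys() != after.keys():
        types.add("Set")
    for urim in before.keys() & after.keys():
        old, new = before[urim], after[urim]
        if old.status != new.status:
            types.add("Status code")
        if old.headers != new.headers:
            types.add("HTTP headers")
        if old.entity_hash != new.entity_hash:
            types.add("Representation")
        if old.redirect_to != new.redirect_to:
            types.add("URI-M")
    return types


page = "https://archive.example.org/20170101000000/http://example.org/"
img_a = "https://archive.example.org/20170101000000im_/http://example.org/a.jpg"
img_b = "https://archive.example.org/20170101000000im_/http://example.org/b.jpg"

download_1 = {page: Capture(200, "h1"), img_a: Capture(200, "h2")}
download_2 = {page: Capture(200, "h1"), img_b: Capture(404, "h3")}
print(change_types(download_1, download_2))   # {'Set'}
```

In the example, JavaScript picked a different image on the second download, so the set of requested URI-Ms differs even though the base page itself is unchanged.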
  63. 63. Set: One or more resources in the set comprising a composite memento has changed. Reload #1 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  64. 64. Set: One or more resources in the set comprising a composite memento has changed. Reload #2 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  65. 65. Set: One or more resources in the set comprising a composite memento has changed. Reload #3 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  66. 66. A resource selected randomly by JavaScript (Reload #3 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/): function random_imglink(){ myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg"; myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg"; myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg"; var ry=Math.floor(Math.random(1)*myimages.length) if (ry==0) ry=1 document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>') }
  67. 67. Status code: The HTTP status code of one or more resources comprising a composite memento has changed • Page: https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/ • Embedded image: https://web.archive.org/web/20141209193553im_/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg • The image returned 404 on one request and 200 on another
  68. 68. Status code: The HTTP status code of one or more resources comprising a composite memento has changed (same page and embedded image as on the previous slide; 404 vs. 200) • Observations change archives: WARC/1.0 WARC-Type: response WARC-Target-URI: https://web.archive.org/save/_embed/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg WARC-Date: 2017-11-18T10:33:14Z … HTTP/1.1 200 OK Date: Sat, 18 Nov 2017 10:32:51 GMT Content-Type: image/jpeg Content-Location: /web/20171118103250/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
  69. 69. HTTP headers: One or more HTTP response headers that we do not expect to change has changed • https://web.archive.org/web/20071111211818/http://images.sohu.com:80/chat_online/market/sohu/140140-1.html replayed in 2017 vs. replayed in 2018
  70. 70. Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed • When requesting the raw version, a third-party service (Cloudflare) modifies the HTML • First request: curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)" → Date: Tue, 15 May 2018 21:00:45 GMT <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" … • Second request, five seconds later: curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)" → Date: Tue, 15 May 2018 21:00:50 GMT <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
  71. 71. URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime [Timeline: Live Web, Archive, and Replay in May 2016, April 2017, April 2018]
  74. 74. URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime [Timeline: the requested memento is unavailable (X), and the archive responds with a 302 redirect to another memento]
  76. 76. Example from https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/ • URI-M1 = web.archive.org/web/20110116134258id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G • URI-M2 = web.archive.org/web/20120121090532id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G • December 2017: requesting URI-M1 returns URI-M1 • March 2018: URI-M1 was NOT available, so requesting URI-M1 returns a 302 redirect to URI-M2
  78. 78. • URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/ • URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/ • December 12, 2017: requesting URI-M1 returns URI-M1 • December 25, 2017: URI-M1 was NOT available, so requesting URI-M1 returns a 302 redirect to URI-M2, which is a different image
  79. 79. 88.45% of 16,627 mementos produce at least two different hashes
  80. 80. 11.55% of mementos (1,920; blue) always produce the same hash, and 16.06% (2,670; red), about one in six, produce a different hash on each replay
  81. 81. The types of changes affecting mementos after each download
  82. 82. Migrated and missing mementos (11.91%). Our blog posts about the movements of mementos: • https://ws-dl.blogspot.com/2019/08/2019-08-30-where-did-archive-go-part1.html • https://ws-dl.blogspot.com/2019/09/2019-09-10-where-did-archive-go-part-2.html • https://ws-dl.blogspot.com/2019/09/2019-09-25-where-did-archive-go-part-3.html • https://ws-dl.blogspot.com/2019/10/2019-10-21-where-did-archive-go-part-4.html
  83. 83. Because most mementos produce multiple aggregated hash values over time, we introduce two additional hashing techniques • URI-M-based hashing technique: only the URI-Ms of mementos comprising a composite memento are used in the hash calculation • Entity-based hashing technique: only the HTTP entity bodies of mementos comprising a composite memento are used in the hash calculation
  84. 84. Complete hashing
  85. 85. URI-M-based hashing
  86. 86. Entity-based hashing
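The difference between the three techniques comes down to which parts of each resource feed the hash. The following sketch makes several assumptions: each resource is a dict with hypothetical keys `uri-m`, `status-code`, `headers`, and `entity`, and the aggregation is a hash over sorted per-resource digests rather than the Merkle tree used in the study. The example URI-Ms are made up.

```python
import hashlib


def _aggregate(items) -> str:
    """Hash each item, then hash the sorted digests (order-independent)."""
    digests = sorted(hashlib.sha256(item).digest() for item in items)
    return hashlib.sha256(b"".join(digests)).hexdigest()


def urim_based_hash(resources) -> str:
    """Only the URI-Ms contribute."""
    return _aggregate(r["uri-m"].encode("utf-8") for r in resources)


def entity_based_hash(resources) -> str:
    """Only the HTTP entity bodies contribute."""
    return _aggregate(r["entity"] for r in resources)


def complete_hash(resources) -> str:
    """URI-M, status code, selected headers, and entity body all contribute."""
    def serialize(r):
        head = "\n".join([r["uri-m"], str(r["status-code"]), *r["headers"]])
        return head.encode("utf-8") + b"\n\n" + r["entity"]
    return _aggregate(serialize(r) for r in resources)


base = "https://archive.example.org/warc/"
replay_1 = [
    {"uri-m": base + "20170101182814id_/http://example.org/", "status-code": 200,
     "headers": ["Content-Type: text/html"], "entity": b"<html>page</html>"},
    {"uri-m": base + "20170101182814id_/http://example.org/logo.png", "status-code": 200,
     "headers": ["Content-Type: image/png"], "entity": b"image-bytes"},
]
# Second replay: the image is served from a later memento with identical bytes.
replay_2 = [
    replay_1[0],
    {**replay_1[1], "uri-m": base + "20170619145458id_/http://example.org/logo.png"},
]

print(urim_based_hash(replay_1) == urim_based_hash(replay_2))      # False
print(entity_based_hash(replay_1) == entity_based_hash(replay_2))  # True
print(complete_hash(replay_1) == complete_hash(replay_2))          # False
```

In this made-up case a redirect to another memento changes the URI-M-based and complete hashes while the entity-based hash stays stable, which matches the intuition for why entity bodies repeat more often across downloads than URI-Ms do.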
  87. 87. Expected results: complete hashing on mementos from archive.org
  90. 90. Complete hashing on mementos from archive.org: new hash values calculated in each download (median = 871 hash values). Only 47% of the total number of hash values are seen in Download 1
  91. 91. URI-M-based hashing on mementos from archive.org: new URI-Ms requested in each download (median = 806 URI-Ms). Only 50% of the total number of URI-Ms are requested in Download 1
  92. 92. Entity-based hashing on mementos from archive.org: new entity bodies observed in each download (median = 116 entity bodies). About 80% of the total number of entity bodies are seen in Download 1
  93. 93. RQ2: Given the types of changes identified in the playback of mementos, what steps/requirements should we consider in order to generate repeatable fixity information (defining an archive-aware fixity-based approach)?
  94. 94. Guidelines for generating fixity information on the playback of mementos • We define these guidelines based on results from our 14-month study
  96. 96. RQ3: How can we store and retrieve fixity information independently from the web archives from which the associated mementos are preserved? M. Aturban, S. Alam, M. L. Nelson, and M. C. Weigle, “Archive Assisted Archival Fixity Verification Framework,” in Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019. 96 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  97. 97. Two approaches for disseminating and verifying the fixity of archived web pages (using web archives to monitor web archives) • The Atomic approach • Generate a manifest file (a JSON file containing the fixity information) for each memento • Publish the manifest at a well-known location • Disseminate the published manifest to several archives • The Block approach • Batch together fixity information of multiple mementos in a single binary-searchable file (or block) • Publish the block at a well-known location • Disseminate the published block to several archives 97 @maturban1 • @WebSciDL
  98. 98. { "@context": "http://oduwsdl.github.io/contexts/fixity", "created_at": "Wed, 08 Apr 2020 02:22:56 GMT", "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20200408022256/d5d...f04/https://web.archive.org/web/19970104075414/ http://www.un.org:80/", "composite-memento-uri-m-hash": "8bb453d8aa...db5f5fbf6", "composite-memento-entity-hash": "cdf47062a...3ebe030b0", "composite-memento-overall-hash": "69ca5a85...b9206930", "uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/", "resources": [ { "http-headers": ["X-Archive-Orig-last-modified", "Content-Type", "X-Archive-Orig-date", "Memento-Datetime"], "entity-phash": null, "entity-hash": "ba140a5bede7f10bea0...7514725862eda82a003", "overall-hash": "3e4133b3766c2a58d6f...23f5f95a206a1ba9878", "entity-simhash": 9695187482751709335, "uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/", "status-code": "200"}, { "http-headers": ["X-Archive-Orig-last-modified", "X-Archive-Orig-date", "Memento-Datetime", "Content-Type"], "entity-phash": "87d5798529a75a58", "entity-hash": "d9305a4da88570700d92...17c0c0f72b9f4e514b9", "overall-hash": "de002fbe292372199e6...6059044f4695f0dda2c", "entity-simhash": null, "uri-m": "https://web.archive.org/web/19970315165323im_/http://www.un.org/homepage.gif", "status-code": "200"}, { "http-headers": ["Content-Type", "Memento-Datetime"], "entity-phash": "219d1a8362a71040", "entity-hash": "d5fd59c929e1c62b17b...d5e321f64a919e4294e", "overall-hash": "7df9dde47fa5bab643...cef3c84694bb1db8b1c", "entity-simhash": null, "uri-m": "https://web.archive.org/web/20120129120857/http://web.archive.org/screenshot/http://www.un.org/", "status-code": "200"} ] } A manifest example containing fixity information 98
  99. 99. Atomic approach: Push manifests into multiple archives • In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives including IA • An attacker would have to hack a majority of 4 domains (archives) https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ http://www.webcitation.org/74tvdsyxe manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ 99
  100. 100. Block approach: Batch together fixity information of multiple mementos in a single file (block) • Adding additional metadata (e.g., created_at, fields, …) • The hash of the previous block must be added !context ["http://oduwsdl.github.io/contexts/fixity"] !fields {keys: ["surt"]} !id {uri: "https://manifest.ws-dl.cs.odu.edu/"} !meta {created_at: "20190111181327"} !meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"} !meta {type: "FixityBlock"} org,archive,web)/web/19961022175434/http://search.com org,archive,web)/web/19961219082428/http://sho.com org,archive,web)/web/19961223174001/http://reference.com … 100
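The chaining of blocks through `prev_block` can be sketched as follows. This is a minimal illustration, assuming the recorded hash covers the full text of the previous block; the helper names `block_digest` and `make_block`, the second timestamp, and the reduced set of metadata lines are hypothetical and do not reproduce the framework's actual serialization.

```python
import hashlib


def block_digest(block_text):
    """Return a 'sha256:<hex>' digest of a block's full text (assumed scope)."""
    return "sha256:" + hashlib.sha256(block_text.encode("utf-8")).hexdigest()


def make_block(records, created_at, prev_block_text=None):
    """Build a block: metadata lines first, then records sorted by key.

    Sorting the records keeps that section binary-searchable.
    """
    lines = [
        '!context ["http://oduwsdl.github.io/contexts/fixity"]',
        '!fields {keys: ["surt"]}',
        '!meta {created_at: "%s"}' % created_at,
    ]
    if prev_block_text is not None:
        # Link this block to its predecessor by recording the predecessor's hash.
        lines.append('!meta {prev_block: "%s"}' % block_digest(prev_block_text))
    lines.append('!meta {type: "FixityBlock"}')
    lines.extend(sorted(records))
    return "\n".join(lines) + "\n"


first = make_block(
    ["org,archive,web)/web/19961022175434/http://search.com"],
    "20190111181327",
)
second = make_block(
    ["org,archive,web)/web/19961219082428/http://sho.com"],
    "20190112000000",  # hypothetical timestamp for illustration
    prev_block_text=first,
)
```

Because `second` records the digest of `first`, any later modification of `first` no longer matches the `prev_block` value, which is what makes tampering with an earlier block detectable.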
  101. 101. Block approach: Push the blocks entrypoint into multiple archives manifest.ws-dl.cs.odu.edu/blocks https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01 • Will result in archiving the latest published block as well https://perma.cc/8YG3-X7KN 101 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  102. 102. Three steps to verify the fixity of a memento 1. Discover a manifest/block • In the Atomic approach, this includes discovering the archived manifests 2. Compute current fixity information of the memento 3. Compare current fixity information with the discovered manifests/block. $ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)" HTTP/2 302 location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/ An example of discovering the latest manifest in the Archival Fixity server for the memento: https://web.archive.org/web/20171115140705/http://rln.fm/ 102
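Steps 2 and 3 can be sketched without any network access, assuming the discovered manifests have already been downloaded and parsed into dictionaries. The field name comes from the manifest example on slide 98; the function name `verify_fixity` and the strict-majority rule are assumptions of this sketch, not the framework's documented behavior.

```python
from collections import Counter

# Field name taken from the manifest example; other hash fields could be used.
HASH_FIELD = "composite-memento-overall-hash"


def verify_fixity(current_hash, archived_manifests):
    """Compare a freshly computed hash with manifests discovered in archives.

    Returns True when a strict majority of the discovered manifests record
    the same hash as the one computed now (majority voting is an assumption).
    """
    if not archived_manifests:
        return False
    votes = Counter(m.get(HASH_FIELD) for m in archived_manifests)
    return votes[current_hash] > len(archived_manifests) / 2


# Toy manifests: two archives agree with the current hash, one does not.
toy_manifests = [
    {HASH_FIELD: "abc"},
    {HASH_FIELD: "abc"},
    {HASH_FIELD: "zzz"},
]
toy_result = verify_fixity("abc", toy_manifests)
```

Requiring agreement across several archived copies is one way to make use of the redundancy that dissemination provides: a single altered copy cannot change the outcome.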
  103. 103. Evaluation • A data set of 16K mementos from 17 public web archives • For each memento, we generated and disseminated a manifest to 3 archives - The median size of a composite memento is 1143.85 KB - The median size of a manifest file is 15.29 KB, which represents 1.33% of the size of a composite memento 103
  104. 104. Increasing the number of records per block reduces the block generation time 104
  105. 105. The Block approach creates fewer resources in archives than the Atomic approach • Given a collection of N = 16,608 mementos • Katomic = 3 web archives • Kblock = 2 web archives • The selected block size B = 1038 records per block • The total number of resources created in the archives by each approach: Atomic (N ∗ Katomic) = 49,824 Block (Kblock ∗ (N/B)) = 32 105 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
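The resource counts on this slide follow directly from the stated parameters. The short calculation below reproduces them; the variable names are chosen here for illustration.

```python
N = 16_608      # mementos in the collection
K_ATOMIC = 3    # archives that receive each manifest
K_BLOCK = 2     # archives that receive each block
B = 1_038       # records per block

blocks = N // B                      # number of blocks; N is an exact multiple of B
atomic_resources = N * K_ATOMIC      # one archived manifest per memento per archive
block_resources = K_BLOCK * blocks   # one archived block per block per archive

print(atomic_resources, block_resources)
```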
  106. 106. It takes 1.09X and 4.54X longer to disseminate a manifest to perma.cc and archive.org, respectively, than to archive.is 106 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  107. 107. It takes 9.2x longer to disseminate a block to archive.org than perma.cc 107 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  108. 108. The Block approach is 1.05X faster than the Atomic approach at verifying the fixity of mementos Discovering and downloading manifest files in the Atomic/Block approaches per archive Verifying mementos by both approaches 108
  109. 109. Outline Introduction and Motivation Research Questions Background Sampling of Related work Changes in the Playback of Mementos The Fixity Information Dissemination Framework Contributions, Future Work, and Conclusions 109 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  110. 110. Contributions • RQ1 - Four methods for collecting mementos (arXiv’19) - Identified and quantified types of changes on the playback of mementos (JCDL/WADL’18) - Showed examples of missing mementos • RQ2 - The two hashing techniques (URI-M-based and entity-based) - The archive-aware hashing function (arXiv’17) • RQ3 - ArchiveNow, a tool for disseminating web pages in public web archives (JCDL’18) - A framework for disseminating fixity information to web archives (JCDL’19) 110 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  111. 111. Future Work: investigating the use of web packaging to generate fixity information • Web packaging is an emerging standard • It should allow archives to deliver a composite memento in a single HTTP response or in a self-contained file • Using web packaging we can download a composite memento, packaged in a bundle, with a single HTTP request. This should reduce playback-related changes, such as transient errors and URI-M changes. 111 PhD Defense: Mohamed Aturban July 23, 2020
  112. 112. Conclusions • Conventional hashing techniques are not suitable for replayed archived web pages. • We defined an archive-aware hashing function that consists of multiple guidelines (based on our 14-month study on 16K mementos) • Fixity information includes (1) Multiple aggregated hash values generated using different hashing techniques (URI-M-based and entity-based hashing) (2) Multiple hash values generated on each resource comprising a composite memento • We introduce two approaches for disseminating fixity information to web archives 112 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  113. 113. The archive-aware hashing function 113 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  114. 114. PhD Dissertation Defense for: Mohamed Aturban Advisor: Michele C. Weigle Committee Members: Michele C. Weigle, Michael L. Nelson, Jian Wu, Sampath Jayarathna, and M'Hammed Abdous A Framework for Verifying the Fixity of Archived Web Resources Department of Computer Science Norfolk, Virginia 23529 USA July 23, 2020
  115. 115. Appendix
  116. 116. Collecting 16,627 Mementos M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019. • The HTTP Archive: httparchive.org • The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/ • Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079) • The Moz Top 500 Websites: moz.com/top500 Sources of URI-Rs: Collect a dataset of mementos 1 116
  117. 117. http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/ • Mementos from the National Library of Ireland (NLI) collection have been moved from collections.internetmemory.org/nli/ to wayback.archive-it.org/10702/ An example of a missing memento • The URI-M was 200 OK in September 2018 http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/ • The URI-M is now 404 Not Found 117 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  118. 118. 118 - The heatmap shows archive-level changes by comparing consecutive downloads of mementos - It visualizes the overall performance of each archive - It identifies points in time where major changes occur Present results 5
  119. 119. URI-Rs with different path lengths and URI-Ms with different Memento-Datetime 119 M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019. URI-Ms per year URI-Rs per path length Select a dataset of mementos 1
  120. 120. 120
  121. 121. Downloading the ZIP file of a memento at three different times. Each time the archive refers to itself differently in the index.html in the ZIP file. http://archive.is/download/BRWpm.zip http://archive.is/BRWpm Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed Identify changes 4 121
  122. 122. http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/ http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg Representation (transient errors) Identify changes 4 122
  123. 123. http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/ http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC/1.0 WARC-Type: response WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC-Date: 2017-12-07T10:04:18Z … Content-Length: 459640 HTTP/1.0 200 Content-Type: image/jpeg Content-Length: 642336 Date: Thu, 07 Dec 2017 10:04:18 GMT … The first Content-Length should be larger than the second one WARC/1.0 WARC-Type: response WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg WARC-Date: 2017-11-16T15:34:37Z … Content-Length: 643398 HTTP/1.0 200 Content-Type: image/jpeg Content-Length: 642336 Date: Thu, 16 Nov 2017 15:34:36 GMT … This is what it should look like Identify changes 4 Representation (transient errors) 123
  124. 124. 124 Identify changes 4 Archives react differently to requests for raw mementos
  125. 125. The Block approach performs 4.46x faster than the Atomic approach in verifying the fixity of mementos • The fixity verification time includes: - Discovering manifests - Computing current fixity information - Downloading the archived manifests - Comparing results • On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach @maturban1 • August 22, 2019 A Framework for Verifying the Fixity of Archived Web Resources
  126. 126. { "@context": "http://manifest.ws-dl.cs.odu.edu/", "created": "Sun, 23 Dec 2018 11:43:55 GMT", "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/2018121102034/https://2019.jcdl.org/", "uri-r": "https://2019.jcdl.org/", "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/", "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT", "http-headers": { "Content-Type": "text/html; charset=UTF-8", "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT", "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel="https://api.w.org/"", "Preference-Applied": "original-links, original-content" }, "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -", "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6" } A manifest file example • Including how hashes are computed • Hashes are computed on only base HTML file • Compute fixity on things that should not change like certain original HTTP response headers 126
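The `hash-constructor` in this manifest hashes the memento's entity body followed by selected original header values. A rough Python equivalent is sketched below, assuming the header values are joined with single spaces and appended directly to the body, as the `echo -n` in the shell pipeline suggests; the function name `manifest_hash` is hypothetical, and the entity body is passed in as bytes rather than fetched.

```python
import hashlib


def manifest_hash(entity_body, header_values):
    """Hash the entity body plus space-joined header values.

    entity_body: bytes of the memento's base HTML file.
    header_values: header values in the order named by the hash-constructor.
    Returns a string in the manifest's 'md5:<hex> sha256:<hex>' form.
    """
    payload = entity_body + " ".join(header_values).encode("utf-8")
    md5_hex = hashlib.md5(payload).hexdigest()
    sha256_hex = hashlib.sha256(payload).hexdigest()
    return "md5:%s sha256:%s" % (md5_hex, sha256_hex)


# Toy input for illustration only; not the real page content.
toy_hash = manifest_hash(
    b"<html></html>",
    ["text/html; charset=UTF-8", "Wed, 19 Dec 2018 10:20:36 GMT"],
)
```

Including header values that should never change for a given memento means that a change to either the body or those headers produces a different hash.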
  127. 127. 127 • Using web packaging we can download a composite memento, packaged in a bundle, with a single HTTP request. This should reduce playback-related changes, such as transient errors and URI-M changes.
  128. 128. 128
  129. 129. Memento framework • Uses time as a dimension to access the web by relating current web resources to their prior states • Is supported by most public web archives including the Internet Archive http://mementoweb.org/guide/quick-intro/ 129 @maturban1 • @WebSciDL
  130. 130. 130 URI-M-based hashing on mementos from archive-it.org Expected results Actual results
  131. 131. 131 New URI-Ms are requested in each download URI-M-based hashing on mementos from archive-it.org Actual results
  132. 132. 132 New URI-Ms are requested in each download URI-M-based hashing on mementos from archive-it.org Expected results Actual results
  133. 133. 133 Atomic approach (step 1): Push a web page into multiple archives https://archive.is/20181224085310/https://2019.jcdl.org/ https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ http://www.webcitation.org/74tsy6pU0 https://2019.jcdl.org/ This results in creating multiple mementos of the web page Archive Assisted Archival Fixity Verification Framework ∙ JCDL 2019 ∙ June 4, 2019 ∙ Urbana-Champaign, Illinois
  134. 134. Atomic approach (steps 2 & 3): For each memento, compute fixity “manifest” and publish it on the web at a well-known location (Archival Fixity server) manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0 • In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server • Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html 134
  135. 135. manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0 The manifest’s generic URI always redirects to the most recent time-specific manifest version (trusty URI) $ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)" HTTP/2 302 location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ HTTP/2 200 manifest’s trusty URI manifest’s generic URI • The structure of generic URIs is easy to remember: <Archival-Fixity-Server>/<URI to memento>, so they can be used to look up manifests in both the Archival Fixity server and archives 135 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
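Because a generic manifest URI is just the server's manifest prefix followed by the memento's URI-M, it can be built with simple string concatenation. The sketch below assumes the prefix shown in the curl example on this slide; the function name `generic_manifest_uri` is hypothetical.

```python
# Prefix taken from the curl example; treated here as a configurable default.
FIXITY_SERVER = "https://manifest.ws-dl.cs.odu.edu/manifest/"


def generic_manifest_uri(uri_m, server=FIXITY_SERVER):
    """Build <Archival-Fixity-Server>/<URI to memento> for a given URI-M."""
    return server.rstrip("/") + "/" + uri_m


toy_uri = generic_manifest_uri(
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"
)
```

Requesting the resulting URI is then expected to redirect to the latest trusty URI, as the `302` response in the example shows.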
  136. 136. Block approach (step 2): Publish the block file at the Archival Fixity server manifest.ws-dl.cs.odu.edu/blocks The blocks entrypoint always redirects to the latest published block 136
  137. 137. The dissemination/download time varies from one archive to another 137 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  138. 138. 138
  139. 139. Links 139 • Animated GIFs: https://github.com/oduwsdl/mementos-fixity/tree/master/hashing_techniques
  140. 140. Libyan food 140 Tajine Couscous https://cookingthyme.wordpress.com/2014/07/19/tajeen-jban- cheese-tagine-%D8%B7%D8%A7%D8%AC%D9%8A%D9%86- %D8%AC%D8%A8%D9%86/ https://www.daringgourmet.com/kusksu-libyan-couscous-with-spicy-beef-and-vegetables/ Baklawa https://www.pinterest.ie/pin/612067405578694390/
