
It is hard to compute fixity on archived web pages


Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL


It is hard to compute fixity on archived web pages

  1. 1. It is hard to compute fixity on archived web pages
     Mohamed Aturban
     Advisor: Michele C. Weigle; Co-advisor: Michael L. Nelson
     Old Dominion University, Web Science & Digital Libraries Research Group
     @WebSciDL, @maturban1
     Department of Computer Science, PhD Gathering, January 17, 2019
  2. 2. climate.nasa.gov/vital-signs/carbon-dioxide/ looks like this right now
  3. 3. The Internet Archive allows us to view previous versions (mementos) of that page
  4. 4. http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-
  5. 5. https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/ An archived page (memento) from July 2016
  6. 6. Live web page vs. archived web page: https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/ (July 2016) vs. https://climate.nasa.gov/vital-signs/carbon-dioxide/ (now)
  7. 7. Do web pages change on the live web? [Timeline figure: live web versions of climate.nasa.gov/vital-signs/carbon-dioxide/ in May 2016, April 2017, and April 2018]
  8. 8. Archives make copies of web pages [Timeline figure: Live Web and Archive rows; May 2016, April 2017, April 2018]
  9. 9. Do archived pages change? [Timeline figure: Live Web, Archive, and Replay rows; May 2016, April 2017, April 2018]
  10. 10. Do archived pages change? When replaying the archived page at different points in time, will we get the same content? [Timeline figure as on slide 9]
  11. 11. Do archived pages change? When replaying the archived page at different points in time, will we get the same content? [Timeline figure as on slide 9]
  12. 12. Do archived pages change? When replaying the archived page at different points in time, will we get the same content? [Timeline figure as on slide 9]
  13. 13. Do archived pages change? When replaying the archived page at different points in time, will we get the same content? [Timeline figure as on slide 9, with the label "400.15 ppm"]
  14. 14. Do archived pages change? Our study shows that we are not always presented with the same archived content! [Timeline figure as on slide 9, with the label "400.15 ppm" and a question mark on the replay row]
  15. 15. Cryptographic hashes to create fixity information
      • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output
      • [Figure: two inputs, each passed through SHA256, yield these digests]
        9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
        5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
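The sensitivity of a cryptographic hash to small input changes can be illustrated with a minimal Python sketch. The two input strings below are hypothetical stand-ins (they are not the inputs used on the slide), and `differing_hex_digits` is a helper introduced here only for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical inputs that differ in a single character.
original = b"Carbon dioxide: 400.15 ppm"
altered = b"Carbon dioxide: 400.16 ppm"

h1 = sha256_hex(original)
h2 = sha256_hex(altered)
print(h1)
print(h2)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```

Running the sketch prints two digests that share almost no positions, which is the property that makes hashes useful as fixity information.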
  16. 16. Generate hashes on an archived page
      • Compute a hash value on the downloaded HTML content
        % curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
        17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
      • curl downloads the page; shasum computes the SHA256 hash
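A rough Python equivalent of the `curl | shasum` pipeline is sketched below, using only the standard library. The function names are illustrative, and the digest obtained today will most likely differ from the one on the slide because the live page keeps changing.

```python
import hashlib
import urllib.request


def sha256_of_bytes(body: bytes) -> str:
    """Hash an already-downloaded response body."""
    return hashlib.sha256(body).hexdigest()


def sha256_of_uri(uri: str, timeout: float = 30.0) -> str:
    """Download uri and hash the response body (requires network access)."""
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        return sha256_of_bytes(response.read())
```

For example, `sha256_of_uri("https://climate.nasa.gov/vital-signs/carbon-dioxide/")` hashes only the main HTML file, not its embedded images, scripts, or stylesheets.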
  17. 17. What if an image has changed? Computing hashes on only the HTML content will NOT detect such changes
  18. 18. Potential solution: include all resources in hash calculation
      • A composite memento, https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/, has:
        • 201 images
        • 19 JavaScript files
        • 3 CSS files
        • Main HTML file
      • All resources → a single aggregated hash value
      • Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping
      • Turns out it is hard to get repeatable hashes on composite mementos:
        https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
        http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
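One possible way to aggregate the resources of a composite memento into a single hash is sketched below. The scheme (hash each resource, then hash the sorted list of URI and digest pairs) is an assumption made for illustration; it is not necessarily what the tools referenced on the slide do.

```python
import hashlib
from typing import Mapping


def composite_hash(resources: Mapping[str, bytes]) -> str:
    """Aggregate per-resource SHA-256 digests into one digest.

    resources maps each resource URI (main HTML, images, JavaScript, CSS)
    to its downloaded bytes. Sorting by URI makes the result independent
    of the order in which the resources were downloaded.
    """
    outer = hashlib.sha256()
    for uri in sorted(resources):
        inner = hashlib.sha256(resources[uri]).hexdigest()
        outer.update(f"{uri} {inner}\n".encode("utf-8"))
    return outer.hexdigest()
```

With this scheme, a change to any single embedded resource, or to the set of resource URIs, changes the aggregated value.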
  19. 19. Archives transform original content to appropriately replay mementos in a user’s browser
      Transformation examples:
      • Add banners
      • Rewrite links to point to the archive, not to the live web
      • Modify HTML code to convey metadata
      • Apply some policies for security (e.g., block some content)
      • Provide the content in different formats (e.g., ZIP and screenshots)
  20. 20. Archives add banners
      • To convey information like the number of mementos and inform users that what they are viewing is from the archive
      • Banners change → different hashes
      • Example: http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk, replayed in 2016 (43 mementos) and replayed in 2017 (49 mementos)
  21. 21. Archives rewrite links to embedded resources
      • Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/
      • Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
      • Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
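The structure of a rewritten Internet Archive link can be made explicit with a small parser. This is a sketch that assumes the `web.archive.org/web/<14-digit datetime><optional modifier>/<original URI>` layout seen in the URLs on the slide; `UriM` and `parse_uri_m` are names introduced here, and other archives use different layouts.

```python
import re
from typing import NamedTuple, Optional


class UriM(NamedTuple):
    timestamp: str  # 14-digit capture datetime, e.g. 19970725063110
    modifier: str   # replay modifier such as "im_", or "" if absent
    original: str   # URI of the original resource (URI-R)


_URI_M = re.compile(
    r"^(?:https?://)?web\.archive\.org/web/(\d{14})([a-z]{2}_)?/(.+)$"
)


def parse_uri_m(uri_m: str) -> Optional[UriM]:
    """Split a web.archive.org URI-M into its parts, or return None."""
    match = _URI_M.match(uri_m)
    if match is None:
        return None
    return UriM(match.group(1), match.group(2) or "", match.group(3))
```

Recovering the original URI this way makes it possible to compare link targets before and after rewriting.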
  22. 22. Live web resources linked from archives
      • Resources from the live web are expected to change → different hashes
      • Based on feedback from Lerner et al., IA solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
      • Example (http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html): the page was archived in 2008, the memento was replayed in 2012, and the ad is from 2012
      • A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 16th ACM conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
  23. 23. Caches may temporarily hide changes in the playback
        % date
        Mon Oct 2 01:15:18 EDT 2017
        % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
        477b6d923cbb7bf9675a0d2feb37afd3
        % date
        Mon Oct 2 01:16:29 EDT 2017
        % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
        477b6d923cbb7bf9675a0d2feb37afd3
        % date
        Mon Oct 2 01:19:31 EDT 2017
        % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
        477b6d923cbb7bf9675a0d2feb37afd3
        % date
        Mon Oct 2 02:10:24 EDT 2017
        % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
        dda6a9bf091d412cbdc2226ce3eb1059
      [Callouts on the slide: X-Page-Cache: MISS, X-Page-Cache: HIT, X-Page-Cache: MISS, X-Page-Cache: HIT]
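One way to keep cached playback out of a fixity computation is to look at the `X-Page-Cache` response header before hashing. The sketch below assumes the caller has already collected (headers, body) pairs; the function names are illustrative, and treating only `HIT` as "served from cache" is an assumption about how the header is used.

```python
import hashlib
from typing import Iterable, List, Mapping, Optional, Tuple


def page_cache_status(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Page-Cache value (upper-cased), or None if absent."""
    for name, value in headers.items():
        if name.lower() == "x-page-cache":
            return value.strip().upper()
    return None


def md5_of_uncached(
    observations: Iterable[Tuple[Mapping[str, str], bytes]]
) -> List[str]:
    """Return MD5 digests only for responses not marked as cache hits."""
    return [
        hashlib.md5(body).hexdigest()
        for headers, body in observations
        if page_cache_status(headers) != "HIT"
    ]
```

Digests from responses marked `HIT` are skipped, so a stale cached copy does not mask a change in what the archive would otherwise replay.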
  24. 24. Dynamic content by JS → different hashes
  25. 25. Dynamic content by JS → different hashes
  26. 26. Dynamic content by JS → different hashes
  27. 27. Dynamic content by JS → different hashes
  28. 28. Dynamic content by JS → different hashes. A large number of mementos are unavailable
  29. 29. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  30. 30. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  31. 31. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  32. 32. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
        function random_imglink(){
          myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
          myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
          myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
          var ry=Math.floor(Math.random(1)*myimages.length)
          if (ry==0) ry=1
          document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
        }
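The effect of the random selection in `random_imglink()` can be simulated in Python. The sketch assumes `myimages.length` is 4 (indices 1 to 3 assigned, index 0 unused); the array declaration is not shown on the slide, so that length is an assumption. `Math.random` takes no arguments in JavaScript, so the `1` passed to it has no effect.

```python
import math
import random
from collections import Counter


def pick_index(array_length: int, rng: random.Random) -> int:
    """Mirror: ry = Math.floor(Math.random() * length); if (ry == 0) ry = 1."""
    ry = math.floor(rng.random() * array_length)
    return 1 if ry == 0 else ry


# Assumed array length of 4; seeded only to make the simulation repeatable.
rng = random.Random(2019)
counts = Counter(pick_index(4, rng) for _ in range(10_000))
print(sorted(counts.items()))
```

Under that assumption, banner 1 is chosen about half of the time and banners 2 and 3 about a quarter each, so two replays of the same memento frequently embed different images and therefore yield different hashes.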
  33. 33. Do archived pages change? A TimeMap = a list of available archived pages [Timeline figure: Live Web, Archive, and Replay rows; May 2016, April 2017, April 2018; label "400.15 ppm"]
  34. 34. Do archived pages change? [Timeline figure as on slide 33]
  35. 35. Do archived pages change? [Timeline figure as on slide 33, with one memento crossed out and a "302 Redirect" arrow]
  36. 36. Changes in TimeMaps → different HTTP entity → different hashes
      • URI-M1 was NOT available
      • You can see the difference in the URI-M of the main HTML file:
        web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
        web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
  37. 37. Changes in TimeMaps → different image → different hashes
      • [Figure: URI-M1 requested in December 2017 and again in March 2018; when URI-M1 was NOT available, the request was answered with a 302 Redirect to URI-M2]
      • URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      • URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      • You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images
      • Main HTML file (same URI-M on both dates): https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
  38. 38. Changes in TimeMaps → different image that looks the same → different hashes
      • [Figure: URI-M1 requested on December 12, 2017 and again on December 25, 2017; when URI-M1 was NOT available, the request was answered with a 302 Redirect to URI-M2, a different image]
      • URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      • URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      • You can't see the difference in the URI-M of the main HTML file nor the difference in the embedded images
      • Main HTML file (same URI-M on both dates): http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
  39. 39. Transient error
      • Incomplete HTTP entity
      • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg, image downloaded on December 7, 2017:
          WARC/1.0
          WARC-Type: response
          WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
          WARC-Date: 2017-12-07T10:04:18Z
          …
          Content-Length: 459640
          HTTP/1.0 200
          Content-Type: image/jpeg
          Content-Length: 642336
          Date: Thu, 07 Dec 2017 10:04:18 GMT
          …
      • The first Content-Length should be bigger than the second one
  40. 40. The complete HTTP entity
      • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg:
          WARC/1.0
          WARC-Type: response
          WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
          WARC-Date: 2017-11-16T15:34:37Z
          …
          Content-Length: 643398
          HTTP/1.0 200
          Content-Type: image/jpeg
          Content-Length: 642336
          Date: Thu, 16 Nov 2017 15:34:36 GMT
          …
      • This is what it should look like
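The check described on these two slides can be written down directly: the WARC record's `Content-Length` covers the HTTP status line, headers, and body, so it should exceed the HTTP `Content-Length`, which covers the body alone. The function names below are illustrative.

```python
def looks_truncated(warc_content_length: int, http_content_length: int) -> bool:
    """True if the WARC record block cannot hold the declared HTTP body."""
    return warc_content_length <= http_content_length


def minimum_missing_bytes(warc_content_length: int, http_content_length: int) -> int:
    """Lower bound on the number of body bytes missing from the record."""
    return max(0, http_content_length - warc_content_length)


# Values from the two WARC records shown on the slides.
print(looks_truncated(459640, 642336))        # December 7, 2017 download
print(looks_truncated(643398, 642336))        # November 16, 2017 download
print(minimum_missing_bytes(459640, 642336))
```

For the December 7 download the record is at least 182,696 bytes short of the declared image size, while the November 16 record leaves 1,062 bytes for the HTTP status line and headers.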
  41. 41. Requesting the raw version, a third party service (Cloudflare) injects HTML code
        curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
        Date: Tue, 15 May 2018 21:00:45 GMT
        <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
        curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
        Date: Tue, 15 May 2018 21:00:50 GMT
        <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
        curl -silent http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
        Date: Tue, 15 May 2018 21:00:51 GMT
        <a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
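The three injected tokens differ on every request, yet they appear to carry the same payload. The sketch below assumes the commonly described Cloudflare email-obfuscation scheme (the first byte is an XOR key for the remaining bytes); that scheme is not stated on the slide. It decodes the first ten bytes of each token shown above, and it also shows one possible normalization (dropping the token) before hashing. Both function names are illustrative.

```python
import re


def decode_cf_email(token_hex: str) -> str:
    """XOR-decode a token, assuming the first byte is the key."""
    raw = bytes.fromhex(token_hex)
    key, payload = raw[0], raw[1:]
    return bytes(b ^ key for b in payload).decode("utf-8", errors="replace")


_CF_TOKEN = re.compile(r"(/cdn-cgi/l/email-protection#)[0-9a-f]+")


def strip_cf_tokens(html: str) -> str:
    """Remove the per-request token so repeated downloads compare equal."""
    return _CF_TOKEN.sub(r"\1", html)


# First ten bytes of the three tokens from the slide.
prefixes = ["28175b5d4a424d4b5c15", "68571b1d0a020d0b1c55", "b986caccdbd3dcdacd84"]
decoded = [decode_cf_email(p) for p in prefixes]
print(decoded)  # ['?subject=', '?subject=', '?subject=']
```

Under that assumption, only the key changes between requests, which is enough to change the hash of the raw HTML even though the underlying archived content is the same.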
  42. 42. Requirements for generating repeatable hashes
      1. Generate a hash on a composite memento
      2. Exclude archive-specific resources
      3. Avoid resources from the live web
      4. Avoid content served from cache
      5. Changes in TimeMaps might affect the computation of hashes
      6. Avoid including dynamic content
      Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping Archived Web Pages. Tech. Rep. arXiv:1712.03140 (2017), https://arxiv.org/pdf/1712.03140.pdf
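Requirements 2 and 3 can be read as a filter applied to the list of embedded resource URIs before hashing. The sketch below is one hypothetical way to express that filter; the function name, the `archive_specific_prefixes` parameter, and the `/_static/` prefix in the usage note are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence
from urllib.parse import urlsplit


def select_hashable(
    resource_uris: Iterable[str],
    archive_host: str,
    archive_specific_prefixes: Sequence[str],
) -> List[str]:
    """Keep only archived resources that belong to the original page."""
    kept = []
    for uri in resource_uris:
        parts = urlsplit(uri)
        if parts.hostname != archive_host:
            continue  # requirement 3: resource is served from the live web
        if any(parts.path.startswith(p) for p in archive_specific_prefixes):
            continue  # requirement 2: banner scripts, archive stylesheets, etc.
        kept.append(uri)
    return kept
```

For example, calling it with `archive_host="web.archive.org"` and `archive_specific_prefixes=["/_static/"]` would drop both a live-web image link and an archive banner script while keeping rewritten links to archived resources.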
  43. 43. Our study indicates that about 88% of mementos produce different hashes (preliminary work)
      • 16,627 archived pages
      • From 17 public web archives
      • Downloaded 35 times
      • Between November 16, 2017 and October 19, 2018
  44. 44. Conclusions
      • We downloaded 16,627 mementos 35 times between November 16, 2017 and October 19, 2018
      • Within about 11 months, we found that 88% of mementos produce different hash values
      • It is hard to get repeatable hashes on the playback of archived web pages because of transient errors, dynamic URI-Ms, and instability of indexes in archives
      • We need an archive-aware hashing function to produce repeatable hashes
      • For more information: https://www.cs.odu.edu/~maturban/fixity.html
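A statistic of this kind can be computed from repeated downloads as sketched below. This is a minimal illustration with made-up data, not the study's actual pipeline: it assumes each memento maps to the list of hashes observed across its downloads and counts a memento as changed if more than one distinct hash was seen.

```python
from typing import Mapping, Sequence


def fraction_changed(hashes_by_memento: Mapping[str, Sequence[str]]) -> float:
    """Fraction of mementos for which repeated downloads disagree."""
    changed = sum(
        1 for hashes in hashes_by_memento.values() if len(set(hashes)) > 1
    )
    return changed / len(hashes_by_memento)


# Hypothetical observations: three downloads of four mementos.
toy = {
    "memento-a": ["h1", "h1", "h1"],
    "memento-b": ["h2", "h3", "h2"],
    "memento-c": ["h4", "h4", "h5"],
    "memento-d": ["h6", "h7", "h8"],
}
print(fraction_changed(toy))  # 0.75
```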
