It is hard to compute fixity on archived web pages

Mohamed Aturban
Advisor: Michele C. Weigle
Co-advisor: Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
@WebSciDL, @maturban1
Department of Computer Science, PhD Gathering, January 17, 2019
climate.nasa.gov/vital-signs/carbon-dioxide/
looks like this right now
The Internet Archive allows us to view
previous versions (mementos) of that page
http://web.archive.org/web/*/https://climate.nasa.gov/vital-signs/carbon-
https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/
An archived page (memento) from July 2016
Live web page vs. archived web page
July 2016 (archived): https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/
Now (live): https://climate.nasa.gov/vital-signs/carbon-dioxide/
Web pages change on the live web?
[Figure: timeline of climate.nasa.gov/vital-signs/carbon-dioxide/ on the live web, marked at May 2016, April 2017, and April 2018]
Archives make copies of web pages
[Figure: timeline with rows for the live web and the archive, marked at May 2016, April 2017, and April 2018]
Do archived pages change?
[Figure: timeline with rows for the live web, the archive, and replay, marked at May 2016, April 2017, and April 2018; the replayed memento shows 400.15 ppm]
When replaying the archived page at different
points in time, will we get the same content?
Our study shows that we are not always
presented with the same archived content!
Cryptographic hashes to create
fixity information
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
SHA256 of one input: 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
SHA256 of a slightly different input: 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
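A minimal Python sketch of this avalanche effect, using only the standard hashlib module. The two input strings are made up for illustration; they are not the inputs behind the hashes shown on the slide.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical inputs that differ in a single character.
first = sha256_hex(b"CO2: 400.15 ppm")
second = sha256_hex(b"CO2: 400.16 ppm")

print(first)
print(second)
print(differing_hex_digits(first, second), "of 64 hex digits differ")
```

Because the digests are unrelated even for near-identical inputs, a hash can say that something changed but not how much.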
Generate hashes on an archived page
• Compute a hash value on the downloaded HTML content
% curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
(curl downloads the page; shasum -a 256 computes the SHA256 hash)
What if an image has changed?
Computing hashes on only HTML content will
NOT detect changes
Potential solution: include all
resources in hash calculation
A composite memento (https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/) has
• 201 images
• 19 JavaScript files
• 3 CSS files
• Main HTML file
→ A single aggregated hash value
Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping
Turns out it is hard to get repeatable hashes on composite mementos:
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
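One possible way to get a single aggregated hash value, sketched in Python. This is an assumption for illustration, not the method used by the tools cited above: each resource is hashed on its own, and the (URI, digest) pairs are hashed again in sorted order so that download order does not matter. The function name and the toy resources are hypothetical.

```python
import hashlib
from typing import Mapping


def composite_hash(resources: Mapping[str, bytes]) -> str:
    """Aggregate per-resource SHA-256 digests into one digest.

    resources maps each resource URI to the bytes downloaded for it.
    URIs are processed in sorted order so the result is order-independent.
    """
    outer = hashlib.sha256()
    for uri in sorted(resources):
        inner = hashlib.sha256(resources[uri]).hexdigest()
        outer.update(f"{uri} {inner}\n".encode("utf-8"))
    return outer.hexdigest()


# Hypothetical composite memento: a main HTML file plus two embedded resources.
page = {
    "index.html": b"<html><img src='logo.png'></html>",
    "logo.png": b"\x89PNG original bytes",
    "style.css": b"body { margin: 0 }",
}
# Same page, but the embedded image was served with different bytes.
changed = {**page, "logo.png": b"\x89PNG different bytes"}

print(composite_hash(page))
print(composite_hash(changed))
```

Unlike a hash over the HTML alone, the aggregated value changes when only an embedded image changes.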
Archives transform original content to
appropriately replay mementos in a user’s
browser
Transformation examples:
• Add banners
• Rewrite links to point to the archive, not to
the live web
• Modify HTML code to convey metadata
• Apply some policies for security (e.g., block
some content)
• Provide the content in different formats (e.g.,
ZIP and screenshots)
Archives add banners
• To convey information like the number of mementos
and inform users that what they are viewing is from the
archive
• Banners change → different hashes
Example: http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
replayed in 2016 (43 mementos) vs. replayed in 2017 (49 mementos)
Archives rewrite links to embedded
resources
Memento: web.archive.org/web/19961120150251/http://www.usnews.com:80/
Original link: http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
Rewritten link: http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif
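A rewritten link can be taken apart again. The sketch below assumes the Wayback-style layout seen in the URLs on these slides: an archive prefix, a 14-digit datetime, an optional two-letter modifier such as im_ or id_, and then the original URI. The names parse_urim and UriM are hypothetical, and other archives may use other layouts.

```python
import re
from typing import NamedTuple, Optional


class UriM(NamedTuple):
    timestamp: str            # 14-digit capture datetime, YYYYMMDDhhmmss
    modifier: Optional[str]   # e.g. "im_" or "id_"; None when absent
    original: str             # URI of the original resource


# Assumed layout: <archive prefix>/<14 digits><optional modifier>/<original URI>
_URIM = re.compile(r"/(\d{14})([a-z]{2}_)?/(.+)$")


def parse_urim(urim: str) -> UriM:
    """Split a Wayback-style URI-M into datetime, modifier, and original URI."""
    match = _URIM.search(urim)
    if match is None:
        raise ValueError(f"not a recognised URI-M: {urim}")
    return UriM(match.group(1), match.group(2), match.group(3))


rewritten = ("http://web.archive.org/web/19970725063110im_/"
             "http://www.usnews.com:80/usnews/GRAPHICS/logo.gif")
print(parse_urim(rewritten))
```

Stripping the archive prefix this way is one option for comparing links without the archive-specific rewriting getting in the way.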
Live web resources linked from archives
• Resources from the live web are expected to change → different hashes
• Based on feedback from Lerner et al., IA solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Example: the page was archived in 2008, but the ad is from 2012, when this memento was replayed
A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
Caches may temporarily hide
changes in the playback
% date
Mon Oct 2 01:15:18 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:16:29 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 01:19:31 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
477b6d923cbb7bf9675a0d2feb37afd3
% date
Mon Oct 2 02:10:24 EDT 2017
% curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
dda6a9bf091d412cbdc2226ce3eb1059
X-Page-Cache response header values shown on the slide: MISS, HIT, MISS, HIT
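The four requests above can be summarised with a small Python sketch that reports when the hash first differs from the previous request. The times and MD5 values are copied from the slide; the function name change_points is hypothetical.

```python
from typing import List, Tuple

# (local time of request, MD5 of the response body), copied from the four requests above.
observations: List[Tuple[str, str]] = [
    ("01:15:18", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("01:16:29", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("01:19:31", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("02:10:24", "dda6a9bf091d412cbdc2226ce3eb1059"),
]


def change_points(obs: List[Tuple[str, str]]) -> List[str]:
    """Return the request times at which the hash differs from the previous request."""
    return [time for (_, prev), (time, cur) in zip(obs, obs[1:]) if prev != cur]


print(change_points(observations))  # ['02:10:24']
```

Requests that are only minutes apart agree with each other here, so repeating a download right away is not enough to show that a hash is stable.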
Dynamic content by JS → different hashes
A large number of
mementos are unavailable
A resource selected randomly by
JavaScript
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
function random_imglink(){
myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
var ry=Math.floor(Math.random(1)*myimages.length)
if (ry==0)
ry=1
document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
}
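To see why this script defeats repeatable hashing, its selection logic can be simulated in Python. The sketch assumes that myimages has length 4 (indices 1 to 3 assigned, index 0 unused), which is what JavaScript would report for an array filled this way; the array declaration itself is not shown on the slide. Math.random ignores its argument, so it is modelled with a plain uniform draw.

```python
import math
import random
from collections import Counter

# Banner file names from the script above, keyed by their array index.
BANNERS = {1: "bannerbluemnt.jpg", 2: "bannereagle.jpg", 3: "bannertiger.jpg"}
ARRAY_LENGTH = 4  # assumption: indices 0..3, with index 0 never assigned


def pick_banner(rng: random.Random) -> str:
    """Mirror the script: floor(random * length), with 0 remapped to 1."""
    ry = math.floor(rng.random() * ARRAY_LENGTH)
    if ry == 0:
        ry = 1
    return BANNERS[ry]


def replay_many(n: int, seed: int = 0) -> Counter:
    """Simulate n replays of the memento and count which banner each one shows."""
    rng = random.Random(seed)
    return Counter(pick_banner(rng) for _ in range(n))


print(replay_many(1000))
```

Under that assumption the first banner is chosen about half of the time and the other two about a quarter each, so two replays of the same memento will often embed different images.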
Do archived pages change?
A TimeMap = a list of available archived pages
[Figure: the timeline again (live web, archive, replay; May 2016, April 2017, April 2018), with the 400.15 ppm memento marked "X" and a "302 Redirect" label]
Changes in TimeMaps → different HTTP entity → different hashes
• You can see the difference in the URI-M of the main HTML file
URI-M1 = web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
URI-M2 = web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
December 2017: requesting URI-M1 returned URI-M1
March 2018: URI-M1 was NOT available; requesting URI-M1 resulted in a 302 redirect to URI-M2
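A redirect of this kind can be noticed by comparing the 14-digit datetime in the requested URI-M with the one in the URI that was finally served. A minimal Python sketch, assuming Wayback-style URI-Ms and assuming the final URI is obtained from whatever HTTP client follows the redirect; the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: /<14-digit datetime><optional modifier such as id_>/
_TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def memento_datetime(urim: str) -> datetime:
    """Extract the capture datetime embedded in a Wayback-style URI-M."""
    match = _TIMESTAMP.search(urim)
    if match is None:
        raise ValueError(f"no 14-digit datetime in: {urim}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def drift(requested: str, final: str) -> timedelta:
    """Time between the requested memento and the memento actually served."""
    return memento_datetime(final) - memento_datetime(requested)


urim1 = "web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/"
urim2 = "web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/"
print(drift(urim1, urim2))  # non-zero: a different memento was served
```

A non-zero drift means the hash was computed over a different capture than the one requested, so it should not be compared with earlier hashes of URI-M1.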
Changes in TimeMaps → different image → different hashes
• You can't see the difference in the URI-M of the main HTML file, but you
can see the difference in the embedded images
Main HTML file (same URI-M on both dates): https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
December 12, 2017: requesting URI-M1 returned URI-M1
December 25, 2017: URI-M1 was NOT available; requesting URI-M1 resulted in a 302 redirect to URI-M2
Changes in TimeMaps → different image that looks the same → different hashes
• You can't see the difference in the URI-M of the main HTML file nor the
difference in the embedded images
Main HTML file: http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M1 was NOT available → different image
Transient error
• Incomplete HTTP
entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
Download the image on December 7, 2017
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-12-07T10:04:18Z
…
Content-Length: 459640
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 07 Dec 2017 10:04:18 GMT
…
The first Content-Length should be bigger than the second one
The complete HTTP entity
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-11-16T15:34:37Z
…
Content-Length: 643398
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 16 Nov 2017 15:34:36 GMT
…
This is what it should look like
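The Content-Length comparison in the two records above can be automated. In a WARC response record the record-level Content-Length covers the whole HTTP response (headers plus body), so it should not be smaller than the HTTP Content-Length. The Python sketch below assumes the record text lists the WARC header first and the HTTP header second, as in the two records above; the function names are hypothetical.

```python
import re
from typing import List

_CONTENT_LENGTH = re.compile(r"^Content-Length:\s*(\d+)\s*$",
                             flags=re.MULTILINE | re.IGNORECASE)


def content_lengths(record_text: str) -> List[int]:
    """Return every Content-Length value in the record, in order of appearance."""
    return [int(value) for value in _CONTENT_LENGTH.findall(record_text)]


def entity_is_incomplete(record_text: str) -> bool:
    """True when the WARC block is shorter than the HTTP entity it should contain."""
    lengths = content_lengths(record_text)
    if len(lengths) < 2:
        raise ValueError("expected a WARC and an HTTP Content-Length")
    warc_length, http_length = lengths[0], lengths[1]
    return warc_length < http_length


# Header values copied from the two records above.
december_7 = "WARC/1.0\nContent-Length: 459640\n\nHTTP/1.0 200\nContent-Length: 642336\n"
november_16 = "WARC/1.0\nContent-Length: 643398\n\nHTTP/1.0 200\nContent-Length: 642336\n"

print(entity_is_incomplete(december_7))   # True
print(entity_is_incomplete(november_16))  # False
```

A download flagged this way is better retried than hashed, since the truncated bytes would yield a hash that no complete download can reproduce.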
When requesting the raw version, a third-party
service (Cloudflare) injects HTML code
% curl -si http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:45 GMT
<a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
% curl -si http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:50 GMT
<a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
% curl -si http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:51 GMT
<a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
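The three injected strings differ, yet they appear to encode the same text. The Python sketch below assumes, based only on the strings shown above and not on Cloudflare documentation, that the first byte is an XOR key applied to every following byte. It decodes the first ten bytes of each string; the function name is hypothetical.

```python
def decode_email_protection(fragment_hex: str) -> str:
    """Decode a hex string, assuming byte 0 is an XOR key for the remaining bytes."""
    raw = bytes.fromhex(fragment_hex)
    key, payload = raw[0], raw[1:]
    return bytes(b ^ key for b in payload).decode("utf-8", errors="replace")


# First ten bytes of the fragments returned by the three requests above.
samples = [
    "28175b5d4a424d4b5c15",
    "68571b1d0a020d0b1c55",
    "b986caccdbd3dcdacd84",
]
for sample in samples:
    print(sample, "->", decode_email_protection(sample))
```

All three prefixes decode to the same text under this assumption, which suggests that only the per-request key changes. Hashing the decoded form, or excluding the injected link, would avoid this source of variation.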
Requirements for generating
repeatable hashes
1. Generate a hash on a composite memento
2. Exclude archive-specific resources
3. Avoid resources from the live web
4. Avoid content served from cache
5. Changes in TimeMaps might affect the
computation of hashes
6. Avoid including dynamic content
Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping
Archived Web Pages. Tech. Rep. arXiv:1712.03140 (2017)
https://arxiv.org/pdf/1712.03140.pdf
Our study indicates that about 88% of
mementos produce different hashes
• 16,627 archived pages
• From 17 public web archives
• Downloaded 35 times
• Between November 16, 2017 and
October 19, 2018
Preliminary work
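A figure such as the 88% above is the share of mementos whose repeated downloads did not all hash to the same value. The Python sketch below shows that calculation on toy data; the function name, the URI-M labels, and the hash strings are hypothetical, and this is not the study's actual pipeline.

```python
from typing import Dict, List


def fraction_with_changing_hash(downloads: Dict[str, List[str]]) -> float:
    """Fraction of mementos whose downloads produced more than one distinct hash."""
    if not downloads:
        raise ValueError("no mementos given")
    changed = sum(1 for hashes in downloads.values() if len(set(hashes)) > 1)
    return changed / len(downloads)


# Toy data: three hypothetical mementos, each downloaded three times.
toy = {
    "urim-a": ["h1", "h1", "h1"],  # stable across downloads
    "urim-b": ["h2", "h3", "h2"],  # changed once, then changed back
    "urim-c": ["h4", "h4", "h5"],  # changed on the last download
}
print(fraction_with_changing_hash(toy))
```

With this definition a memento counts as changed even if only one of its downloads differs from the rest.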
Conclusions
• We downloaded 16,627 mementos 35 times
between November 16, 2017 and October 19, 2018
• Within about 11 months, we found that 88% of
mementos produce different hash values
• It is hard to get repeatable hashes on the playback of
archived web pages because of transient errors,
dynamic URI-Ms, and instability of indexes in archives
• We need an archive-aware hashing function to
produce repeatable hashes
• For more information: https://www.cs.odu.edu/~maturban/fixity.html

It is hard to compute fixity on archived web pages

  • 1. It is hard to compute fixity on archived web pages Mohamed Aturban Advisor: Michele C. Weigle Co-advisor: Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group @WebSciDL, @maturban1 Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 3. 3 The Internet Archive allows us to view previous versions (mementos) of that page
  • 6. https://web.archive.org/web/20160708040004/https://climate.nasa.gov/vital-signs/carbon-dioxide/ Live web page vs. archived web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ July 2016 Now
  • 7. 7 Web pages change on the live web? Time Live Web May 2016 April 2017 April 2018 climate.nasa.gov/vital-signs/carbon-dioxide/
  • 8. 8 Archives make copies of web pages Time Live Web Archive May 2016 April 2017 April 2018 Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 9. 9 Do archived pages change? Time Live Web Archive Replay May 2016 April 2017 April 2018 Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 10. 10 Do archived pages change? Time Live Web Archive Replay May 2016 April 2017 April 2018 When replaying the archived page at different points in time, will we get the same content?
  • 11. 11 Do archived pages change? Time Live Web Archive Replay May 2016 April 2017 April 2018 When replaying the archived page at different points in time, will we get the same content?
  • 12. 12 Do archived pages change? Time Live Web Archive Replay May 2016 April 2017 April 2018 When replaying the archived page at different points in time, will we get the same content?
  • 13. 400.15 ppm 13 Do archived pages change? Time Live Web Archive Replay May 2016 April 2017 April 2018 When replaying the archived page at different points in time, will we get the same content?
  • 14. 400.15 ppm 14 Do archived pages change? Time Live Web Archive Replay May 2016 Our study shows that we are not always presented with the same archived content! ? April 2017 April 2018
  • 15. 15 Cryptographic hashes to create fixity information • Common hash algorithms (e.g., MD5, SHA256): A small change in the input  a large change output SHA256 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf SHA256 Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 16. 16 Generate hashes on an archived page • Compute a hash value on the downloaded HTML content % curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256 17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d Compute SHA256 hash Download the page Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 17. 17 What if an image has changed? Computing hashes on only HTML content will NOT detect changes
  • 18. 18 Potential solution: include all resources in hash calculation https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/ • 201 images • 19 JavaScript files • 3 CSS files • Main HTML file A single aggregated hash value www.gwern.net/Timestamping (Existing tools for generating a hash value on a composite archived page ) has A composite memento https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html Turns out it is hard to get repeatable hashes on composite mementos
  • 19. 19 Archives transform original content to appropriately replay mementos in a user’s browser • Add banners • Rewrite links to point to the archive, not to the live web • Modify HTML code to convey metadata • Apply some policies for security (e.g., block some content) • Provide the content in different format (e.g., ZIP and screenshots) Transformation examples: Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 20. 20 Archives add banners • To convey information like the number of mementos and inform users that what they are viewing is from the archive • Banners change  different hashes Replayed in 2016 (43 mementos) Replayed in 2017 (49 mementos) http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 21. 21 Archives rewrite links to embedded resources web.archive.org/web/19961120150251 /http://www.usnews.com:80/ http://web.archive.org/web/19970725063110im_/http://www.usnews.com:80/usnews/GRAPHICS/logo.gif http://www.usnews.com:80/usnews/GRAPHICS/logo.gif Department of Computer Science, PhD Gathering, January 17, 2019 @maturban1, @WebSciDL
  • 22. Live web resources linked from archives
      • Resources from the live web are expected to change → different hashes
      • Based on feedback from Lerner et al., IA solved this issue with the Content-Security-Policy HTTP header, but the problem might still occur in other archives
      • Example (http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html): the page was archived in 2008 and the memento was replayed in 2012, but the ad is from 2012
      • A. Lerner, T. Kohno, and F. Roesner. Rewriting history: Changing the archived web from the present. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1741–1755, 2017.
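A hashing tool could flag embedded resources that would be fetched from the live web rather than from the archive. The sketch below is a simplified assumption of how such a check might look: it only compares host names, and `live_web_leaks` and the example URIs are hypothetical.

```python
from urllib.parse import urlparse


def live_web_leaks(resource_uris, archive_host):
    """Return the URIs of embedded resources not served by the archive host."""
    leaks = []
    for uri in resource_uris:
        host = urlparse(uri).netloc
        # Relative URIs have an empty host and resolve against the archive itself.
        if host and host != archive_host:
            leaks.append(uri)
    return leaks


# Hypothetical list of resources referenced by a replayed memento.
embedded = [
    "http://web.archive.org/web/20080101000000im_/http://example.com/a.png",
    "http://ads.example.com/ad.js",
    "/static/banner.css",
]
print(live_web_leaks(embedded, "web.archive.org"))
```

Resources reported by such a check would be candidates for exclusion from the hash, since their content is expected to change.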
  • 23. Caches may temporarily hide changes in the playback
      % date
      Mon Oct 2 01:15:18 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 01:16:29 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 01:19:31 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      477b6d923cbb7bf9675a0d2feb37afd3
      % date
      Mon Oct 2 02:10:24 EDT 2017
      % curl -s http://web.archive.org/web/20130724144801/http://www.cnn.com/ | md5
      dda6a9bf091d412cbdc2226ce3eb1059
      Slide annotations: X-Page-Cache: MISS · X-Page-Cache: HIT · X-Page-Cache: MISS · X-Page-Cache: HIT
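The four observations above can be scanned for the point where the hash changes. This is a minimal sketch using the times and MD5 values from the slide; `change_points` is a hypothetical helper, not part of any archive tooling.

```python
def change_points(observations):
    """Given (time, digest) pairs in time order, return the times at which
    the digest differs from the previous observation."""
    changes = []
    previous = None
    for when, digest in observations:
        if previous is not None and digest != previous:
            changes.append(when)
        previous = digest
    return changes


# Times and MD5 values taken from the slide (Oct 2, 2017, EDT).
observed = [
    ("01:15:18", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("01:16:29", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("01:19:31", "477b6d923cbb7bf9675a0d2feb37afd3"),
    ("02:10:24", "dda6a9bf091d412cbdc2226ce3eb1059"),
]
print(change_points(observed))
```

Requests made minutes apart agree with each other, and the difference only appears about an hour later, which is why closely spaced repeat downloads are not enough to establish that a hash is repeatable.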
  • 24. Dynamic content by JS → different hashes
  • 28. Dynamic content by JS → different hashes. A large number of mementos are unavailable.
  • 29. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 32. A resource selected randomly by JavaScript: https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
      function random_imglink(){
        myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
        myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
        myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
        var ry=Math.floor(Math.random(1)*myimages.length)
        if (ry==0) ry=1
        document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
      }
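The selection logic in `random_imglink()` can be simulated to see why two replays of the same memento may embed different banners. The sketch below mirrors only the index computation. It assumes `myimages.length` is 4 (slots 1 to 3 are filled and slot 0 is unused), which the slide does not show because the array declaration is omitted; `pick_index` is a hypothetical name.

```python
import random
from collections import Counter


def pick_index(length: int, rng: random.Random) -> int:
    """Mirror the slide's logic: floor(random * length), with 0 mapped to 1."""
    ry = int(rng.random() * length)
    if ry == 0:
        ry = 1
    return ry


# Assumption: the array has length 4 (index 0 unused, images at 1..3).
rng = random.Random(0)
counts = Counter(pick_index(4, rng) for _ in range(10_000))
print(sorted(counts.items()))
```

Under that assumption every page load picks one of three banners, so a composite hash that includes the banner image will differ between loads even though nothing in the archive has changed.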
  • 33. Do archived pages change? A TimeMap is a list of available archived pages. [Figure: timeline for May 2016, April 2017, and April 2018 with Live Web, Archive, and Replay rows; the page shows 400.15 ppm]
  • 34. Do archived pages change? [Figure: same timeline]
  • 35. Do archived pages change? [Figure: same timeline, annotated with an X and a 302 Redirect]
  • 36. Changes in TimeMaps → different HTTP entity → different hashes
      • URI-M1 was NOT available, so URI-M2 was returned
      • You can see the difference in the URI-M of the main HTML file:
        web.archive.org/web/20080828005922/http://www.evangelcogdayton.org/
        web.archive.org/web/20090211151609/http://www.evangelcogdayton.org/
  • 37. Changes in TimeMaps → different image → different hashes
      • You can't see the difference in the URI-M of the main HTML file, but you can see the difference in the embedded images
      • Main HTML file in both replays: https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
      • Replays in December 2017 and March 2018 both requested URI-M1; in one of them URI-M1 was NOT available and a 302 redirect led to URI-M2
      • URI-M1 = web.archive.org/web/20110116134258id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      • URI-M2 = web.archive.org/web/20120121090532id/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
  • 38. Changes in TimeMaps → different image that looks the same → different hashes
      • You can't see the difference in the URI-M of the main HTML file nor the difference in the embedded images
      • Main HTML file in both replays: http://perma-archives.org/warc/20170101182813id_/http://umich.edu/
      • Replays on December 12, 2017 and December 25, 2017 both requested URI-M1; in one of them URI-M1 was NOT available and a 302 redirect led to URI-M2, a different image
      • URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      • URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
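A redirect to a different memento can be detected by comparing the 14-digit datetime in the requested URI-M with the one in the URI-M that was finally returned. The sketch below assumes the URI-M layout seen in the three TimeMap examples above; `memento_datetime` and `memento_drift` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: .../<14-digit datetime><optional modifier such as id_>/<original URI>
TIMESTAMP = re.compile(r"/(\d{14})[a-z_]*/")


def memento_datetime(uri_m: str) -> datetime:
    """Extract the memento datetime embedded in a URI-M."""
    match = TIMESTAMP.search(uri_m)
    if match is None:
        raise ValueError(f"no 14-digit datetime found in: {uri_m}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def memento_drift(requested: str, returned: str) -> timedelta:
    """Time between the requested memento and the one actually returned."""
    return memento_datetime(returned) - memento_datetime(requested)


suffix = ("http://umich.edu/includes/image/type/gallery/id/113/name/"
          "ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/")
uri_m1 = "perma-archives.org/warc/20170101182814id_/" + suffix
uri_m2 = "perma-archives.org/warc/20170619145458id_/" + suffix

drift = memento_drift(uri_m1, uri_m2)
print(drift)
```

A non-zero drift signals that the hash was computed on a different capture than the one requested, even when the two images look identical.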
  • 39. Transient error: incomplete HTTP entity
      • Image: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      • Downloaded on December 7, 2017:
        WARC/1.0
        WARC-Type: response
        WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
        WARC-Date: 2017-12-07T10:04:18Z
        …
        Content-Length: 459640
        HTTP/1.0 200
        Content-Type: image/jpeg
        Content-Length: 642336
        Date: Thu, 07 Dec 2017 10:04:18 GMT
        …
      • The first Content-Length (the WARC record) should be larger than the second one (the HTTP entity), because the record contains the HTTP headers plus the entity. Here 459640 is smaller than 642336, so the entity is incomplete.
  • 40. The complete HTTP entity
      • Image: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      • This is what it should look like:
        WARC/1.0
        WARC-Type: response
        WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
        WARC-Date: 2017-11-16T15:34:37Z
        …
        Content-Length: 643398
        HTTP/1.0 200
        Content-Type: image/jpeg
        Content-Length: 642336
        Date: Thu, 16 Nov 2017 15:34:36 GMT
        …
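Before hashing, a tool can check whether the downloaded entity is as long as its `Content-Length` header declares. The sketch below works on a raw HTTP response held in memory and is an illustration under simplifying assumptions (CRLF line endings, no chunked transfer encoding); `entity_is_complete` is a hypothetical helper.

```python
def entity_is_complete(raw_response: bytes) -> bool:
    """Compare the declared Content-Length with the number of body bytes received."""
    head, separator, body = raw_response.partition(b"\r\n\r\n")
    if not separator:
        return False  # headers never terminated, so the response is truncated
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip().decode("ascii"))
    if declared is None:
        raise ValueError("response has no Content-Length header")
    return len(body) == declared


# Hypothetical tiny responses standing in for the 642336-byte image.
headers = b"HTTP/1.0 200\r\nContent-Type: image/jpeg\r\nContent-Length: 5\r\n\r\n"
print(entity_is_complete(headers + b"abcde"))  # full body received
print(entity_is_complete(headers + b"abc"))    # body cut short
```

Skipping or retrying truncated downloads keeps a transient error from being mistaken for a change in the archived content.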
  • 41. When requesting the raw version, a third-party service (Cloudflare) injects HTML code that differs on every request
      curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:45 GMT
      <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
      curl -s http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:50 GMT
      <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
      curl -silent http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
      Date: Tue, 15 May 2018 21:00:51 GMT
      <a href="/cdn-cgi/l/email-protection#b986caccdbd3dcdacd84f899c599f894e399f0d7dddcc199d6df99ec97ea9799fed6cfdccbd7d4dcd7cd99fddcc9d8cbcdd4dcd7cdca99d8d7dd99f8dedcd7dad0dcca9fd8d4c982dbd6ddc084d1cdcdc9ca839696cecece97cccad897ded6cf96dfdcdddccbd8d594d8dedcd7dad0dcca96d8" …
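One possible workaround is to normalize the injected token before hashing. This is a sketch of that idea, not a method described in the talk: it assumes the injected value always appears as a hex string after `/cdn-cgi/l/email-protection#`, as in the three responses above, and `normalized_hash` is a hypothetical helper.

```python
import hashlib
import re

# Assumed pattern for the per-request token injected into the HTML.
EMAIL_PROTECTION = re.compile(rb"/cdn-cgi/l/email-protection#[0-9a-f]+")


def normalized_hash(html: bytes) -> str:
    """Hash the HTML after blanking the per-request email-protection token."""
    cleaned = EMAIL_PROTECTION.sub(b"/cdn-cgi/l/email-protection#", html)
    return hashlib.sha256(cleaned).hexdigest()


# Shortened stand-ins for two responses; only the injected hex value differs.
first = b'<a href="/cdn-cgi/l/email-protection#28175b5d4a42">contact</a>'
second = b'<a href="/cdn-cgi/l/email-protection#68571b1d0a02">contact</a>'

print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
print(normalized_hash(first) == normalized_hash(second))
```

The trade-off is that normalization also hides genuine changes to the protected address, so it suits repeatability checks better than tamper detection.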
  • 42. Requirements for generating repeatable hashes
      1. Generate a hash on a composite memento
      2. Exclude archive-specific resources
      3. Avoid resources from the live web
      4. Avoid content served from cache
      5. Changes in TimeMaps might affect the computation of hashes
      6. Avoid including dynamic content
      Aturban, M., Nelson, M.L., Weigle, M.C.: Difficulties of Timestamping Archived Web Pages. Tech. Rep. arXiv:1712.03140 (2017) https://arxiv.org/pdf/1712.03140.pdf
  • 43. Preliminary work: our study indicates that about 88% of mementos produce different hashes
      • 16,627 archived pages
      • From 17 public web archives
      • Downloaded 35 times
      • Between November 16, 2017 and October 19, 2018
  • 44. Conclusions
      • We downloaded 16,627 mementos 35 times between November 16, 2017 and October 19, 2018
      • Within about 11 months, we found that 88% of mementos produce different hash values
      • It is hard to get repeatable hashes on the playback of archived web pages because of transient errors, dynamic URI-Ms, and instability of indexes in archives
      • We need an archive-aware hashing function to produce repeatable hashes
      • For more information: https://www.cs.odu.edu/~maturban/fixity.html