PhD Dissertation Defense for:
Mohamed Aturban
Advisor:
Michele C. Weigle
Committee Members:
Michele C. Weigle, Michael L. Nelson, Jian Wu,
Sampath Jayarathna, and M'Hammed Abdous
A Framework for Verifying the Fixity
of Archived Web Resources
Department of Computer Science
Norfolk, Virginia 23529 USA
July 23, 2020
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
@maturban1 • @WebSciDL
PhD Defense: Mohamed Aturban
July 23, 2020
This is what www.cnn.com looks like today
The Internet Archive (IA) allows
us to view previous versions
(mementos) of that page
• IA is the world’s largest public
web archive
• It holds hundreds of billions of
archived web pages
https://web.archive.org/web/20130401000000*/http://www.cnn.com/
The CNN archived page from May 30, 2013
• Replaying this memento in 2018
• There was a thunderstorm in Atlanta, GA on May 30, 2013
When reloading (#1) the memento in the browser,
the weather icon changed to “cloudy”
When reloading (#2) the memento in the browser,
the weather icon changed to “partly sunny”
When reloading (#3) the memento in the browser,
the weather icon changed to “partly sunny”
Replaying the same memento multiple times
does not always produce the same results!
• The changes in the playback of this memento are caused by JavaScript (JS) being executed on the client side (e.g., in the browser)
• In this example, each time JS is executed, it randomly loads one of the weather icons
Textbooks vs. archived pages
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR4FM1VszineUIBCFEQchQTnaZWwKJE7BoUU1u1h3fmrbLdpWl8
[Figure: a book in a library compared with replayed mementos]
This is what climate.nasa.gov/vital-signs/carbon-dioxide/
looks like today
This is what it looked like in July 2018
https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/
A memento created by
the Internet Archive in
July 2018. It is replayed
now (2019).
The page in other web archives
Number of mementos in each archive:
web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide (4,870 mementos)
archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/ (13 mementos)
wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (91 mementos)
perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/ (4 mementos)
arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide (3 mementos)
webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide (5 mementos)
Typical archive URI construction:
archive.example.org/archive-collection/climate.nasa.gov/vital-signs/carbon-dioxide
For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
What if we checked these archives?
What if they all agree?
Would you trust the results?
breitbart.com/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/
infowars.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
MichaelsEvilWayback.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/
InternetResearchAgency.ru/climate.nasa.gov/vital-signs/carbon-dioxide/
The web page was archived in July 2017 by
Michael’sEvilWayback
Which one is the real memento?
Replayed in August 2017 vs. replayed in October 2017
It is important to verify fixity of archived resources
Evidentiary purposes in court cases
• Marten Transport v. PlatForm Advertising
• Telewizja Polska USA, Inc. v. Echostar Satellite Corp
• St. Luke’s Cataract & Laser Institute v. James C. Sanderson
• https://www.bloomberglaw.com/public/desktop/document/Marten_Transp_Ltd_v_PlattForm_Adver_Inc_No_142464JWL_2016_BL_1371?1462657373
• https://casetext.com/case/telewizja-polska-usa-4
• https://caselaw.findlaw.com/us-11th-circuit/1351498.html
• https://web.stanford.edu/~gentzkow/research/fakenews.pdf
• https://www.nytimes.com/2016/12/05/us/politics/-michael-flynn-trump-fake-news-clinton.html
• https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17
• https://www.newyorker.com/magazine/2015/01/26/cobweb
• https://www.datarefuge.org
• http://eotarchive.cdlib.org
Preserving fake news and important news articles
• H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,”
Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, 2017.
• M. Rosenberg, “Trump Adviser Has Pushed Clinton Conspiracy Theories,” The New York Times, 2016
Providing information about certain incidents or crimes
• A. Bright, “Web evidence points to pro-Russia rebels in downing of MH17,”
The Christian Science Monitor, 2014
Preserving federal and government data
• The Data Refuge project is an attempt to preserve federal climate and environmental data
• The End of Term Web Archive preserves U.S. Government websites around every new presidential
election
A disclaimer from the Internet Archive stating that the archive
is not responsible for the reliability of the archived resources
https://archive.org/about/terms.php
Web pages change on the live web
[Figure: timeline showing the live web page changing between May 2016, April 2017, and April 2018]
Archives make copies of web pages
[Figure: timeline showing the archive making copies of the live web page in May 2016, April 2017, and April 2018]
Do archived pages change?
[Figure: timeline with rows for the live web, the archive, and replay, marked at May 2016, April 2017, and April 2018]
When replaying the archived page at different
points in time, will we get the same content?
Our study shows that we are not always
presented with the same archived content!
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
Research questions
RQ1: Can we identify and quantify the types of changes on the playback
of mementos that prevent generating repeatable fixity information?
RQ2: Given the types of changes identified in the playback of mementos,
what steps/guidelines should we follow in order to generate repeatable
fixity information (defining an archive-aware fixity-based approach)?
RQ3: How can we store and retrieve fixity information independently from
the web archives from which the associated mementos are preserved?
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
Generating cryptographic hash values (fixity information)
• Common hash algorithms (e.g., MD5, SHA256):
A small change in the input → a large change in the output
SHA256 of the original input:
9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
SHA256 of the slightly changed input:
5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
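The effect described above can be reproduced with a minimal Python sketch. The two input strings are hypothetical (the inputs used on the slide are not recoverable from the transcript), and `differing_bits` is a helper introduced here only for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA256 digest of a string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical inputs that differ in a single character.
original = "The CO2 level is 410 ppm"
modified = "The CO2 level is 411 ppm"

h1, h2 = sha256_hex(original), sha256_hex(modified)
print(h1)
print(h2)
print(differing_bits(h1, h2), "of 256 bits differ")
```

For a cryptographic hash, roughly half of the output bits are expected to flip, which is why such hashes can tell us that content changed but not by how much.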
SimHash:
A small change in the input → a small change in the output
'Klein et al. conducted a study on over one
million references from scientific articles
and found that 20% articles suffers from
Reference Rot, referring to links to web
resources that no longer exist or that have
significantly modified content.'
SimHash
668c8cccd966a785
https://github.com/leonsim/simhash
'Klein et al. conducted a study on over one
million references from scientific articles
and found that 30% articles suffers from
Reference Rot, referring to links to web
resources that no longer exist or that have
significantly modified content.'
SimHash
668c8cced966a785
M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin, “Scholarly context not found: One in five articles suffers from reference rot,” PloS one, vol. 9, no. 12, 2014. e115253.
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
• We can use SimHash to compare text-based files and pHash to compare images
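To make the contrast with SHA256 concrete, here is a simplified SimHash sketch in Python. It is not the implementation behind the linked simhash library: the tokenization (lowercased words) and the per-token hash (the first 8 bytes of MD5) are assumptions made for illustration, so it will not reproduce the fingerprints shown on the slide. The `hamming_distance` helper, however, can be applied directly to those two fingerprints.

```python
import hashlib
import re

BITS = 64  # fingerprint width used in this sketch


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over the words of a text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: hash each token with the first 8 bytes of MD5.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# The two fingerprints shown on the slide differ in a single bit.
print(hamming_distance(0x668C8CCCD966A785, 0x668C8CCED966A785))  # 1
```

A small Hamming distance between two fingerprints suggests that the underlying texts are similar, which is the property a cryptographic hash deliberately lacks.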
An example of a binary hash tree (or Merkle tree)
https://brilliant.org/wiki/merkle-tree/
• A leaf node = the hash of a block of data
• A non-leaf node = the hash of its children
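The two rules above can be sketched in a few lines of Python. Duplicating the last node when a level has an odd number of nodes is an assumption of this sketch (one common convention), not something the slides specify.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list[bytes]) -> str:
    """Return the hex root of a binary hash tree built over the blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    # Leaf node = hash of a block of data.
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: pad odd levels by duplication
        # Non-leaf node = hash of its two children.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


print(merkle_root([b"base HTML", b"image", b"CSS", b"JavaScript"]))
```

Changing any single block changes the root, so one root value can stand in for many resources.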
Generate hashes on a web page
• Compute a hash value on the downloaded HTML content
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
(curl downloads the page; shasum computes the SHA256 hash)
Verifying the fixity of a web page
• Compare the current hash with the previously calculated hash (the fixity information)
Hashes are NOT identical → the page has changed!
August 2017: HTML content is downloaded
SHA256 hash: e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
September 2017: The archived page has been tampered with by changing the value of CO2
October 2017: HTML content is downloaded
SHA256 hash: fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
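The comparison step can be sketched as follows. The page contents and CO2 values are hypothetical stand-ins, and `compute_fixity` and `verify_fixity` are names introduced here for illustration.

```python
import hashlib


def compute_fixity(content: bytes) -> str:
    """Fixity information: the SHA256 hash of the downloaded content."""
    return hashlib.sha256(content).hexdigest()


def verify_fixity(content: bytes, recorded_hash: str) -> bool:
    """True if the current content still matches the recorded hash."""
    return compute_fixity(content) == recorded_hash.lower()


# Hypothetical HTML as downloaded on two dates.
august_html = b"<html><body>CO2: 406 ppm</body></html>"
october_html = b"<html><body>CO2: 306 ppm</body></html>"

recorded = compute_fixity(august_html)       # stored in August
print(verify_fixity(august_html, recorded))  # True
print(verify_fixity(october_html, recorded)) # False
```

This only works if the recorded hash itself can be trusted, which is why storing it independently of the archive matters.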
- Users of web archives do not have the ability to easily verify
the fixity of mementos.
- Most web archives do not allow accessing fixity information
- Even if fixity information is available, it is not from an
independent archive or service.
What if an image has changed?
• Computing hashes on only HTML content will NOT detect changes
Potential solution: include all resources in hash calculation
A composite memento consists of:
• 201 images
• 19 JavaScript files
• 3 CSS files
• Base HTML file
All of these resources are combined into a single aggregated hash value.
It turns out that it is hard to get repeatable hashes on composite mementos.
• www.gwern.net/Timestamping (Existing tools for generating a hash value on a composite archived page)
• https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/
• https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
• http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
Archives add banners
• To convey information like the number of mementos and inform users that
what they are viewing is from the archive
• Banners change → different hashes
Replayed in 2016 (43 mementos) vs. replayed in 2017 (49 mementos)
http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk
Archives transform original content to appropriately
replay mementos in a user’s browser
• Add banners
• Rewrite links to point to the archive, not to
the live web
• Add HTML tags to convey metadata
• Archives use one of the Wayback Machine’s
implementations to replay mementos
• https://archive.org/web/
• https://github.com/iipc/openwayback/wiki
• https://github.com/ikreymer/ (PyWb)
Rewriting original content by archives’ replay tools
An image
A CSS file
The page is captured by the Internet Archive:
https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
Original content vs. rewritten content
https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
Example of a rewritten link:
Original: https://www.odu.edu/images/logo-university.png
Rewritten: https://web.archive.org/web/20190725212938im_/https://www.odu.edu/images/logo-university.png
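The rewritten link in this example follows a simple pattern: archive prefix, 14-digit timestamp, an optional modifier such as im_, then the original URI. A minimal sketch of that pattern, assuming the Wayback-style layout seen in the example (real replay tools do considerably more than this), with `rewrite_link` as a hypothetical helper name:

```python
def rewrite_link(original_uri: str, timestamp: str, modifier: str = "",
                 archive_prefix: str = "https://web.archive.org/web/") -> str:
    """Build a Wayback-style URI-M that points to the archive, not the live web."""
    return f"{archive_prefix}{timestamp}{modifier}/{original_uri}"


print(rewrite_link("https://www.odu.edu/images/logo-university.png",
                   "20190725212938", "im_"))
```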
[Figure: the rewritten content annotated to show the added banner, the rewritten links, and the added HTML tags]
Raw Mementos
• Many archives allow accessing original, or raw, archived content
• E.g., using the option id_ after the timestamp:
https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html
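Converting a rewritten URI-M into its raw form can be sketched with a regular expression. This assumes a Wayback-style URI-M with a 14-digit timestamp, optionally followed by a two-letter modifier such as im_; `to_raw_urim` is a name introduced for this sketch, and not every archive supports id_.

```python
import re

# A 14-digit timestamp path segment, optionally followed by a modifier (e.g., im_).
_TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def to_raw_urim(urim: str) -> str:
    """Insert (or substitute) the id_ option after the timestamp of a URI-M."""
    return _TIMESTAMP.sub(lambda m: f"/{m.group(1)}id_/", urim, count=1)


print(to_raw_urim(
    "https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html"
))
```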
We need an archive-aware hashing
function suitable for mementos
[Figure: JavaScript, transformation of content, and security threats (e.g., Michael’sEvilWayback) stand between an archive and a repeatable hash value]
TimeMap
[Figure: timeline of the live web and the archive at May 2016, April 2017, and April 2018]
• Defined by the Memento framework (an Internet RFC)
• The framework defines a TimeMap for an Original Resource “as a resource from which a list of URIs of Mementos of the Original Resource is available.”
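A TimeMap is commonly serialized in link format, with one entry per memento carrying a `rel` value containing "memento" and a `datetime` attribute. The sketch below parses such a document with a regular expression. It assumes `rel` appears before `datetime` in each entry, and the sample TimeMap and its archive.example.org URIs are made up for illustration.

```python
import re
from email.utils import parsedate_to_datetime

# Matches <URI>; rel="... memento ..."; datetime="..." within a single entry.
_MEMENTO = re.compile(
    r'<([^>]+)>\s*;[^<]*?rel="[^"]*\bmemento\b[^"]*"[^<]*?datetime="([^"]+)"'
)


def parse_timemap(link_format: str):
    """Return a list of (datetime, URI-M) pairs found in a link-format TimeMap."""
    return [(parsedate_to_datetime(dt), uri) for uri, dt in _MEMENTO.findall(link_format)]


# Hypothetical TimeMap with two mementos.
SAMPLE = (
    '<https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="original",\n'
    '<https://archive.example.org/web/20160501000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; '
    'rel="first memento"; datetime="Sun, 01 May 2016 00:00:00 GMT",\n'
    '<https://archive.example.org/web/20180401000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; '
    'rel="last memento"; datetime="Sun, 01 Apr 2018 00:00:00 GMT"'
)

mementos = parse_timemap(SAMPLE)
for when, urim in mementos:
    print(when.date(), urim)
```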
https://climate.nasa.gov/vital-signs/carbon-dioxide/
The TimeMap of the resource climate.nasa.gov/vital-signs/carbon-dioxide/ has three
mementos (in this illustration: May 2016, April 2017, and April 2018)
Memento aggregators
• Aggregate TimeMaps, of an Original Resource, from multiple archives into a
single TimeMap
• LANL Memento aggregator: http://mementoweb.org/depot/
• MemGator: https://github.com/oduwsdl/MemGator
Downloading the TimeMap of climate.nasa.gov/vital-signs/carbon-dioxide/ using MemGator
http://timetravel.mementoweb.org/timemap/link/climate.nasa.gov/vital-signs/carbon-dioxide/
Number of mementos in each archive:
web.archive.org (4,870 mementos)
archive.is (13 mementos)
wayback.archive-it.org (91 mementos)
perma-archives.org (4 mementos)
arquivo.pt (3 mementos)
webarchive.loc.gov (5 mementos)
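Per-archive counts like these can be derived from an aggregated TimeMap by grouping URI-Ms by host. A minimal sketch, using a short hypothetical list of URI-Ms in place of a real aggregated TimeMap:

```python
from collections import Counter
from urllib.parse import urlsplit


def mementos_per_archive(urims):
    """Count URI-Ms by the host name of the archive that serves them."""
    return Counter(urlsplit(urim).netloc for urim in urims)


# Hypothetical URI-Ms as they might appear in an aggregated TimeMap.
urims = [
    "https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/",
    "https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/",
    "http://arquivo.pt/wayback/20170101000000/https://climate.nasa.gov/vital-signs/carbon-dioxide",
]

for archive, count in mementos_per_archive(urims).most_common():
    print(archive, count)
```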
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
Sampling of Related Work
Categories shown in the figure: Trusted timestamping; Security; Standards and other systems; Identity derived from content
• TRAC (2007): Establishing trusted archives
- TRAC is not for playback
• Lerner et al. (2017): Vulnerabilities
- Discovered four vulnerabilities in the Internet Archive’s Wayback Machine
• J. Cushman et al. (2017): More potential threats
- Demonstrated potential threats in web archives
• Rosenthal et al. (2005): Threats
- Described several threats against digital preservation systems
• Juan Benet (2017): Multihash
- Self-identifying hashes for IPFS
• OriginStamp, Gipp (2015, 2016): Trusted timestamps in Blockchain
- Not suitable for composite mementos
• T. Kuhn et al. (2014): Trusty URI
- A URI that contains a hash value of the content it identifies
• P. Maniatis et al. (2005): Distributed copies of archived resources (LOCKSS)
- The scope and content are narrowly defined
• opentimestamps.org/ (2017): OpenTimestamps
- Not suitable for composite mementos
• chainpoint.org (2017): Chainpoint
- Not suitable for composite mementos
• Collomosse et al. (2018): ARCHANGEL
- For mementos, but not suitable for composite mementos
• Hamano et al. (2005): Git, distributed version control
- Uses hash values to create commit identifiers
• Web archives, such as webcitation.org and archive.is, use hash values in URIs to identify mementos
• Brunelle (2010): Live web leakage in archives
- Describes how live web leakage changes the representation of mementos
• Rosenthal et al. (2005): Requirements for establishing trusted digital preservation systems
- Not for playback
• OAIS (2012): Reference Model for an Open Archival Information System (OAIS)
- Not for playback
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
RQ1: Can we identify and quantify the types of changes on the playback
of mementos that prevent generating repeatable fixity information?
Identifying and quantifying changes on the playback of mementos
1. Collect a dataset of mementos
2. Download rewritten/raw composite mementos (repeated 39 times)
3. Generate aggregated hash values
4. Identify changes
5. Present results
M. Aturban, M. L. Nelson, and M. C. Weigle, “It is hard to compute fixity on archived web
pages,” in Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held
in conjunction with the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018.
Collecting 16,627 Mementos
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H.
Van de Sompel, “Collecting 16K archived web pages
from 17 public web archives,” Tech. Rep.
arXiv:1905.03836, May 2019.
Sources of URI-Rs:
• The HTTP Archive: httparchive.org
• The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/
• Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
• The Moz Top 500 Websites: moz.com/top500
Downloading mementos using
Headless Chrome
http://web.archive.org/web/19961120150251/http://www.usnews.com:80/
https://github.com/N0taN3rd/Squidwarc
[Figure: Squidwarc replays the memento in Headless Chrome and saves the result to rewritten.warc]
Extract all URI-Ms from rewritten.warc by reading WARC records using the tool warcio
https://github.com/webrecorder/warcio
Requesting the raw mementos using id_ (stored in raw.warc)
✓ = 200 OK (or archival 4xx/5xx)
x = Archive-specific resources
x = 3xx Redirect
[Figure: list of the extracted URI-Ms, each marked ✓ or x]
Generate the aggregated hash with Merkle trees
Identifying types of changes on the playback
of mementos
Set:
One or more resources in the set comprising a composite memento has changed
Status code:
The HTTP status code of one or more resources comprising a composite memento
has changed
HTTP Headers:
One or more HTTP response headers, that we do not expect to change, has changed
Representation:
The returned HTTP entity body of one or more resources comprising a composite
memento has changed
URI-M:
One or more resources in the set comprising a composite memento redirects to a
different memento with a different Memento-DateTime
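The five change types can be expressed as a comparison between two downloads of the same composite memento. The sketch below is an assumption about how such a comparison might be structured, not the tooling used in the study. The `Resource` fields, including `final_urim` for the URI-M reached after redirects, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    status: int
    headers: tuple = ()     # (name, value) pairs that are expected to be stable
    entity_hash: str = ""   # hash of the HTTP entity body
    final_urim: str = ""    # URI-M reached after following redirects


def classify_changes(before: dict, after: dict) -> set:
    """Compare two downloads (URI-M -> Resource) and name the change types seen."""
    changes = set()
    if before.keys() != after.keys():
        changes.add("Set")
    for urim in before.keys() & after.keys():
        old, new = before[urim], after[urim]
        if old.status != new.status:
            changes.add("Status code")
        if old.headers != new.headers:
            changes.add("HTTP Headers")
        if old.entity_hash != new.entity_hash:
            changes.add("Representation")
        if old.final_urim != new.final_urim:
            changes.add("URI-M")
    return changes


# Hypothetical example: an embedded image goes from 404 to 200 between downloads.
d1 = {"urim-a": Resource(200, entity_hash="h1"), "urim-img": Resource(404)}
d2 = {"urim-a": Resource(200, entity_hash="h1"), "urim-img": Resource(200, entity_hash="h2")}
print(classify_changes(d1, d2))
```

A single event can trigger several types at once; here the status change also brings a new entity body.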
Set:
One or more resources in the set comprising a
composite memento has changed
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
[Figure: screenshots of this memento after Reload #1, Reload #2, and Reload #3]
A resource selected randomly by JavaScript
https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
function random_imglink(){
myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
var ry=Math.floor(Math.random(1)*myimages.length)
if (ry==0)
ry=1
document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
}
Status code:
The HTTP status code of one or more resources
comprising a composite memento has changed
https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/
https://web.archive.org/web/20141209193553im_/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
404 → 200
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://web.archive.org/save/_embed/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
WARC-Date: 2017-11-18T10:33:14Z
…
HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 10:32:51 GMT
Content-Type: image/jpeg
Content-Location: /web/20171118103250/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
Observations change archives
Headers:
One or more HTTP response headers, that we do
not expect to change, has changed
https://web.archive.org/web/20071111211818/http://images.sohu.com:80/chat_online/market/sohu/140140-1.html
[Figure: HTTP response headers of this memento when replayed in 2017 vs. replayed in 2018]
Representation:
The returned HTTP entity body of one or more resources
comprising a composite memento has changed
curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:45 GMT
<a href="/cdn-cgi/l/email-
protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a464
54d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1
207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749” …
curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
Date: Tue, 15 May 2018 21:00:50 GMT
<a href="/cdn-cgi/l/email-
protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a060
50d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5
247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709” …
Even when requesting the raw version, a third-party service (Cloudflare) modifies the HTML on each request
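The two fragments above look unrelated, yet they may encode the same text. The sketch below assumes the XOR scheme commonly described for Cloudflare's email obfuscation, in which the first byte is a key and every following byte is XORed with it. That scheme is an assumption here, not something stated on the slides. Only the first ten bytes of each fragment are decoded, because the full strings are truncated in the transcript.

```python
def decode_cf_fragment(fragment_hex: str) -> str:
    """Decode a hex fragment whose first byte is the XOR key (assumed scheme)."""
    data = bytes.fromhex(fragment_hex)
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode("utf-8", errors="replace")


# Leading bytes of the fragments returned by the two requests shown above.
first_response = "28175b5d4a424d4b5c15"
second_response = "68571b1d0a020d0b1c55"

print(decode_cf_fragment(first_response))   # ?subject=
print(decode_cf_fragment(second_response))  # ?subject=
```

Under this assumption, the same underlying text is re-encoded with a different key on each request, so the entity body (and therefore its hash) differs every time even though nothing meaningful changed.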
URI-M:
One or more resources in the set comprising a composite memento
redirects to a different memento with a different Memento-Datetime
[Figure: timeline of the live web, the archive, and replay at May 2016, April 2017, and April 2018; a request for a memento that is not available (X) results in a 302 redirect to a different memento]
https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
URI-M1 = web.archive.org/web/20110116134258id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
URI-M2 = web.archive.org/web/20120121090532id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
December 2017: Requesting URI-M1 returned URI-M1
March 2018: Requesting URI-M1 resulted in a 302 redirect to URI-M2, because URI-M1 was NOT available
URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
December 12, 2017: Requesting URI-M1 returned URI-M1
December 25, 2017: Requesting URI-M1 resulted in a 302 redirect to URI-M2 (a different image), because URI-M1 was NOT available
88.45% of 16,627 mementos produce at
least two different hashes
One in eight mementos (11.55%) always produces the
same hash, and one in six mementos (16.06%) produces
a different hash on each replay
Blue = 11.55% (1,920 mementos)
Red = 16.06% (2,670 mementos)
The types of changes affecting mementos after each download
Migrated and missing mementos (11.91%)
Our blog posts about the movements of mementos:
• https://ws-dl.blogspot.com/2019/08/2019-08-30-where-did-archive-go-part1.html
• https://ws-dl.blogspot.com/2019/09/2019-09-10-where-did-archive-go-part-2.html
• https://ws-dl.blogspot.com/2019/09/2019-09-25-where-did-archive-go-part-3.html
• https://ws-dl.blogspot.com/2019/10/2019-10-21-where-did-archive-go-part-4.html
Because most mementos produce multiple aggregated
hash values over time, we introduce two additional
hashing techniques
• URI-M-based hashing technique
Only URI-Ms of mementos comprising a composite
memento are used in the hash calculation
• Entity-based hashing technique
Only HTTP entity bodies of mementos comprising a composite
memento are used in the hash calculation
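The difference between the three techniques can be sketched as three functions that feed different fields into the hash. For brevity this sketch aggregates with a sorted concatenation rather than the Merkle tree mentioned earlier, and the dictionary keys and example URI-Ms are hypothetical.

```python
import hashlib


def _sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _aggregate(parts) -> str:
    # Sort so that the result does not depend on download order.
    return _sha256_hex("\n".join(sorted(parts)).encode("utf-8"))


def urim_based_hash(resources) -> str:
    """Only the URI-Ms are used in the hash calculation."""
    return _aggregate(r["uri-m"] for r in resources)


def entity_based_hash(resources) -> str:
    """Only the HTTP entity bodies are used in the hash calculation."""
    return _aggregate(_sha256_hex(r["entity"]) for r in resources)


def complete_hash(resources) -> str:
    """URI-M, status code, selected headers, and entity body are all used."""
    return _aggregate(
        "|".join([r["uri-m"], str(r["status"]),
                  repr(sorted(r["headers"].items())), _sha256_hex(r["entity"])])
        for r in resources
    )


# Hypothetical downloads: the same image is served from a different URI-M.
download1 = [{"uri-m": "https://archive.example.org/web/20110116134258id_/http://example.com/a.png",
              "status": 200, "headers": {"Content-Type": "image/png"}, "entity": b"PNG"}]
download2 = [{"uri-m": "https://archive.example.org/web/20120121090532id_/http://example.com/a.png",
              "status": 200, "headers": {"Content-Type": "image/png"}, "entity": b"PNG"}]

print(entity_based_hash(download1) == entity_based_hash(download2))  # True
print(urim_based_hash(download1) == urim_based_hash(download2))      # False
```

Separating the hashes this way shows which aspect of a composite memento changed between two downloads.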
[Figures: the parts of each resource that are included (✓) or excluded (x) under complete hashing, URI-M-based hashing, and entity-based hashing]
Complete hashing on mementos from archive.org
[Figure: expected results compared with the new hash values calculated in each download (median = 871 hash values)]
Only 47% of the total number of hash values are seen in Download 1
URI-M-based hashing on mementos from archive.org
[Figure: new URI-Ms requested in each download (median = 806 URI-Ms)]
Only 50% of the total number of URI-Ms are requested in Download 1
Entity-based hashing on mementos from archive.org
[Figure: new entity bodies observed in each download (median = 116 entity bodies)]
About 80% of the total number of entity bodies are seen in Download 1
RQ2: Given the types of changes identified in the playback of mementos,
what steps/requirements should we consider in order to generate
repeatable fixity information (defining an archive-aware fixity-based
approach)?
Guidelines for generating fixity information
on the playback of mementos
• We define these guidelines based on results from our 14-month study
Outline
Introduction and Motivation
Research Questions
Background
Sampling of Related work
Changes in the Playback of Mementos
The Fixity Information Dissemination Framework
Contributions, Future Work, and Conclusions
RQ3: How can we store and retrieve fixity information independently from the
web archives from which the associated mementos are preserved?
M. Aturban, S. Alam, M. L. Nelson, and M. C. Weigle, “Archive Assisted Archival
Fixity Verification Framework,” in Proceedings of the 19th ACM/IEEE Joint
Conference on Digital Libraries (JCDL), 2019.
Two approaches for disseminating and verifying
the fixity of archived web pages
(using web archives to monitor web archives)
• The Atomic approach
• Generate a manifest file (a JSON file containing the fixity information) for each
memento
• Publish the manifest at a well-known location
• Disseminate the published manifest to several archives
• The Block approach
• Batch together fixity information of multiple mementos in a single binary-searchable file (or block)
• Publish the block at a well-known location
• Disseminate the published block to several archives
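To illustrate what "binary-searchable" buys us, the sketch below stores one sorted line per memento and looks a record up with a binary search. The line layout ("URI-M, space, hash") and the function names are assumptions for illustration; the actual block format is not shown in this part of the slides.

```python
import bisect


def build_block(records: dict) -> list:
    """Build a block: one 'URI-M hash' line per memento, sorted for binary search."""
    return sorted(f"{urim} {fixity}" for urim, fixity in records.items())


def lookup(block: list, urim: str):
    """Binary-search the block for a URI-M; return its hash or None."""
    prefix = urim + " "  # assumes URI-Ms contain no spaces
    i = bisect.bisect_left(block, prefix)
    if i < len(block) and block[i].startswith(prefix):
        return block[i][len(prefix):]
    return None


# Hypothetical records (hash values shortened for readability).
block = build_block({
    "https://web.archive.org/web/19970104075414/http://www.un.org:80/": "69ca5a85",
    "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/": "a1b2c3d4",
})
print(lookup(block, "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"))
```

A lookup costs a logarithmic number of comparisons, so a single block can hold fixity information for many mementos without a database.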
{
"@context": "http://oduwsdl.github.io/contexts/fixity",
"created_at": "Wed, 08 Apr 2020 02:22:56 GMT",
"@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20200408022256/d5d...f04/https://web.archive.org/web/19970104075414/
http://www.un.org:80/",
"composite-memento-uri-m-hash": "8bb453d8aa...db5f5fbf6",
"composite-memento-entity-hash": "cdf47062a...3ebe030b0",
"composite-memento-overall-hash": "69ca5a85...b9206930",
"uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/",
"resources": [
{
"http-headers": ["X-Archive-Orig-last-modified", "Content-Type", "X-Archive-Orig-date", "Memento-Datetime"],
"entity-phash": null,
"entity-hash": "ba140a5bede7f10bea0...7514725862eda82a003",
"overall-hash": "3e4133b3766c2a58d6f...23f5f95a206a1ba9878",
"entity-simhash": 9695187482751709335,
"uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/",
"status-code": "200"},
{
"http-headers": ["X-Archive-Orig-last-modified", "X-Archive-Orig-date", "Memento-Datetime", "Content-Type"],
"entity-phash": "87d5798529a75a58",
"entity-hash": "d9305a4da88570700d92...17c0c0f72b9f4e514b9",
"overall-hash": "de002fbe292372199e6...6059044f4695f0dda2c",
"entity-simhash": null,
"uri-m": "https://web.archive.org/web/19970315165323im_/http://www.un.org/homepage.gif",
"status-code": "200"},
{
"http-headers": ["Content-Type", "Memento-Datetime"],
"entity-phash": "219d1a8362a71040",
"entity-hash": "d5fd59c929e1c62b17b...d5e321f64a919e4294e",
"overall-hash": "7df9dde47fa5bab643...cef3c84694bb1db8b1c",
"entity-simhash": null,
"uri-m": "https://web.archive.org/web/20120129120857/http://web.archive.org/screenshot/http://www.un.org/",
"status-code": "200"}
]
}
A manifest example containing fixity information
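A minimal sketch of how a manifest with this shape could be assembled. It assumes SHA-256 for every hash and a simple concatenate-then-hash rule for the composite values; the slides specify neither. The helper names `resource_record` and `build_manifest` are hypothetical, and the perceptual hash and SimHash fields are omitted.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def resource_record(uri_m, status_code, headers, body, selected_headers):
    """Build one entry of the "resources" list (illustrative only)."""
    # Keep only the headers that are expected to stay stable across replays.
    kept = {name: headers[name] for name in selected_headers if name in headers}
    header_bytes = "\n".join(f"{k}: {v}" for k, v in sorted(kept.items())).encode("utf-8")
    return {
        "uri-m": uri_m,
        "status-code": str(status_code),
        "http-headers": sorted(kept),
        # entity-hash covers the entity body only.
        "entity-hash": sha256_hex(body),
        # overall-hash covers the selected headers plus the body (assumed layout).
        "overall-hash": sha256_hex(header_bytes + b"\n\n" + body),
    }


def build_manifest(uri_m, resources):
    """Aggregate per-resource hashes into composite hashes (assumed rule)."""
    entity = "".join(r["entity-hash"] for r in resources)
    overall = "".join(r["overall-hash"] for r in resources)
    return {
        "uri-m": uri_m,
        "composite-memento-entity-hash": sha256_hex(entity.encode("utf-8")),
        "composite-memento-overall-hash": sha256_hex(overall.encode("utf-8")),
        "resources": resources,
    }
```

With this layout, a change in a selected header alters the overall hash but leaves the entity hash untouched, which makes it possible to tell header changes apart from body changes.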
Atomic approach:
Push manifests into multiple archives
• In this example, the memento is in the Internet Archive (IA) and its fixity
information is disseminated to four archives including IA
• An attacker would have to compromise a majority of the four domains (archives)
https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
http://www.webcitation.org/74tvdsyxe
manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
Block approach:
Batch together fixity information of multiple
mementos in a single file (block)
• Adding additional metadata (e.g., created_at, fields, …)
• The hash of the previous block must be added
!context ["http://oduwsdl.github.io/contexts/fixity"]
!fields {keys: ["surt"]}
!id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
!meta {created_at: "20190111181327"}
!meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
!meta {type: "FixityBlock"}
org,archive,web)/web/19961022175434/http://search.com
org,archive,web)/web/19961219082428/http://sho.com
org,archive,web)/web/19961223174001/http://reference.com
…
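A rough sketch of the two properties the block format relies on: each block records the hash of the previous block, and the records are sorted by SURT key so a lookup can use binary search. The record payload (the value after the key) and the function names are hypothetical; only the `!meta` header lines follow the example above.

```python
import bisect
import hashlib


def block_hash(block_text: str) -> str:
    return hashlib.sha256(block_text.encode("utf-8")).hexdigest()


def build_block(records, prev_block_hash, created_at):
    """Serialize (surt_key, value) records, sorted by key, behind a small header."""
    lines = [
        f'!meta {{created_at: "{created_at}"}}',
        f'!meta {{prev_block: "sha256:{prev_block_hash}"}}',
        '!meta {type: "FixityBlock"}',
    ]
    lines += [f"{key} {value}" for key, value in sorted(records)]
    return "\n".join(lines) + "\n"


def lookup(block_text, surt_key):
    """Binary-search the sorted record lines for one SURT key."""
    records = [ln for ln in block_text.splitlines() if ln and not ln.startswith("!")]
    keys = [ln.split(" ", 1)[0] for ln in records]
    i = bisect.bisect_left(keys, surt_key)
    if i < len(keys) and keys[i] == surt_key:
        return records[i].split(" ", 1)[1]
    return None


def verify_chain(blocks):
    """Check that every block names the SHA-256 of the block before it."""
    for prev, cur in zip(blocks, blocks[1:]):
        expected = f'!meta {{prev_block: "sha256:{block_hash(prev)}"}}'
        if expected not in cur.splitlines():
            return False
    return True
```

Because each block commits to its predecessor, altering an already published block breaks the chain for every later block that has been archived elsewhere.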
Block approach:
Push the blocks entrypoint into multiple archives
manifest.ws-dl.cs.odu.edu/blocks
https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01
https://perma.cc/8YG3-X7KN
• Will result in archiving the latest published block as well
Three steps to verify the fixity of a memento
1. Discover a manifest/block
• In the Atomic approach, this includes discovering the archived manifests
2. Compute current fixity information of the memento
3. Compare current fixity information with the discovered manifests/block.
An example of discovering the latest manifest in the Archival Fixity server for the memento https://web.archive.org/web/20171115140705/http://rln.fm/:
$ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
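The discovery and comparison steps can be sketched offline as follows. The URI layout (`/manifest/<14-digit datetime>/<64 hex characters>/<URI-M>`) is read off the redirect shown above; treating it as a fixed pattern is an assumption, as are the function names and the choice of fields to compare.

```python
import re

# Pattern inferred from the example redirect; not an official specification.
TIME_SPECIFIC_URI = re.compile(
    r"^(?P<server>https?://[^/]+)/manifest/"
    r"(?P<created>\d{14})/(?P<digest>[0-9a-f]{64})/(?P<uri_m>.+)$"
)


def generic_manifest_uri(server, uri_m):
    """Step 1: the generic URI that redirects to the latest manifest."""
    return f"{server.rstrip('/')}/manifest/{uri_m}"


def parse_time_specific_uri(uri):
    """Split a time-specific manifest URI into its parts."""
    match = TIME_SPECIFIC_URI.match(uri)
    if match is None:
        raise ValueError(f"not a time-specific manifest URI: {uri}")
    return match.groupdict()


def fixity_matches(current, stored,
                   fields=("composite-memento-entity-hash",
                           "composite-memento-overall-hash")):
    """Step 3: compare freshly computed fixity values with a stored manifest."""
    return all(f in stored and current.get(f) == stored[f] for f in fields)
```

Step 2 (computing the current fixity information) would require fetching the memento and its embedded resources and is left out of this sketch.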
Evaluation
• A data set of 16K mementos from 17 public web archives
• For each memento, we generated and disseminated a manifest to 3 archives
- The median size of a composite memento is 1143.85 KB
- The median size of a manifest file is 15.29 KB, which represents 1.33% of the size of a composite memento
Increasing the number of records per block reduces
the block generation time
The Block approach creates fewer resources
in archives than the Atomic approach
• Given a collection of N = 16,608 mementos
• Katomic = 3 web archives
• Kblock = 2 web archives
• The selected block size B = 1038 records per block
• The total number of resources created in the archives by
each approach:
Atomic: N ∗ Katomic = 16,608 ∗ 3 = 49,824
Block: Kblock ∗ (N/B) = 2 ∗ (16,608/1038) = 2 ∗ 16 = 32
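The two totals can be reproduced with a few lines of Python. The function names are illustrative, and rounding N/B up for collections that are not an exact multiple of the block size is an assumption (here 16,608 is exactly 16 blocks of 1038 records).

```python
def atomic_resources(n_mementos: int, k_archives: int) -> int:
    # One manifest per memento, pushed to every archive.
    return n_mementos * k_archives


def block_resources(n_mementos: int, k_archives: int, block_size: int) -> int:
    # One archived resource per block per archive; a partial last block
    # still counts as a block (ceiling division).
    n_blocks = -(-n_mementos // block_size)
    return k_archives * n_blocks


N = 16_608
print(atomic_resources(N, 3))        # 49824
print(block_resources(N, 2, 1038))   # 32
```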
It takes 1.09x and 4.54x longer to disseminate a manifest to perma.cc and archive.org, respectively, than to archive.is
It takes 9.2x longer to disseminate a block to archive.org than to perma.cc
The Block approach is 1.05x faster than the Atomic approach in verifying the fixity of mementos
[Figures: discovering and downloading manifest files in the Atomic/Block approaches per archive; verifying mementos by both approaches]
Contributions
• RQ1
- Four methods for collecting mementos (arXiv’19)
- Identified and quantified types of changes on the playback of mementos (JCDL/WADL’18)
- Showed examples of missing mementos
• RQ2
- The two hashing techniques (URI-M-based and entity-based)
- The archive-aware hashing function (arXiv’17)
• RQ3
- ArchiveNow, a tool for disseminating web pages in public web archives (JCDL’18)
- A framework for disseminating fixity information to web archives (JCDL’19)
Future Work
Investigating generating fixity information using web packaging
• Web packaging is an emerging standard
• It should allow archives to deliver a composite memento in a single HTTP
response or in a self-contained file
• Using web packaging we can download a composite memento,
packaged in a bundle, with a single HTTP request. This should reduce
playback-related changes, such as transient errors and URI-M changes.
Conclusions
• Conventional hashing techniques are not suitable for replayed archived web
pages.
• We defined an archive-aware hashing function that consists of multiple
guidelines (based on our 14-month study on 16K mementos)
• Fixity information includes
(1) Multiple aggregated hash values generated using different hashing
techniques (URI-M-based and entity-based hashing)
(2) Multiple hash values generated on each resource comprising a
composite memento
• We introduce two approaches for disseminating fixity information to web
archives
The archive-aware hashing function
Appendix
Collecting 16,627 Mementos
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H.
Van de Sompel, “Collecting 16K archived web pages
from 17 public web archives,” Tech. Rep.
arXiv:1905.03836, May 2019.
Step 1: Collect a dataset of mementos
Sources of URI-Rs:
• The HTTP Archive: httparchive.org
• The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/
• Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079)
• The Moz Top 500 Websites: moz.com/top500
An example of a missing memento
• The URI-M http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/ was 200 OK in September 2018
• Mementos from the National Library of Ireland (NLI) collection have been moved from collections.internetmemory.org/nli/ to wayback.archive-it.org/10702/
• The URI-M http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/ is now 404 Not Found
Step 5: Present results
- The heatmap shows archive-level changes by comparing consecutive downloads of mementos
- It visualizes the overall performance of each archive
- It identifies points in time where major changes occur
URI-Rs with different path lengths and URI-Ms with different Memento-Datetime
Step 1: Select a dataset of mementos
[Figures: URI-Ms per year; URI-Rs per path length]
M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed
Step 4: Identify changes
Downloading the ZIP file of a memento at three different times. Each time the archive refers to itself differently in the index.html in the ZIP file.
http://archive.is/download/BRWpm.zip
http://archive.is/BRWpm
Representation (transient errors)
Step 4: Identify changes
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/
http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
The first Content-Length should be bigger than the second one:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-12-07T10:04:18Z
…
Content-Length: 459640
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 07 Dec 2017 10:04:18 GMT
…
This is what it should look like:
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
WARC-Date: 2017-11-16T15:34:37Z
…
Content-Length: 643398
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
Date: Thu, 16 Nov 2017 15:34:36 GMT
…
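The check described on this slide can be sketched as a small script: read the two Content-Length values of a record (the WARC record length first, the HTTP entity length second) and flag the record when the first is not larger than the second. This is an illustrative heuristic over the text of the headers, not a WARC parser; the function names are hypothetical.

```python
import re

CONTENT_LENGTH = re.compile(r"^Content-Length:[ \t]*(\d+)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def content_lengths(record_text):
    """Return every Content-Length value in the order it appears."""
    return [int(value) for value in CONTENT_LENGTH.findall(record_text)]


def looks_incomplete(record_text):
    """True when the WARC-level length is not larger than the HTTP entity length."""
    lengths = content_lengths(record_text)
    if len(lengths) < 2:
        raise ValueError("expected a WARC Content-Length and an HTTP Content-Length")
    warc_length, http_length = lengths[0], lengths[1]
    return warc_length <= http_length


bad = "WARC/1.0\nWARC-Type: response\nContent-Length: 459640\nHTTP/1.0 200\nContent-Length: 642336\n"
good = "WARC/1.0\nWARC-Type: response\nContent-Length: 643398\nHTTP/1.0 200\nContent-Length: 642336\n"
print(looks_incomplete(bad), looks_incomplete(good))  # True False
```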
Archives react differently to requests for raw mementos
Step 4: Identify changes
The Block approach performs 4.46x faster than the
Atomic approach in verifying the fixity of mementos
• The fixity verification time includes:
- Discovering manifests
- Computing current fixity information
- Downloading the archived manifests
- Comparing results
• On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach
@maturban1 • August 22, 2019
A Framework for Verifying the Fixity
of Archived Web Resources
{
"@context": "http://manifest.ws-dl.cs.odu.edu/",
"created": "Sun, 23 Dec 2018 11:43:55 GMT",
"@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
"uri-r": "https://2019.jcdl.org/",
"uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
"memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
"http-headers": {
"Content-Type": "text/html; charset=UTF-8",
"X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
"X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
"Preference-Applied": "original-links, original-content" },
"hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\nsha256') - | paste -d' ' - -",
"hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
}
A manifest file example
• Including how hashes are computed
• Hashes are computed on only base HTML file
• Compute fixity on things that should not change like certain original HTTP response headers
Memento framework
• Uses time as a dimension to access the web by relating current web
resources to their prior states
• Is supported by most public web archives including the Internet
Archive
http://mementoweb.org/guide/quick-intro/
URI-M-based hashing on mementos from archive-it.org
[Figure: expected results vs. actual results]
New URI-Ms are requested in each download
Atomic approach (step 1):
Push a web page into multiple archives
https://archive.is/20181224085310/https://2019.jcdl.org/
https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
http://www.webcitation.org/74tsy6pU0
https://2019.jcdl.org/
This results in creating multiple mementos of the web page
Archive Assisted Archival Fixity Verification Framework ∙ JCDL 2019 ∙ June 4, 2019 ∙ Urbana-Champaign, Illinois
Atomic approach (steps 2 & 3):
For each memento, compute fixity “manifest” and publish it on the
web at a well-known location (Archival Fixity server)
manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/
manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0
• In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server
• Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
The manifest’s generic URI always redirects to the
most recent time-specific manifest version (trusty URI)
$ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
HTTP/2 302
location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
HTTP/2 200
• The URI passed to curl is the manifest’s generic URI; the location value is the manifest’s trusty URI
• The structure of generic URIs is easy to remember: <Archival-Fixity-Server>/<URI to memento>
So they can be used to look up manifests in both the Archival Fixity server and archives
Block approach (step 2):
Publish the block file at the Archival Fixity server
manifest.ws-dl.cs.odu.edu/blocks
The blocks entrypoint always redirects to the latest published block
The dissemination/download time varies
from one archive to another
Links
• Animated GIFs: https://github.com/oduwsdl/mementos-fixity/tree/master/hashing_techniques
Libyan food
Tajine
Couscous
https://cookingthyme.wordpress.com/2014/07/19/tajeen-jban-cheese-tagine-%D8%B7%D8%A7%D8%AC%D9%8A%D9%86-%D8%AC%D8%A8%D9%86/
https://www.daringgourmet.com/kusksu-libyan-couscous-with-spicy-beef-and-vegetables/
Baklawa
https://www.pinterest.ie/pin/612067405578694390/

SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
 
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...Manganese‐RichSandstonesasanIndicatorofAncientOxic  LakeWaterConditionsinGale...
Manganese‐RichSandstonesasanIndicatorofAncientOxic LakeWaterConditionsinGale...
 

A Framework for Verifying the Fixity of Archived Web Resources

  • 1. PhD Dissertation Defense for: Mohamed Aturban. Advisor: Michele C. Weigle. Committee Members: Michele C. Weigle, Michael L. Nelson, Jian Wu, Sampath Jayarathna, and M'Hammed Abdous. A Framework for Verifying the Fixity of Archived Web Resources. Department of Computer Science, Norfolk, Virginia 23529 USA. July 23, 2020
  • 2. Outline Introduction and Motivation Research Questions Background Sampling of Related work Changes in the Playback of Mementos The Fixity Information Dissemination Framework Contributions, Future Work, and Conclusions 2 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  • 3. This is what www.cnn.com looks like today
  • 4. The Internet Archive (IA) allows us to view previous versions (mementos) of that page • IA is the world’s largest public web archive • It holds hundreds of billions of archived web pages • https://web.archive.org/web/20130401000000*/http://www.cnn.com/
  • 5. The CNN archived page from May 30, 2013 • Replaying this memento in 2018 • There was a thunderstorm in Atlanta, GA on May 30, 2013
  • 6. When reloading (#1) the memento in the browser, the weather icon changed to “cloudy”
  • 7. When reloading (#2) the memento in the browser, the weather icon changed to “partly sunny”
  • 8. When reloading (#3) the memento in the browser, the weather icon changed to “partly sunny”
  • 9. Replaying the same memento multiple times does not always produce the same results! • The changes in the playback of this memento are caused by JavaScript (JS) being executed on the client side (e.g., in the browser) • In this example, each time the JS is executed, it randomly loads one of the weather icons
  • 10. Textbooks vs. archived pages: a book in a library vs. replayed mementos (image: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR4FM1VszineUIBCFEQchQTnaZWwKJE7BoUU1u1h3fmrbLdpWl8)
  • 11. This is what climate.nasa.gov/vital-signs/carbon-dioxide/ looks like today
  • 12. This is what it looked like in July 2018: https://web.archive.org/web/20180726025537/https://climate.nasa.gov/vital-signs/carbon-dioxide/ • A memento created by the Internet Archive in July 2018, replayed in 2019
  • 13. The page in other web archives • Typical archive URI construction: archive.example.org/archive-collection/climate.nasa.gov/vital-signs/carbon-dioxide • Number of mementos per archive:
      ⁃ web.archive.org/web/*/climate.nasa.gov/vital-signs/carbon-dioxide: 4,870
      ⁃ archive.is/climate.nasa.gov/vital-signs/carbon-dioxide/: 13
      ⁃ wayback.archive-it.org/all/*/climate.nasa.gov/vital-signs/carbon-dioxide/: 91
      ⁃ perma-archives.org/warc/*/climate.nasa.gov/vital-signs/carbon-dioxide/: 4
      ⁃ arquivo.pt/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide: 3
      ⁃ webarchive.loc.gov/all/*/climate.nasa.gov/vital-signs/carbon-dioxide: 5
      ⁃ For a full list of public web archives, see: http://labs.mementoweb.org/aggregator_config/archivelist.xml
  • 14. What if we checked these archives? What if they all agree? Would you trust the results? • breitbart.com/wayback/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • infowars.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • MichaelsEvilWayback.com/web/*/climate.nasa.gov/vital-signs/carbon-dioxide/ • InternetResearchAgency.ru/climate.nasa.gov/vital-signs/carbon-dioxide/
  • 15. The web page was archived in July 2017 by Michael’sEvilWayback • Replayed in August 2017 vs. replayed in October 2017: which one is the real memento?
  • 16. It is important to verify the fixity of archived resources
      ⁃ Evidentiary purposes in court cases: Marten Transport v. PlattForm Advertising; Telewizja Polska USA, Inc. v. Echostar Satellite Corp; St. Luke’s Cataract & Laser Institute v. James C. Sanderson
      ⁃ Preserving fake news and important news articles: H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–36, 2017; M. Rosenberg, “Trump Adviser Has Pushed Clinton Conspiracy Theories,” The New York Times, 2016
      ⁃ Providing information about certain incidents or crimes: A. Bright, “Web evidence points to pro-Russia rebels in downing of MH17,” The Christian Science Monitor, 2014
      ⁃ Preserving federal and government data: the Data Refuge project is an attempt to preserve federal climate and environmental data; the End of Term Web Archive preserves U.S. Government websites around every new presidential election
      ⁃ Sources:
          https://www.bloomberglaw.com/public/desktop/document/Marten_Transp_Ltd_v_PlattForm_Adver_Inc_No_142464JWL_2016_BL_1371?1462657373
          https://casetext.com/case/telewizja-polska-usa-4
          https://caselaw.findlaw.com/us-11th-circuit/1351498.html
          https://web.stanford.edu/~gentzkow/research/fakenews.pdf
          https://www.nytimes.com/2016/12/05/us/politics/-michael-flynn-trump-fake-news-clinton.html
          https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17
          https://www.newyorker.com/magazine/2015/01/26/cobweb
          https://www.datarefuge.org
          http://eotarchive.cdlib.org
  • 17. A disclaimer from the Internet Archive states that the archive is not responsible for the reliability of the archived resources: https://archive.org/about/terms.php
  • 18. Web pages change on the live web (figure labels: Time; Live Web; May 2016, April 2017, April 2018)
  • 19. Archives make copies of web pages (figure labels: Time; Live Web; Archive; May 2016, April 2017, April 2018)
  • 20. Do archived pages change? (figure labels: Time; Live Web; Archive; Replay; May 2016, April 2017, April 2018)
  • 21. Do archived pages change? • When replaying the archived page at different points in time, will we get the same content?
  • 27. Do archived pages change? • Our study shows that we are not always presented with the same archived content!
  • 31. Research questions
      ⁃ RQ1: Can we identify and quantify the types of changes on the playback of mementos that prevent generating repeatable fixity information?
      ⁃ RQ2: Given the types of changes identified in the playback of mementos, what steps/guidelines should we follow in order to generate repeatable fixity information (defining an archive-aware fixity-based approach)?
      ⁃ RQ3: How can we store and retrieve fixity information independently from the web archives from which the associated mementos are preserved?
  • 33. Generating cryptographic hash values (fixity information) • Common hash algorithms (e.g., MD5, SHA256): a small change in the input → a large change in the output • Example SHA256 values for two slightly different inputs:
      ⁃ 9801 1510 87e1 6d6b ddb9 e6b0 09fd b723 abe5 1fea b548 0914 a130 6325 5ae4 6caa
      ⁃ 5d4d b590 605c 9023 000d 6622 6004 534f e84a 5549 d535 f91e cdf4 4952 5c1a 37cf
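The avalanche behavior described on this slide can be reproduced with Python's standard `hashlib` module. The sketch below is only an illustration: the two input strings are made up and are not the inputs hashed on the slide.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Hypothetical inputs that differ in a single character.
original = "CO2 level: 400 ppm"
altered = "CO2 level: 401 ppm"

h1 = sha256_hex(original)
h2 = sha256_hex(altered)
print(h1)
print(h2)
print(differing_bits(h1, h2), "of 256 bits differ")
```

With a cryptographic hash, roughly half of the 256 output bits are expected to differ even for a one-character change, which is why such hashes can signal that content changed but not how much it changed.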
  • 34. SimHash: a small change in the input → a small change in the output
      ⁃ Input 1: 'Klein et al. conducted a study on over one million references from scientific articles and found that 20% articles suffers from Reference Rot, referring to links to web resources that no longer exist or that have significantly modified content.' → SimHash 668c8cccd966a785
      ⁃ Input 2: 'Klein et al. conducted a study on over one million references from scientific articles and found that 30% articles suffers from Reference Rot, referring to links to web resources that no longer exist or that have significantly modified content.' → SimHash 668c8cced966a785
      ⁃ We can use SimHash to compare text-based files and pHash to compare images
      ⁃ References: https://github.com/leonsim/simhash • http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html • M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin, “Scholarly context not found: One in five articles suffers from reference rot,” PloS one, vol. 9, no. 12, 2014. e115253.
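The two fingerprints on this slide differ in a single hex digit (c vs. e), which corresponds to one bit. The following is a minimal, self-contained SimHash sketch to show where that locality comes from. It is not the implementation from the linked repository: the tokenizer (word tokens) and the per-token hash (truncated MD5) are assumptions chosen for brevity, so its fingerprints will not match the values on the slide.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple SimHash fingerprint over lowercase word tokens."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


text_20 = "found that 20% articles suffers from Reference Rot"
text_30 = "found that 30% articles suffers from Reference Rot"
print(f"{simhash(text_20):016x}")
print(f"{simhash(text_30):016x}")
print("Hamming distance:", hamming(simhash(text_20), simhash(text_30)))

# Distance between the two fingerprints shown on the slide.
print(hamming(0x668C8CCCD966A785, 0x668C8CCED966A785))
```

Because every token contributes only one vote per bit, replacing a single token can flip only those bits whose vote totals were close to zero, so similar texts tend to produce fingerprints with a small Hamming distance.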
  • 35. An example of a binary hash tree (or Merkle tree): https://brilliant.org/wiki/merkle-tree/ • A leaf node = the hash of a block of data • A non-leaf node = the hash of its children
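The two rules on this slide (a leaf is the hash of a data block, a non-leaf is the hash of its children) can be sketched in a few lines of Python. How to handle a level with an odd number of nodes is not stated on the slide; duplicating the last node, as done below, is one common convention and is an assumption here.

```python
import hashlib
from typing import Sequence


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: Sequence[bytes]) -> str:
    """Return the hex Merkle root of a non-empty sequence of data blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    # Leaf nodes: the hash of each block of data.
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: duplicate the last node on odd levels.
            level.append(level[-1])
        # Non-leaf nodes: the hash of the concatenated child hashes.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


blocks = [b"base HTML", b"image 1", b"style.css", b"script.js"]
print(merkle_root(blocks))
# Changing any single block changes the root.
print(merkle_root([b"base HTML", b"image 1 (modified)", b"style.css", b"script.js"]))
```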
  • 36. Generate hashes on a web page • Compute a hash value on the downloaded HTML content (curl downloads the page; shasum computes the SHA256 hash):
        $ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
        17710fd38d908a3cd124510f26adaec67e57e3f1d3aec1209c4ad4efbe2c035d
  • 37. Verifying the fixity of a web page • Compare the current hash with the previously calculated hash (the fixity information)
      ⁃ August 2017: HTML content is downloaded; SHA256 hash = e834 c71a efda 284f e03a 4eed 4e8c b78e a581 537b a888 4aec ec29 bd2d 66cb f521
      ⁃ September 2017: the archived page has been tampered with by changing the value of CO2
      ⁃ October 2017: HTML content is downloaded; SHA256 hash = fc90 88b3 a614 a588 40bd 5387 d93c 16be 824c d2bb b3fa b173 f93f a57d 241a 3790
      ⁃ Hashes are NOT identical → the page has changed!
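The compare-with-a-recorded-hash step can be sketched as follows. This is an illustration only: the function names are hypothetical, the page content in the example is made up, and `fetch` is defined but not called so that the example runs without network access.

```python
import hashlib
from urllib.request import urlopen


def sha256_hex(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def fetch(uri: str) -> bytes:
    """Download the entity body of a URI (not called in this example)."""
    with urlopen(uri, timeout=30) as response:
        return response.read()


def fixity_unchanged(content: bytes, recorded_hash: str) -> bool:
    """Compare the hash of the current content with a previously recorded hash.

    Spaces are removed so that hashes printed in groups of four hex digits,
    as on the slide, can be pasted in directly.
    """
    normalized = recorded_hash.replace(" ", "").lower()
    return sha256_hex(content) == normalized


# Hypothetical page content recorded at an earlier time, then altered later.
earlier = b"<html><body>CO2: 408 ppm</body></html>"
later = b"<html><body>CO2: 308 ppm</body></html>"

recorded = sha256_hex(earlier)
print(fixity_unchanged(earlier, recorded))  # same content as when recorded
print(fixity_unchanged(later, recorded))    # content was altered
```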
  • 38. Verifying the fixity of a web page • Compare the current hash with the previously calculated hash; if the hashes are NOT identical → the page has changed! • Users of web archives do not have the ability to easily verify the fixity of mementos • Most web archives do not allow access to fixity information • Even if fixity information is available, it is not from an independent archive or service
  • 39. What if an image has changed? • Computing hashes on only the HTML content will NOT detect the change
  • 40. Potential solution: include all resources in the hash calculation • A composite memento consists of 201 images, 19 JavaScript files, 3 CSS files, and the base HTML file → a single aggregated hash value • It turns out that it is hard to get repeatable hashes on composite mementos • Existing tools for generating a hash value on a composite archived page: www.gwern.net/Timestamping • https://web.archive.org/web/20170717184643/https://climate.nasa.gov/vital-signs/carbon-dioxide/ • https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html • http://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html
  • 41. Archives add banners • To convey information like the number of mementos and to inform users that what they are viewing is from the archive • Banners change → different hashes • Example: http://webarchive.proni.gov.uk/20150826163149/http://www.ulster.ac.uk replayed in 2016 (43 mementos) and replayed in 2017 (49 mementos)
  • 42. Archives transform original content to appropriately replay mementos in a user’s browser • Add banners • Rewrite links to point to the archive, not to the live web • Add HTML tags to convey metadata • Archives use one of the Wayback Machine’s implementations to replay mementos • https://archive.org/web/ • https://github.com/iipc/openwayback/wiki • PyWb: https://github.com/ikreymer/pywb
  • 43. Rewriting of original content by archives’ replay tools (examples: an image, a CSS file) • The page is captured by the Internet Archive: https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
  • 47. Raw mementos • Many archives allow accessing the original, or raw, archived content • E.g., using the option id_ after the timestamp: https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html (figure labels: Live web; The Archive)
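As a small illustration of the id_ option, the sketch below rewrites a Wayback-style URI-M into its raw form. The function name is hypothetical, and the pattern assumes a 14-digit timestamp optionally followed by a two-letter modifier such as im_; archives that use other URI layouts would need a different pattern.

```python
import re

# A 14-digit timestamp path segment, optionally followed by a modifier like "im_".
_TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def to_raw_urim(urim: str) -> str:
    """Return the raw (id_) form of a Wayback-style URI-M."""
    raw, replaced = _TIMESTAMP.subn(r"/\g<1>id_/", urim, count=1)
    if replaced == 0:
        raise ValueError(f"no 14-digit timestamp found in {urim!r}")
    return raw


print(to_raw_urim(
    "https://web.archive.org/web/20190725212938/"
    "https://maturban.github.io/playground/index.html"
))
```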
  • 48. We need an archive-aware hashing function suitable for mementos • Repeatable hash value? (figure labels: Archive; JavaScript; Transformation; Security; Michael’sEvilWayback)
  • 50. TimeMap • Defined by the Memento framework (an Internet RFC, RFC 7089) • The framework defines a TimeMap for an Original Resource “as a resource from which a list of URIs of Mementos of the Original Resource is available.” • Example: the TimeMap of the resource climate.nasa.gov/vital-signs/carbon-dioxide/ has three mementos (figure labels: Live Web; Archive; Time; May 2016, April 2017, April 2018)
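TimeMaps are commonly serialized in the link format, where each entry carries a URI, a rel value, and, for mementos, a datetime. The sketch below parses such a TimeMap and lists its mementos. The sample TimeMap is hypothetical (the host archive.example.org and the exact timestamps are made up to mirror the three mementos in the figure), and the regular expressions are a simplification rather than a complete link-format parser.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link format; archive.example.org is a made-up host.
SAMPLE_TIMEMAP = """\
<https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="original",
<https://archive.example.org/web/20160501000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="first memento"; datetime="Sun, 01 May 2016 00:00:00 GMT",
<https://archive.example.org/web/20170401000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="memento"; datetime="Sat, 01 Apr 2017 00:00:00 GMT",
<https://archive.example.org/web/20180401000000/https://climate.nasa.gov/vital-signs/carbon-dioxide/>; rel="last memento"; datetime="Sun, 01 Apr 2018 00:00:00 GMT"
"""

LINK = re.compile(r'<([^>]+)>((?:\s*;\s*\w+="[^"]*")*)')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_mementos(timemap: str):
    """Return (URI-M, datetime) pairs for every memento entry, oldest first."""
    mementos = []
    for uri, attr_text in LINK.findall(timemap):
        attrs = dict(ATTR.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return sorted(mementos, key=lambda pair: pair[1])


for urim, when in parse_mementos(SAMPLE_TIMEMAP):
    print(when.date(), urim)
```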
  • 51. Memento aggregators • Aggregate the TimeMaps of an Original Resource from multiple archives into a single TimeMap • LANL Memento aggregator: http://mementoweb.org/depot/ • MemGator: https://github.com/oduwsdl/MemGator
  • 52. Downloading the TimeMap of climate.nasa.gov/vital-signs/carbon-dioxide/ using MemGator • http://timetravel.mementoweb.org/timemap/link/climate.nasa.gov/vital-signs/carbon-dioxide/ • Number of mementos per archive: web.archive.org: 4,870; archive.is: 13; wayback.archive-it.org: 91; perma-archives.org: 4; arquivo.pt: 3; webarchive.loc.gov: 5
  • 54. Sampling of Related Work (grouped on the slide under: Security; Trusted timestamping; Identity derived from content; Standards and other systems)
      ⁃ TRAC (2007): Establishing trusted archives - TRAC is not for playback
      ⁃ Lerner et al. (2017): Vulnerabilities - Discovered four vulnerabilities in the Internet Archive’s Wayback Machine
      ⁃ J. Cushman et al. (2017): More potential threats - Demonstrate potential threats in web archives
      ⁃ Rosenthal et al. (2005): Threats - Described several threats against digital preservation systems
      ⁃ Juan Benet (2017): Multihash - Self-identifying hashes for IPFS
      ⁃ OriginStamp, Gipp (2015, 2016): Trusted timestamps in Blockchain - Not suitable for composite mementos
      ⁃ T. Kuhn et al. (2014): Trusty URI - A URI that contains a hash value of the content it identifies
      ⁃ P. Maniatis et al. (2005): Distributed copies of archived resources (LOCKSS) - The scope and content are narrowly defined
      ⁃ opentimestamps.org (2017): OpenTimestamps - Not suitable for composite mementos
      ⁃ chainpoint.org (2017): Chainpoint - Not suitable for composite mementos
      ⁃ Collomosse et al. (2018): ARCHANGEL - For mementos, but not suitable for composite mementos
      ⁃ Hamano et al. (2005): Git, distributed version control - Uses hash values to create commit identifiers
      ⁃ Web archives such as webcitation.org and archive.is use hash values in URIs to identify mementos
      ⁃ Brunelle (2010): Live web leakage in archives - Describes how live web leakage changes the representation of mementos
      ⁃ Rosenthal et al. (2005): Requirements for establishing trusted digital preservation systems - Not for playback
      ⁃ OAIS (2012): Reference Model for an Open Archival Information System (OAIS) - Not for playback
  • 56. Identifying and quantifying changes on the playback of mementos • RQ1: Can we identify and quantify the types of changes on the playback of mementos that prevent generating repeatable fixity information? • Steps: (1) Collect a dataset of mementos; (2) Download rewritten/raw composite mementos; (3) Generate aggregated hash values; (4) Identify changes; (5) Present results (loop label on the slide: 39 times) • M. Aturban, M. L. Nelson, and M. C. Weigle, “It is hard to compute fixity on archived web pages,” in Proceedings of the Workshop on Web Archiving and Digital Libraries (WADL) held in conjunction with the 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2018.
  • 57. Step 1: Collect a dataset of mementos • Collecting 16,627 mementos • Sources of URI-Rs: the HTTP Archive (httparchive.org); the Web Archives for Historical Research (uwaterloo.ca/web-archive-group/); Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079); the Moz Top 500 Websites (moz.com/top500) • M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
  • 58. Step 2: Download rewritten/raw composite mementos • Downloading mementos using Headless Chrome with Squidwarc (https://github.com/N0taN3rd/Squidwarc); output: rewritten.warc • Example URI-M: http://web.archive.org/web/19961120150251/http://www.usnews.com:80/
  • 59. Extract all URI-Ms by reading the WARC records in rewritten.warc using the tool warcio (https://github.com/webrecorder/warcio)
  • 60. Requesting the raw mementos using id_; output: raw.warc • Legend: ✓ = 200 OK (or archival 4xx/5xx); x = archive-specific resources; x = 3xx redirect
  • 61. Step 3: Generate aggregated hash values • Generate the aggregated hash with Merkle trees
  • 62. Step 4: Identify changes • Identifying types of changes on the playback of mementos:
      ⁃ Set: One or more resources in the set comprising a composite memento has changed
      ⁃ Status code: The HTTP status code of one or more resources comprising a composite memento has changed
      ⁃ HTTP headers: One or more HTTP response headers, that we do not expect to change, has changed
      ⁃ Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed
      ⁃ URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime
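One way to read the five change types is as a comparison between two downloads of the same composite memento. The sketch below does that under stated assumptions: the `Response` structure and the function name are hypothetical, each download is modeled as a mapping from the requested URI-M to what was observed, and the set of headers treated as stable (`STABLE_HEADERS`) is an illustrative choice, not the list used in the study.

```python
from dataclasses import dataclass

# Assumption: headers we would not expect to change between replays.
STABLE_HEADERS = ("Content-Type", "Memento-Datetime")


@dataclass
class Response:
    status: int
    headers: dict
    body_hash: str
    final_urim: str  # URI-M actually served after following any redirects


def classify_changes(previous: dict, current: dict) -> set:
    """Return the change types observed between two downloads."""
    changes = set()
    if set(previous) != set(current):
        changes.add("Set")
    for urim in set(previous) & set(current):
        old, new = previous[urim], current[urim]
        if old.status != new.status:
            changes.add("Status code")
        if any(old.headers.get(h) != new.headers.get(h) for h in STABLE_HEADERS):
            changes.add("HTTP headers")
        if old.body_hash != new.body_hash:
            changes.add("Representation")
        if old.final_urim != new.final_urim:
            changes.add("URI-M")
    return changes


first = {
    "m/page": Response(200, {"Content-Type": "text/html"}, "aaa", "m/page"),
    "m/img1": Response(404, {"Content-Type": "text/html"}, "bbb", "m/img1"),
}
second = {
    "m/page": Response(200, {"Content-Type": "text/html"}, "aaa", "m/page"),
    "m/img1": Response(200, {"Content-Type": "image/jpeg"}, "ccc", "m/img1"),
    "m/img2": Response(200, {"Content-Type": "image/jpeg"}, "ddd", "m/img2"),
}
print(sorted(classify_changes(first, second)))
```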
  • 63. Set: One or more resources in the set comprising a composite memento has changed • Reload #1 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 64. Set: One or more resources in the set comprising a composite memento has changed • Reload #2 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 65. Set: One or more resources in the set comprising a composite memento has changed • Reload #3 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/
  • 66. A resource selected randomly by JavaScript (Reload #3 of https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/):
        function random_imglink(){
          myimages[1]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannerbluemnt.jpg";
          myimages[2]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannereagle.jpg";
          myimages[3]="/congress112th/20130119060624/http://www.fws.gov/home/feature/home-banner/open-spaces/bannertiger.jpg";
          var ry=Math.floor(Math.random(1)*myimages.length)
          if (ry==0) ry=1
          document.write('<a href='+'"'+imagelinks[ry]+'"'+'><img src="'+myimages[ry]+'" border="0" alt="The Open Spaces Blog. A Talk on the Wild Side. Click to Read"></a>')
        }
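To see why reloads of this memento request different banner images, the selection logic can be mirrored in Python. This is a sketch under an assumption that is not visible on the slide: that `myimages` starts out as an empty array, so after indices 1 to 3 are assigned its length is 4. Under that assumption the index is drawn from {0, 1, 2, 3} and 0 is remapped to 1.

```python
import math
from collections import Counter
from fractions import Fraction


def pick_index(r: float, length: int = 4) -> int:
    """Mirror of the page's logic: floor(r * length), with 0 remapped to 1."""
    ry = math.floor(r * length)
    return 1 if ry == 0 else ry


def distribution(length: int = 4) -> dict:
    """Exact selection probabilities, one representative r per equal-width bucket."""
    counts = Counter(pick_index((k + 0.5) / length, length) for k in range(length))
    return {index: Fraction(n, length) for index, n in counts.items()}


for index, probability in sorted(distribution().items()):
    print(f"myimages[{index}] is chosen with probability {probability}")
```

Each replay therefore requests one of three images, which is enough to change the set of resources in the composite memento from one reload to the next.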
  • 67. Status code: The HTTP status code of one or more resources comprising a composite memento has changed • Composite memento: https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/ • Embedded image: https://web.archive.org/web/20141209193553im_/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg • Observed status codes for the image: 404 and 200
  • 68. Status code (continued): observations change archives • WARC record for the embedded image:
        WARC/1.0
        WARC-Type: response
        WARC-Target-URI: https://web.archive.org/save/_embed/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
        WARC-Date: 2017-11-18T10:33:14Z
        …
        HTTP/1.1 200 OK
        Date: Sat, 18 Nov 2017 10:32:51 GMT
        Content-Type: image/jpeg
        Content-Location: /web/20171118103250/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg
  • 69. HTTP headers: One or more HTTP response headers, that we do not expect to change, has changed • Example: https://web.archive.org/web/20071111211818/http://images.sohu.com:80/chat_online/market/sohu/140140-1.html, replayed in 2017 and replayed in 2018
  • 70. Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed • When requesting the raw version, a third-party service (Cloudflare) modifies the HTML:
        $ curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
        Date: Tue, 15 May 2018 21:00:45 GMT
        <a href="/cdn-cgi/l/email-protection#28175b5d4a424d4b5c15690854086905720861464c4d5008474e087d067b06086f475e4d5a46454d465c086c4d58495a5c454d465c5b0849464c08694f4d464b414d5b0e494558134a474c5115405c5c585b1207075f5f5f065d5b49064f475e074e4d4c4d5a494405494f4d464b414d5b0749" …
        $ curl -is http://perma-archives.org/warc/20171026200017id_/https://www.usa.gov/federal-agencies/a | egrep -i "(cdn-cgi|^Date:)"
        Date: Tue, 15 May 2018 21:00:50 GMT
        <a href="/cdn-cgi/l/email-protection#68571b1d0a020d0b1c55294814482945324821060c0d1048070e483d463b46482f071e0d1a06050d061c482c0d18091a1c050d061c1b4809060c48290f0d060b010d1b4e090518530a070c1155001c1c181b5247471f1f1f461d1b09460f071e470e0d0c0d1a090445090f0d060b010d1b4709" …
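The two email-protection payloads look unrelated, yet they appear to carry the same underlying text. The sketch below assumes the payloads follow the XOR scheme commonly used by Cloudflare's email obfuscation, where the first byte is a key and every following byte is XORed with it; that encoding is an assumption here and is not stated on the slide. Only the first ten bytes of each payload are used to keep the example short.

```python
def decode_email_protection(hex_payload: str) -> str:
    """Decode a payload where byte 0 is an XOR key for the remaining bytes."""
    data = bytes.fromhex(hex_payload)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8", errors="replace")


# Leading bytes of the two payloads returned five seconds apart.
first_prefix = "28175b5d4a424d4b5c15"
second_prefix = "68571b1d0a020d0b1c55"

print(decode_email_protection(first_prefix))
print(decode_email_protection(second_prefix))
```

Both prefixes decode to `?subject=`, which suggests that only the per-response key (0x28 vs. 0x68) differs. The entity body therefore changes on every request even though the archived content behind it is the same, and any hash computed over the body changes with it.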
  • 71. URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime (figure labels: Live Web; Archive; Replay; Time; May 2016, April 2017, April 2018)
  • 74. URI-M: One or more resources in the set comprising a composite memento redirects to a different memento with a different Memento-Datetime • In the figure, one memento is marked X and the request is answered with a 302 Redirect (figure labels: Live Web; Archive; Replay; Time; May 2016, April 2017, April 2018)
  • 76. Example: https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/
      ⁃ URI-M1 = web.archive.org/web/20110116134258id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      ⁃ URI-M2 = web.archive.org/web/20120121090532id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G
      ⁃ URI-M1 was requested in December 2017 and again in March 2018; on one of the requests, URI-M1 was NOT available and the archive answered with a 302 Redirect to URI-M2
  • 78. Example from perma-archives.org
      ⁃ URI-M1 = perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      ⁃ URI-M2 = perma-archives.org/warc/20170619145458id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit/
      ⁃ URI-M1 was requested on December 12, 2017 and again on December 25, 2017; on one of the requests, URI-M1 was NOT available and the archive answered with a 302 Redirect to URI-M2, which is a different image
  • 79. Step 5: Present results • 88.45% of the 16,627 mementos produce at least two different hashes
  • 80. One in eight mementos (11.55%) always produces the same hash, and one in six mementos (16.06%) produces a different hash on each replay (plot legend: blue = 11.55%, 1,920 mementos; red = 16.06%, 2,670 mementos)
  • 81. The types of changes affecting mementos after each download
  • 82. Migrated and missing mementos (11.91%) • Our blog posts about the movements of mementos: https://ws-dl.blogspot.com/2019/08/2019-08-30-where-did-archive-go-part1.html • https://ws-dl.blogspot.com/2019/09/2019-09-10-where-did-archive-go-part-2.html • https://ws-dl.blogspot.com/2019/09/2019-09-25-where-did-archive-go-part-3.html • https://ws-dl.blogspot.com/2019/10/2019-10-21-where-did-archive-go-part-4.html
  • 83. Because most mementos produce multiple aggregated hash values over time, we introduce two additional hashing techniques • URI-M-based hashing technique: only the URI-Ms of the mementos comprising a composite memento are used in the hash calculation • Entity-based hashing technique: only the HTTP entity bodies of the mementos comprising a composite memento are used in the hash calculation
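The difference between the three techniques is which parts of each resource feed the hash. The sketch below illustrates that under assumptions: the `Resource` structure and function names are hypothetical, and the aggregation is a simple hash over sorted per-item digests rather than the Merkle tree mentioned earlier in the talk.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Resource:
    urim: str
    status: int
    headers: tuple  # ((name, value), ...)
    body: bytes


def _aggregate(items) -> str:
    """Hash the sorted per-item digests so the result ignores download order."""
    digests = sorted(hashlib.sha256(item).digest() for item in items)
    return hashlib.sha256(b"".join(digests)).hexdigest()


def urim_based_hash(resources) -> str:
    """Only the URI-Ms contribute."""
    return _aggregate(r.urim.encode("utf-8") for r in resources)


def entity_based_hash(resources) -> str:
    """Only the HTTP entity bodies contribute."""
    return _aggregate(r.body for r in resources)


def complete_hash(resources) -> str:
    """URI-M, status code, headers, and entity body all contribute."""
    return _aggregate(
        repr((r.urim, r.status, sorted(r.headers), r.body)).encode("utf-8")
        for r in resources
    )


page = [
    Resource("m/index.html", 200, (("Content-Type", "text/html"),), b"<html>"),
    Resource("m/logo.gif", 200, (("Content-Type", "image/gif"),), b"GIF89a"),
]
# Same URI-Ms and bodies, but one header value differs.
replay = [replace(page[0], headers=(("Content-Type", "text/plain"),)), page[1]]

print(complete_hash(page) == complete_hash(replay))          # False
print(entity_based_hash(page) == entity_based_hash(replay))  # True
print(urim_based_hash(page) == urim_based_hash(replay))      # True
```

Narrowing the input makes a hash more repeatable but blind to the parts it leaves out, which is the trade-off the two additional techniques make.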
  • 87. Complete hashing on mementos from archive.org: expected results
  • 90. Complete hashing on mementos from archive.org • New hash values calculated in each download (median = 871 hash values) • Only 47% of the total number of hash values are seen in Download 1
  • 91. URI-M-based hashing on mementos from archive.org • New URI-Ms requested in each download (median = 806 URI-Ms) • Only 50% of the total number of URI-Ms are requested in Download 1
  • 92. Entity-based hashing on mementos from archive.org • New entity bodies observed in each download (median = 116 entity bodies) • About 80% of the total number of entity bodies are seen in Download 1
  • 93. Research questions • RQ2: Given the types of changes identified in the playback of mementos, what steps/requirements should we consider in order to generate repeatable fixity information (defining an archive-aware fixity-based approach)?
  • 94. Guidelines for generating fixity information on the playback of mementos • We define these guidelines based on results from our 14-month study
  • 95. Outline Introduction and Motivation Research Questions Background Sampling of Related work Changes in the Playback of Mementos The Fixity Information Dissemination Framework Contributions, Future Work, and Conclusions 95 @maturban1 • @WebSciDL PhD Defense: Mohamed Aturban July 23, 2020
  • 96. RQ3: How can we store and retrieve fixity information independently of the web archives in which the associated mementos are preserved? M. Aturban, S. Alam, M. L. Nelson, and M. C. Weigle, “Archive Assisted Archival Fixity Verification Framework,” in Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019.
  • 97. Two approaches for disseminating and verifying the fixity of archived web pages (using web archives to monitor web archives) • The Atomic approach: - Generate a manifest file (a JSON file containing the fixity information) for each memento - Publish the manifest at a well-known location - Disseminate the published manifest to several archives • The Block approach: - Batch together fixity information of multiple mementos in a single binary-searchable file (or block) - Publish the block at a well-known location - Disseminate the published block to several archives
  • 98. A manifest example containing fixity information:
    {
      "@context": "http://oduwsdl.github.io/contexts/fixity",
      "created_at": "Wed, 08 Apr 2020 02:22:56 GMT",
      "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20200408022256/d5d...f04/https://web.archive.org/web/19970104075414/http://www.un.org:80/",
      "composite-memento-uri-m-hash": "8bb453d8aa...db5f5fbf6",
      "composite-memento-entity-hash": "cdf47062a...3ebe030b0",
      "composite-memento-overall-hash": "69ca5a85...b9206930",
      "uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/",
      "resources": [
        {
          "http-headers": ["X-Archive-Orig-last-modified", "Content-Type", "X-Archive-Orig-date", "Memento-Datetime"],
          "entity-phash": null,
          "entity-hash": "ba140a5bede7f10bea0...7514725862eda82a003",
          "overall-hash": "3e4133b3766c2a58d6f...23f5f95a206a1ba9878",
          "entity-simhash": 9695187482751709335,
          "uri-m": "https://web.archive.org/web/19970104075414/http://www.un.org:80/",
          "status-code": "200"
        },
        {
          "http-headers": ["X-Archive-Orig-last-modified", "X-Archive-Orig-date", "Memento-Datetime", "Content-Type"],
          "entity-phash": "87d5798529a75a58",
          "entity-hash": "d9305a4da88570700d92...17c0c0f72b9f4e514b9",
          "overall-hash": "de002fbe292372199e6...6059044f4695f0dda2c",
          "entity-simhash": null,
          "uri-m": "https://web.archive.org/web/19970315165323im_/http://www.un.org/homepage.gif",
          "status-code": "200"
        },
        {
          "http-headers": ["Content-Type", "Memento-Datetime"],
          "entity-phash": "219d1a8362a71040",
          "entity-hash": "d5fd59c929e1c62b17b...d5e321f64a919e4294e",
          "overall-hash": "7df9dde47fa5bab643...cef3c84694bb1db8b1c",
          "entity-simhash": null,
          "uri-m": "https://web.archive.org/web/20120129120857/http://web.archive.org/screenshot/http://www.un.org/",
          "status-code": "200"
        }
      ]
    }
  • 99. Atomic approach: Push manifests into multiple archives • In this example, the memento is in the Internet Archive (IA) and its fixity information is disseminated to four archives, including IA • An attacker would have to hack a majority of the 4 domains (archives) • Published manifest: manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • Archived copies of the manifest: • https://archive.is/20181224093334/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • https://web.archive.org/web/20181224093355/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • https://perma-archives.org/warc/20181224093354/http://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • http://www.webcitation.org/74tvdsyxe
  • 100. Block approach: Batch together fixity information of multiple mementos in a single file (block) • Adding additional metadata (e.g., created_at, fields, …) • The hash of the previous block must be added
    !context ["http://oduwsdl.github.io/contexts/fixity"]
    !fields {keys: ["surt"]}
    !id {uri: "https://manifest.ws-dl.cs.odu.edu/"}
    !meta {created_at: "20190111181327"}
    !meta {prev_block:"sha256:d4eb1190f9aaae9542...845b632eb2b3f4f098a34144d"}
    !meta {type: "FixityBlock"}
    org,archive,web)/web/19961022175434/http://search.com
    org,archive,web)/web/19961219082428/http://sho.com
    org,archive,web)/web/19961223174001/http://reference.com
    …
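The role of the `prev_block` field can be illustrated with a minimal sketch. It assumes each block stores the SHA-256 digest of the full text of the block before it, so that altering an earlier block breaks the link held by its successor. The header here is deliberately simplified (only `type` and `prev_block`), and `make_block` and `chain_is_intact` are hypothetical helpers, not part of the published framework.

```python
import hashlib


def block_digest(block_text: str) -> str:
    """Digest of a whole block, in the 'sha256:<hex>' form shown on the slide."""
    return "sha256:" + hashlib.sha256(block_text.encode("utf-8")).hexdigest()


def make_block(records, prev_block_text=None):
    """Build a simplified block: header lines, then sorted records."""
    lines = ['!meta {type: "FixityBlock"}']
    if prev_block_text is not None:
        lines.append('!meta {prev_block: "%s"}' % block_digest(prev_block_text))
    # Sorting keeps the records binary-searchable by key.
    lines.extend(sorted(records))
    return "\n".join(lines) + "\n"


def chain_is_intact(blocks):
    """Check that every block references the digest of its predecessor."""
    for prev, cur in zip(blocks, blocks[1:]):
        if 'prev_block: "%s"' % block_digest(prev) not in cur:
            return False
    return True


block_1 = make_block(["org,archive,web)/web/19961022175434/http://search.com"])
block_2 = make_block(["org,archive,web)/web/19961219082428/http://sho.com"], block_1)

tampered_1 = block_1.replace("search.com", "changed.example")
print(chain_is_intact([block_1, block_2]), chain_is_intact([tampered_1, block_2]))
```

Because each block commits to the one before it, archiving only the latest block in several archives indirectly pins the content of all earlier blocks.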
  • 101. Block approach: Push the blocks entrypoint into multiple archives • Blocks entrypoint: manifest.ws-dl.cs.odu.edu/blocks • https://web.archive.org/web/20190121054059/https://manifest.ws-dl.cs.odu.edu/blocks/7bbf757046ac0a0a60015a1cb847c3189160d18c809b210073822df157609e01 • https://perma.cc/8YG3-X7KN • Will result in archiving the latest published block as well
  • 102. Three steps to verify the fixity of a memento: 1. Discover a manifest/block (in the Atomic approach, this includes discovering the archived manifests) 2. Compute the current fixity information of the memento 3. Compare the current fixity information with the discovered manifests/block • An example of discovering the latest manifest in the Archival Fixity server for the memento https://web.archive.org/web/20171115140705/http://rln.fm/:
    $ curl -sI https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20171115140705/http://rln.fm/ | egrep -i "(HTTP/|^location:)"
    HTTP/2 302
    location: https://manifest.ws-dl.cs.odu.edu/manifest/20181212074423/bd669de8835e38d54651fe9d04709515beec0c727db82a5366f4bc2506e103d8/https://web.archive.org/web/20171115140705/http://rln.fm/
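Step 3 can be sketched as a comparison against the manifest copies retrieved from several archives. This is a minimal sketch, assuming a strict-majority rule across copies (suggested by the remark that an attacker would have to compromise a majority of the archives, but not spelled out as an algorithm in the slides). The three field names come from the manifest example; `majority_manifest` and `verify_fixity` are hypothetical names.

```python
from collections import Counter

# Aggregated hash fields as they appear in the manifest example.
HASH_FIELDS = (
    "composite-memento-uri-m-hash",
    "composite-memento-entity-hash",
    "composite-memento-overall-hash",
)


def majority_manifest(manifests):
    """Return the hash tuple held by a strict majority of manifest copies, else None."""
    if not manifests:
        return None
    votes = Counter(tuple(m.get(f) for f in HASH_FIELDS) for m in manifests)
    winner, count = votes.most_common(1)[0]
    return winner if count > len(manifests) / 2 else None


def verify_fixity(current, manifests):
    """Compare freshly computed fixity with the majority of discovered copies."""
    agreed = majority_manifest(manifests)
    if agreed is None:
        return "inconclusive"
    mine = tuple(current.get(f) for f in HASH_FIELDS)
    return "verified" if mine == agreed else "mismatch"


good = {f: "aaa" for f in HASH_FIELDS}
forged = {f: "bbb" for f in HASH_FIELDS}

# Four archived copies of the manifest, one of which has been altered.
print(verify_fixity(good, [good, good, good, forged]))
```

Steps 1 and 2 (discovering the manifests and recomputing the hashes) involve network access and are left out so the comparison logic stays self-contained.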
  • 103. Evaluation • A data set of 16K mementos from 17 public web archives • For each memento, we generated and disseminated a manifest to 3 archives - The median size of a composite memento is 1143.85 KB - The median size of a manifest file is 15.29 KB, which represents 1.33% of the size of a composite memento
  • 104. Increasing the number of records per block reduces the block generation time
  • 105. The Block approach creates fewer resources in archives than the Atomic approach • Given a collection of N = 16,608 mementos • K_atomic = 3 web archives • K_block = 2 web archives • The selected block size B = 1038 records per block • The total number of resources created in the archives by each approach: Atomic (N ∗ K_atomic) = 49,824; Block (K_block ∗ (N/B)) = 32
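The two totals can be reproduced directly from the values on the slide. In this short sketch the variable names are illustrative, and rounding the block count up (to cover a partially filled last block) is an assumption that makes no difference here because N is an exact multiple of B.

```python
import math

N = 16_608       # mementos in the collection
K_ATOMIC = 3     # archives receiving each manifest
K_BLOCK = 2      # archives receiving each block
B = 1_038        # records per block

num_blocks = math.ceil(N / B)          # ceil covers a partially filled last block
atomic_resources = N * K_ATOMIC        # one archived manifest per memento per archive
block_resources = K_BLOCK * num_blocks # one archived block per block per archive

print(num_blocks, atomic_resources, block_resources)  # 16 49824 32
```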
  • 106. It takes 1.09x and 4.54x longer to disseminate a manifest to perma.cc and archive.org, respectively, than to archive.is
  • 107. It takes 9.2x longer to disseminate a block to archive.org than to perma.cc
  • 108. The Block approach performs 1.05x faster than the Atomic approach in verifying the fixity of mementos • Figure panels: discovering and downloading manifest files in the Atomic/Block approaches per archive; verifying mementos by both approaches
  • 109. Outline • Introduction and Motivation • Research Questions • Background • Sampling of Related work • Changes in the Playback of Mementos • The Fixity Information Dissemination Framework • Contributions, Future Work, and Conclusions
  • 110. Contributions • RQ1 - Four methods for collecting mementos (arXiv’19) - Identified and quantified types of changes on the playback of mementos (JCDL/WADL’18) - Showed examples of missing mementos • RQ2 - The two hashing techniques (URI-M-based and entity-based) - The archive-aware hashing function (arXiv’17) • RQ3 - ArchiveNow, a tool for disseminating web pages in public web archives (JCDL’18) - A framework for disseminating fixity information to web archives (JCDL’19)
  • 111. Future Work: Investigating generating fixity information using web packaging • Web packaging is an emerging standard • It should allow archives to deliver a composite memento in a single HTTP response or in a self-contained file • Using web packaging, we can download a composite memento, packaged in a bundle, with a single HTTP request. This should reduce playback-related changes, such as transient errors and URI-M changes.
  • 112. Conclusions • Conventional hashing techniques are not suitable for replayed archived web pages. • We defined an archive-aware hashing function that consists of multiple guidelines (based on our 14-month study on 16K mementos) • Fixity information includes (1) multiple aggregated hash values generated using different hashing techniques (URI-M-based and entity-based hashing) and (2) multiple hash values generated on each resource comprising a composite memento • We introduce two approaches for disseminating fixity information to web archives
  • 113. The archive-aware hashing function
  • 116. Collecting 16,627 Mementos (Step 1: Collect a dataset of mementos) • Sources of URI-Rs: • The HTTP Archive: httparchive.org • The Web Archives for Historical Research: uwaterloo.ca/web-archive-group/ • Not all mementos are created equal: measuring the impact of missing resources, J. Brunelle et al. (DOI: doi.org/10.1007/s0079) • The Moz Top 500 Websites: moz.com/top500 • M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
  • 117. An example of a missing memento • Mementos from the National Library of Ireland (NLI) collection have been moved from collections.internetmemory.org/nli/ to wayback.archive-it.org/10702/ • The URI-M http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/ was 200 OK in September 2018 • The URI-M http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/ is now 404 Not Found
  • 118. (Step 5: Present results) - The heatmap shows archive-level changes by comparing consecutive downloads of mementos - It visualizes the overall performance of each archive - It identifies points in time where major changes occur
  • 119. URI-Rs with different path lengths and URI-Ms with different Memento-Datetime (Step 1: Select a dataset of mementos) • Figure panels: URI-Ms per year; URI-Rs per path length • M. Aturban, M. L. Nelson, M. C. Weigle, M. Klein, and H. Van de Sompel, “Collecting 16K archived web pages from 17 public web archives,” Tech. Rep. arXiv:1905.03836, May 2019.
  • 121. Representation: The returned HTTP entity body of one or more resources comprising a composite memento has changed (Step 4: Identify changes) • Downloading the ZIP file of a memento at three different times. Each time the archive refers to itself differently in the index.html in the ZIP file. • http://archive.is/BRWpm • http://archive.is/download/BRWpm.zip
  • 122. Representation (transient errors) (Step 4: Identify changes) • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/ • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
  • 123. Representation (transient errors) (Step 4: Identify changes) • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/ • http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
    Record with a transient error (the first Content-Length should be bigger than the second one):
      WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      WARC-Date: 2017-12-07T10:04:18Z
      …
      Content-Length: 459640
      HTTP/1.0 200
      Content-Type: image/jpeg
      Content-Length: 642336
      Date: Thu, 07 Dec 2017 10:04:18 GMT
      …
    This is what it should look like:
      WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg
      WARC-Date: 2017-11-16T15:34:37Z
      …
      Content-Length: 643398
      HTTP/1.0 200
      Content-Type: image/jpeg
      Content-Length: 642336
      Date: Thu, 16 Nov 2017 15:34:36 GMT
      …
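The check described on this slide can be sketched as a comparison of the two Content-Length values in a stored record. This is a minimal sketch, assuming the first Content-Length in the record text is the WARC-level one and the second is the HTTP-level one, as in the two records shown; `content_lengths` and `looks_truncated` are hypothetical helper names, and a real tool would parse the record structure rather than use a regular expression.

```python
import re


def content_lengths(record_text: str):
    """Return every Content-Length value in the record, in order of appearance."""
    pattern = r"Content-Length:\s*(\d+)"
    return [int(n) for n in re.findall(pattern, record_text, flags=re.IGNORECASE)]


def looks_truncated(record_text: str) -> bool:
    """Flag records whose WARC-level length does not exceed the HTTP-level length."""
    lengths = content_lengths(record_text)
    if len(lengths) < 2:
        raise ValueError("expected a WARC-level and an HTTP-level Content-Length")
    warc_len, http_len = lengths[0], lengths[1]
    # The WARC block holds the HTTP headers plus the body, so it must be larger.
    return warc_len <= http_len


# Header values copied from the two records on the slide (other fields elided).
error_record = """WARC/1.0
WARC-Type: response
Content-Length: 459640
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
"""
expected_record = """WARC/1.0
WARC-Type: response
Content-Length: 643398
HTTP/1.0 200
Content-Type: image/jpeg
Content-Length: 642336
"""

print(looks_truncated(error_record), looks_truncated(expected_record))
```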
  • 124. Archives react differently to requests for raw mementos (Step 4: Identify changes)
  • 125. The Block approach performs 4.46x faster than the Atomic approach in verifying the fixity of mementos • The fixity verification time includes: - Discovering manifests - Computing current fixity information - Downloading the archived manifests - Comparing results • On average, the verification time of a memento is 6.65 seconds by the Atomic approach and 1.49 seconds by the Block approach @maturban1 • August 22, 2019 A Framework for Verifying the Fixity of Archived Web Resources
  • 126. A manifest file example • Including how hashes are computed • Hashes are computed on only the base HTML file • Compute fixity on things that should not change, like certain original HTTP response headers
    {
      "@context": "http://manifest.ws-dl.cs.odu.edu/",
      "created": "Sun, 23 Dec 2018 11:43:55 GMT",
      "@id": "http://manifest.ws-dl.cs.odu.edu/manifest/20181223114355/c6ad485819abbe20e37c0632843081710c95f94829f59bbe3b6ad3251d93f7d2/https://web.archive.org/web/2018121102034/https://2019.jcdl.org/",
      "uri-r": "https://2019.jcdl.org/",
      "uri-m": "https://web.archive.org/web/20181219102034/https://2019.jcdl.org/",
      "memento-datetime": "Wed, 19 Dec 2018 10:20:34 GMT",
      "http-headers": {
        "Content-Type": "text/html; charset=UTF-8",
        "X-Archive-Orig-date": "Wed, 19 Dec 2018 10:20:36 GMT",
        "X-Archive-Orig-link": "<https://2019.jcdl.org/wp-json/>; rel=\"https://api.w.org/\"",
        "Preference-Applied": "original-links, original-content"
      },
      "hash-constructor": "(curl -s '$uri-m' && echo -n '$Content-Type $X-Archive-Orig-date $X-Archive-Orig-link') | tee >(sha256sum) >(md5sum) >/dev/null | cut -d ' ' -f 1 | paste -d':' <(echo -e 'md5\\nsha256') - | paste -d' ' - -",
      "hash": "md5:969d7aba4c16444a6544bdc39eefe394 sha256:c68a215eb1c3edbf51f565b9a87f49646456369e51791a86106a6667630737a6"
    }
  • 127. Using web packaging, we can download a composite memento, packaged in a bundle, with a single HTTP request. This should reduce playback-related changes, such as transient errors and URI-M changes.
  • 129. Memento framework • Uses time as a dimension to access the web by relating current web resources to their prior states • Is supported by most public web archives, including the Internet Archive • http://mementoweb.org/guide/quick-intro/
  • 130. URI-M-based hashing on mementos from archive-it.org: expected results vs. actual results
  • 131. URI-M-based hashing on mementos from archive-it.org: actual results • New URI-Ms are requested in each download
  • 132. URI-M-based hashing on mementos from archive-it.org: expected results vs. actual results • New URI-Ms are requested in each download
  • 133. Atomic approach (step 1): Push a web page into multiple archives • Web page: https://2019.jcdl.org/ • https://archive.is/20181224085310/https://2019.jcdl.org/ • https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ • http://www.webcitation.org/74tsy6pU0 • This results in creating multiple mementos of the web page. Archive Assisted Archival Fixity Verification Framework ∙ JCDL 2019 ∙ June 4, 2019 ∙ Urbana-Champaign, Illinois
  • 134. Atomic approach (steps 2 & 3): For each memento, compute fixity “manifest” and publish it on the web at a well-known location (Archival Fixity server) • manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0 • In this example https://manifest.ws-dl.cs.odu.edu is the Archival Fixity server • Actual URIs to manifests can be a bit more complex using “Trusty URIs”: http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html
  • 135. The manifest’s generic URI always redirects to the most recent time-specific manifest version (trusty URI) • manifest.ws-dl.cs.odu.edu/manifest/https://archive.is/20181224085310/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/https://perma-archives.org/warc/20181224085330/https://2019.jcdl.org/ • manifest.ws-dl.cs.odu.edu/manifest/http://www.webcitation.org/74tsy6pU0 • The structure of generic URIs is easy to remember: <Archival-Fixity-Server>/<URI to memento>, so they can be used to look up manifests in both the Archival Fixity server and archives • In the example, the requested URI is the manifest’s generic URI and the location header carries the manifest’s trusty URI:
    $ curl -sIL https://manifest.ws-dl.cs.odu.edu/manifest/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/ | egrep -i "(HTTP/|^location:)"
    HTTP/2 302
    location: https://manifest.ws-dl.cs.odu.edu/manifest/20181224093024/8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/https://web.archive.org/web/20181224085329/https://2019.jcdl.org/
    HTTP/2 200
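The relationship between the two URI forms can be sketched as below. The pattern is inferred only from the example URIs on these slides (a `/manifest/` path segment, then a 14-digit datetime and a 64-character hex digest before the URI-M); it is an assumption for illustration, not a specification, and `generic_manifest_uri` and `parse_trusty_uri` are hypothetical helper names.

```python
import re

# Assumed layout: <server>/manifest/<14-digit datetime>/<64-hex digest>/<URI-M>
TRUSTY_RE = re.compile(
    r"^(?P<server>https?://[^/]+)/manifest/"
    r"(?P<datetime>\d{14})/(?P<digest>[0-9a-f]{64})/(?P<uri_m>.+)$"
)


def generic_manifest_uri(server: str, uri_m: str) -> str:
    """Build the easy-to-remember generic URI for a memento's manifest."""
    return server.rstrip("/") + "/manifest/" + uri_m


def parse_trusty_uri(trusty_uri: str):
    """Split a trusty URI into its parts, or return None if it does not match."""
    match = TRUSTY_RE.match(trusty_uri)
    return match.groupdict() if match else None


SERVER = "https://manifest.ws-dl.cs.odu.edu"
URI_M = "https://web.archive.org/web/20181224085329/https://2019.jcdl.org/"
TRUSTY = (
    SERVER + "/manifest/20181224093024/"
    "8c31ccfbb3a664c9160f98be466b7c9fb9afa80580ab5052001174be59c6a73a/" + URI_M
)

generic = generic_manifest_uri(SERVER, URI_M)
parts = parse_trusty_uri(TRUSTY)
print(generic)
print(parts["datetime"], parts["uri_m"])
```

A client that follows the redirect from the generic URI could use a parser like this to record which manifest version it verified against.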
  • 136. Block approach (step 2): Publish the block file at the Archival Fixity server • manifest.ws-dl.cs.odu.edu/blocks • The blocks entrypoint always redirects to the latest published block
  • 137. The dissemination/download time varies from one archive to another
  • 139. Links • Animated GIFs: https://github.com/oduwsdl/mementos-fixity/tree/master/hashing_techniques