Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
1. CLEAR+: A Credible Live Evaluation Method of Website Archivability
Vangelis Banos, Yannis Manolopoulos
3 JUNE 2015
NATIONAL DIGITAL INFORMATION INFRASTRUCTURE AND PRESERVATION PROGRAM
LIBRARY OF CONGRESS
Data Engineering Lab
Department of Informatics, Aristotle University, Thessaloniki, Greece
ARCHIVEREADY.COM
2. Table of Contents
1. Motivation and problem definition, related work,
2. Website Archivability,
3. CLEAR+: A Credible Live method to Evaluate
Website Archivability,
4. Demonstration: http://archiveready.com/,
5. Experimental Evaluation,
6. Use Cases,
7. Web Content Management Systems Archivability,
8. Discussion and conclusions.
3. 1. Motivation
• Web developer: I’m building a website. Is it
going to be archived correctly by a web archive?
I don’t know until I see the archived snapshot…
• Web archivist: Can I archive that website?
I don’t know, let’s crawl it and we’ll see the results…
• Professor: How can I teach my students about
web archiving?
Hundreds of standards, but not many relevant apps online…
4. Problem definition
• Web content acquisition is a critical step in the
process of web archiving;
• If the initial Submission Information Package lacks
completeness and accuracy for any reason (e.g.
missing or invalid web content), the rest of the
preservation processes are rendered useless;
• There is no guarantee that web bots dedicated to
retrieving website content can access and retrieve
it successfully;
• Web bots face increasing difficulties in harvesting
websites.
5. Problem definition
• Web harvesting is automated while Quality Assurance
(QA) is mostly manual.
• Web archives perform test crawls.
• Humans review the results, resources are spent.
• After web harvesting, administrators manually review the content and endorse or reject the harvested material.
• Efforts to deploy crowdsourced techniques to
manage QA provide an indication of how significant the
bottleneck is.
• (IIPC GA 2012 Crowdsourcing Workshop)
6. 2. Our Contributions
1. the introduction of the notion of Website Archivability,
2. the Credible Live Evaluation of Archive Readiness Plus
(CLEAR+) method to measure Website Archivability
3. ArchiveReady.com, a web application which implements
the proposed method.
Publications:
• Banos V., Manolopoulos Y.: A quantitative approach to
evaluate Website Archivability using the CLEAR+ method,
International Journal on Digital Libraries (IJDL), 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a
credible method to evaluate website archivability,
iPRES’2013, Lisbon, 2013.
7. Our Aims
1. Mechanism to improve the quality of web archives.
2. Expand and optimize the knowledge and practices of web archivists, supporting their decision making and risk management.
3. Standardize the web aggregation practices of web
archives, especially QA.
4. Foster good practices in web development, make
sites more amenable to harvesting, ingesting, and
preserving.
5. Raise awareness among web professionals regarding
preservation.
6. Support web archiving training.
8. What is Website Archivability?
Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.
Attention! It must not be confused with website dependability, a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability.
9. CLEAR+: A Credible Live Method to
Evaluate Website Archivability
• An approach to producing on-the-fly measurement
of Website Archivability,
• Web archives communicate with target websites via
standard HTTP,
• Information such as file types, content and transfer
errors can be used to support archival decisions,
• We combine this kind of information with an
evaluation of the website's compliance with
recognized practices in digital curation,
• We generate a credible score representing the
archivability of target websites.
10. The main components of CLEAR+
1. WA Facets: the factors that come into play and
need to be taken into account to calculate total WA.
2. Website Attributes: the website homepage
elements analysed to assess the WA Facets (e.g. the
HTML markup code).
3. Evaluations: the tests executed on the website attributes (e.g. HTML code validation against W3C HTML standards) and the approach used to combine the test results to calculate the WA metrics.
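To make the three components concrete, here is a minimal sketch of how they might be modelled in Python (the language ArchiveReady itself runs on). The class and field names are illustrative assumptions, not the tool's actual internals.

from dataclasses import dataclass, field

@dataclass
class Evaluation:
    """One test run against a website attribute, e.g. W3C HTML validation."""
    name: str
    significance: str  # "high", "medium" or "low" (illustrative labels)
    passed: bool = False

@dataclass
class Facet:
    """One WA Facet, scored by combining its evaluations."""
    name: str
    evaluations: list = field(default_factory=list)

# Website Attributes are the raw inputs the tests inspect:
website_attributes = {"html": "<html>...</html>", "robots_txt": "User-agent: *"}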
13. CLEAR+ Evaluations
1. Perform specific Evaluations on Website Attributes,
2. In order to calculate each Archivability Facet’s score:
• Scores range from 0 to 100%,
• Evaluation significance varies:
• High: critical issues which prevent web crawling or may cause highly problematic web archiving results.
• Medium: issues which are not critical but may affect the quality of web archiving results.
• Low: minor details which cause no issues when missing but help web archiving when available.
3. Website Archivability is the average of all Facets’ scores, as the sketch below illustrates.
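A worked sketch of this scoring scheme in Python. The significance weights (high=3, medium=2, low=1) are illustrative assumptions; the published CLEAR+ method defines its own weighting.

from statistics import mean

WEIGHTS = {"high": 3, "medium": 2, "low": 1}  # assumed weights, for illustration

def facet_score(evaluations):
    """evaluations: list of (passed, significance) pairs -> score in 0-100%."""
    total = sum(WEIGHTS[sig] for _, sig in evaluations)
    earned = sum(WEIGHTS[sig] for passed, sig in evaluations if passed)
    return 100.0 * earned / total

def website_archivability(facets):
    """facets: dict of facet name -> evaluation results; WA is their average."""
    return mean(facet_score(evals) for evals in facets.values())

facets = {
    "accessibility": [(True, "high"), (True, "medium"), (False, "low")],
    "standards_compliance": [(False, "high"), (True, "medium")],
}
print(round(website_archivability(facets), 1))  # 61.7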
15. Accessibility
• A website is considered accessible only if web
crawlers are able to visit its home page, traverse its
content and retrieve it via standard HTTP requests.
• Performance is also an important aspect of web
archiving. Faster performance means faster web
content ingestion.
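A minimal sketch of an accessibility-style check using only the Python standard library: can the home page be fetched over plain HTTP, and how quickly? The real CLEAR+ evaluations are considerably more extensive.

import time
import urllib.request

def check_accessibility(url):
    checks = {}
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            checks["homepage_http_ok"] = (resp.status == 200)
    except OSError:
        checks["homepage_http_ok"] = False
    checks["response_seconds"] = round(time.time() - start, 2)  # performance matters too
    return checks

print(check_accessibility("http://archiveready.com/"))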
18. Cohesion
• Relevant to:
• Efficient operation of web crawlers,
• Management of dependencies in digital curation.
• If the files constituting a single website are dispersed across different web locations, acquisition and ingest are at risk if one or more of those locations fail.
• Conversely, changes that occur outside the website cannot affect it if it does not use third-party resources.
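A minimal sketch of a cohesion-style measurement: the share of a page's resources served from the site's own host. This is an illustrative simplification of the three-level test described in the notes.

from urllib.parse import urlparse

def cohesion(page_url, resource_urls):
    """Percentage of resources hosted on the page's own domain."""
    site_host = urlparse(page_url).netloc
    local = sum(1 for u in resource_urls
                if urlparse(u).netloc in ("", site_host))  # "" = relative URL
    return 100.0 * local / len(resource_urls) if resource_urls else 100.0

print(cohesion("http://example.org/",
               ["http://example.org/app.js", "/style.css",
                "http://cdn.other.net/lib.js"]))  # ≈66.7: one of three is external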
21. Metadata
• The adequate provision of metadata has been a
continuing concern within digital curation.
• The lack of metadata impairs the archive’s ability to
manage, organise, retrieve and interact with content
effectively.
• Metadata may include descriptive or technical
information.
• Metadata increases the probability of successful
information extraction and reuse in web archives
after ingestion.
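A minimal sketch of a metadata-style check using only the standard library: does the page declare a description meta tag? Illustrative only, not the full CLEAR+ evaluation set.

from html.parser import HTMLParser

class MetaScanner(HTMLParser):
    """Collects the names of <meta> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.found = set()
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = (d.get("name") or d.get("http-equiv") or "").lower()
            if name:
                self.found.add(name)

scanner = MetaScanner()
scanner.feed('<html><head><meta name="description" content="demo">'
             '<meta http-equiv="content-type" content="text/html"></head></html>')
print("description" in scanner.found)  # True: descriptive metadata present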
24. Standards Compliance
• Compliance with standards is a recurring theme in digital curation practices. It is recommended that digital resources be represented in known and transparent standards if they are to be preserved.
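A sketch of automating such a check against the W3C Nu HTML checker's JSON interface; the endpoint and response shape follow its public documentation, but verify them before relying on this.

import json
import urllib.request

def html_errors(html: bytes):
    req = urllib.request.Request(
        "https://validator.w3.org/nu/?out=json",
        data=html,
        headers={"Content-Type": "text/html; charset=utf-8",
                 "User-Agent": "wa-demo/0.1"},  # the service rejects blank agents
    )
    with urllib.request.urlopen(req, timeout=15) as resp:
        messages = json.load(resp)["messages"]
    return [m for m in messages if m.get("type") == "error"]

print(len(html_errors(b"<!DOCTYPE html><title>t</title><p>ok</p>")), "errors")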
26. 4. Demonstration: ArchiveReady.com
- Web application implementing CLEAR+,
- Web interface plus a JSON web API,
- Running on Linux, Python, Nginx, Redis, MySQL and the PhantomJS headless browser.
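A sketch of calling the JSON web API from Python. The endpoint path and response keys below are hypothetical placeholders; consult http://archiveready.com/docs/api.html for the actual interface.

import json
import urllib.parse
import urllib.request

def evaluate(website_url):
    query = urllib.parse.urlencode({"url": website_url})
    api = "http://archiveready.com/api/evaluate?" + query  # hypothetical path
    with urllib.request.urlopen(api, timeout=60) as resp:
        return json.load(resp)

# result = evaluate("http://example.org/")
# print(result["archivability"])  # hypothetical response key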
29. 5. Experimental evaluation
• Questions:
– How can we prove the validity of the Website
Archivability metric?
– Is it possible to calculate the WA of a website by
evaluating a single webpage?
32. Experiment 2: Evaluation by experts
• Experts rank 200 websites according to the quality
of their snapshots at the Internet Archive
• We evaluate the same websites with
archiveready.com
• We calculate Pearson’s correlation coefficient between the two variables and look for correlations.
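A worked sketch of the correlation step with made-up numbers, using SciPy's pearsonr. Since rank 1 means the best snapshot, agreement between experts and ArchiveReady shows up as a strongly negative r.

from scipy.stats import pearsonr

expert_rank = [1, 2, 3, 4, 5]              # illustrative expert ordering
wa_score = [91.0, 85.5, 80.2, 74.9, 70.1]  # illustrative WA scores
r, p = pearsonr(expert_rank, wa_score)
print(f"r={r:.2f}, p={p:.3f}")             # r close to -1 here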
33. Experiment 3: WA variance in the pages
of the same website
• We evaluate only a single webpage to
calculate website archivability. Is this correct?
• Is the homepage WA representative of the
whole website WA?
• We use a set of websites and calculate the WA of 10 different webpages for each website (800 webpages in total) to find out.
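A sketch of the variance question with illustrative numbers: if per-page WA scores cluster tightly around their mean, the homepage score is a fair proxy for the whole site.

from statistics import mean, stdev

page_scores = [78.0, 80.5, 77.2, 79.9, 81.0, 78.8, 79.4, 80.1, 77.9, 79.0]  # made up
print(f"mean={mean(page_scores):.1f}, stdev={stdev(page_scores):.1f}")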
36. Use Case 1: Deutsches Literatur Archiv,
Marbach, Germany
• German literature web archiving project,
• http://www.dla-marbach.de/dla/bibliothek/literatur_im_netz/netzliteratur/
• ~3,000 websites are preserved,
• An evaluation of the archivability
characteristics of these websites was
necessary before crawling,
• The archiveready.com API was used to gain insight into their properties: http://archiveready.com/docs/api.html
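A sketch of how such a batch evaluation might look, reusing the hypothetical evaluate() helper from the API example above; the seed URLs are placeholders.

seeds = ["http://example-netliteratur-1.de/",
         "http://example-netliteratur-2.de/"]  # placeholder seed list

for url in seeds:
    result = evaluate(url)                    # hypothetical helper defined earlier
    print(url, result.get("archivability"))   # hypothetical response key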
37. Use Case 2: Academia
• Used by digital curation units, researchers and
teachers.
– University of Newcastle, UK,
– Columbia University Libraries,
– Stanford University Libraries,
– University of Michigan, Bentley Historical Library,
– Old Dominion University.
39. Web CMS Archivability
• CMS dominate the web
– (WordPress, Drupal, Joomla, Movable Type and others)
• CMS constitute a common technical
framework for web publishing.
• If a CMS is ‘incompatible’ with some web
archiving aspect, millions of websites are
affected and web archives suffer.
40. Web CMS Archivability
• Our contribution:
– We study 12 prominent web CMS.
– We conduct experiments with a sample of ~5,800 websites based on these CMS.
– We make specific observations on the Website
Archivability characteristics of each CMS.
• Paper (under review):
– Web Content Management Systems Archivability,
Banos V., Manolopoulos Y., ADBIS 2015.
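One practical step in such a study, sketched here: detecting a site's CMS from its generator meta tag. This is a heuristic for illustration; many sites strip or spoof the tag.

import re
import urllib.request

def detect_cms(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read(65536).decode("utf-8", errors="replace")
    m = re.search(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)',
                  html, re.IGNORECASE)
    return m.group(1) if m else None

# print(detect_cms("http://example-drupal-site.org/"))  # e.g. "Drupal 7"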
42. Web CMS Archivability
• Indicative results:
– Drupal has the third highest WA score (82.08%). It has
good overall performance and the only issue is the
existence of too many inline scripts per instance
(15.09).
– DotNetNuke has the second worst WA score in our evaluation (77.2%). We suggest looking into its RSS feeds (13% correct) and its lacking HTTP caching support (5%).
– Typo3’s WA score is average (79%). It has the largest number of invalid URLs per instance (12%).
44. Discussion and conclusions
• Introducing a new metric to quantify the previously
unquantifiable notion of WA is not an easy task.
• CLEAR+ and Website Archivability capture the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy.
• Archiveready.com is a reference implementation of
the CLEAR+ method.
• Archiveready.com provides a REST API for 3rd parties.
45. Discussion and conclusions
1. Web professionals
- evaluate the archivability of their websites
in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.
2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.
3. Academics
- teach students about web archiving.
46. THANK YOU
Visit: http://archiveready.com
Contact: vbanos@gmail.com
Learn More:
• Banos V., Manolopoulos Y.: A quantitative approach to
evaluate Website Archivability using the CLEAR+ method,
International Journal on Digital Libraries, 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible
method to evaluate website archivability, 10th International
Conference on Preservation of Digital Objects (iPRES’2013),
Lisbon, 2013.
ANY QUESTIONS?
Editor's Notes
Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for such reasons as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and should influence web design professionals to consider the implications of their design decisions on the likelihood that their sites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
Dirty data -> useless system
As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase.
For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
According to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository.
WCT supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of
descriptive metadata. Focusing on quality review, when a harvest is complete, the harvest result is saved in the digital asset store, and the Target Instance is saved in the Harvested state. The next step is for the Target Instance Owner to Quality Review the harvest. WCT operators perform this task manually.
E.g. IIPC has organized a Crowdsourcing workshop which included a QA task
Website archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1].
The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
Cohesion is tested on three levels:
• examining how many hosts are employed in relation to the location of referenced media content,
• examining how many hosts are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, and JavaScript files),
• examining the number of times proprietary software or plugins are referenced.
Already contacted by the following institutions
The Internet Archive,
University of Manchester,
Columbia University Libraries,
Society of California Archivists General Assembly,
Old Dominion University, Virginia, USA,
Digital archivists in the Netherlands.