The theory and practice of Website Archivability

  • Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and should influence web design professionals to consider the implications of their design decisions on the likelihood that their websites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
  • Web content acquisition is a critical step in the process of web archiving. If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless. There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully; web bots face increasing difficulties in harvesting websites. Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is. Dirty data -> useless system. As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
  • Website Archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1]. The aims of this work are to: support web archivists in decision making, in order to improve the quality of web archives; expand and optimize the knowledge and practices of web archivists; standardize the web aggregation practices of web archives, especially QA; foster good practices in web development, making sites more amenable to harvesting, ingesting and preserving; and raise awareness among web professionals regarding preservation.
  • The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project, which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
  • Already contacted by the following institutions: The Internet Archive, University of Manchester, Columbia University Libraries, Society of California Archivists General Assembly, Old Dominion University (Virginia, USA), and digital archivists in the Netherlands.
  • For instance, metadata breadth and depth might be critical for a particular web archiving research task, and therefore, in establishing the archivability score for a particular site, the user may wish to instantiate this thinking in calculating the overall score. A next step will be to introduce a mechanism that allows the user to weight each Archivability Facet to reflect specific objectives. One way to address these concerns might be to apply an approach similar to normalized discounted cumulative gain (NDCG) in information retrieval: for example, a user can rank the questions/errors to prioritise them for each facet. The basic archivability score can then be adjusted to penalise the outcome when the website does not meet the higher ranked criteria. Further experimentation with the tool will lead to a richer understanding of new directions in automation in web archiving.
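    A minimal sketch of how user-supplied facet weights and rank-based penalties might be combined follows. The weights, ranks and the logarithmic discount are illustrative assumptions, not part of CLEAR as currently defined; the facet scores reused here come from the iPRES 2013 evaluation later in the slides.

        # Sketch: weight each Archivability Facet and penalise failures on higher-ranked
        # criteria, loosely inspired by the discounted-gain idea mentioned above.
        # All weights, ranks and the log-based discount are illustrative assumptions.
        import math

        facet_scores = {"Accessibility": 50, "Cohesion": 70, "Standards Compliance": 77,
                        "Metadata": 87, "Performance": 100}

        # User-defined importance weights (they sum to 1, so the result stays in 0-100%).
        weights = {"Accessibility": 0.30, "Cohesion": 0.20, "Standards Compliance": 0.20,
                   "Metadata": 0.20, "Performance": 0.10}

        # Failed criteria ranked by the user, most important first (rank 1, 2, ...).
        failed_criteria_ranks = {"no sitemap.xml": 1, "invalid HTML": 2, "no RSS feed": 5}

        def weighted_score(scores, weights):
            """Weighted combination of facet scores."""
            return sum(scores[f] * weights[f] for f in scores)

        def rank_penalty(ranks):
            """Higher-ranked (smaller rank number) failures cost more: a 1/log2(rank+1) discount."""
            return sum(1.0 / math.log2(r + 1) for r in ranks.values())

        base = weighted_score(facet_scores, weights)
        adjusted = max(0.0, base - rank_penalty(failed_criteria_ranks))
        print(round(base, 1), round(adjusted, 1))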

    1. The Theory and Practice of Website Archivability. Vangelis Banos (1), Yunhyong Kim (2), Seamus Ross (2), Yannis Manolopoulos (1). (1) Department of Informatics, Aristotle University, Thessaloniki, Greece; (2) University of Glasgow, United Kingdom. FROM CLEAR TO ARCHIVEREADY.COM.
    2. Table of Contents: 1. Problem definition, 2. CLEAR: A Credible Live Method to Evaluate Website Archivability, 3. Demo: http://archiveready.com/, 4. Future Work.
    3. Problem definition
       • Web content acquisition is a critical step in the process of web archiving,
       • Web bots face increasing difficulties in harvesting websites,
       • After web harvesting, archive administrators manually review the content and endorse or reject the harvested material,
       • Key problem: web harvesting is automated while Quality Assurance (QA) is manual.
    4. What is Website Archivability? Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability, maintainability.
    5. CLEAR: A Credible Live Method to Evaluate Website Archivability. An approach to producing a credible on-the-fly measurement of Website Archivability by:
       • using standard HTTP to get website elements,
       • evaluating information such as file types, content encoding and transfer errors,
       • combining this information with an evaluation of the website's compliance with recognised practices in digital curation (using adopted standards, validating formats, assigning metadata),
       • calculating the Website Archivability score (0 – 100%).
    6. CLEAR: A Credible Live Method to Evaluate Website Archivability. The five Archivability Facets: Accessibility, Cohesion, Standards Compliance, Performance, Metadata.
    7. Website attributes evaluated using CLEAR.
    8. C L E A R. The method can be summarised as follows:
       1. Perform specific evaluations on website attributes,
       2. in order to calculate each Archivability Facet's score:
          • scores range from 0 – 100%,
          • not all evaluations are equal: if an important evaluation fails, its score is 0%; if a minor evaluation fails, its score is 50%,
       3. produce the final Website Archivability score by combining all Facets' scores.
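    A minimal sketch of this scoring rule follows; it is not the ArchiveReady implementation, and the unweighted average used to combine evaluations and facets is an assumption, consistent with the example evaluations on the next slides.

        # Sketch of the CLEAR scoring rule described on slide 8 (assumptions noted above).
        from statistics import mean

        def evaluation_score(passed, important):
            """An important evaluation scores 0% on failure; a minor one scores 50%."""
            if passed:
                return 100
            return 0 if important else 50

        def facet_score(evaluations):
            """Average the scores of all evaluations belonging to one facet."""
            return mean(evaluation_score(p, imp) for p, imp in evaluations)

        def website_archivability(facet_scores):
            """Combine the facet scores; an unweighted average is assumed here."""
            return mean(facet_scores.values())

        # Example shaped like the Accessibility evaluation on slide 10:
        accessibility = [(False, False),  # no RSS feed     -> 50%
                         (False, False),  # no robots.txt   -> 50%
                         (False, True),   # no sitemap.xml  -> 0%
                         (True,  True)]   # 6 links, valid  -> 100%
        print(facet_score(accessibility))  # 50.0

        # Facet values from slide 18 combined into the overall score:
        print(website_archivability({"Accessibility": 50, "Cohesion": 70,
                                     "Standards Compliance": 77, "Metadata": 87,
                                     "Performance": 100}))  # 76.8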
    9. Accessibility: are web archiving crawlers able to discover all content using standard protocols and best practices?
    10. Accessibility evaluation of http://ipres2013.ist.utl.pt/ (Website Archivability evaluation on 23rd April 2013):
        no RSS feed: 50%; no robots.txt: 50%; no sitemap.xml: 0%; 6 links, all valid: 100%.
        Accessibility facet total: 50%.
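    Below is a minimal sketch of how a few of these accessibility probes (robots.txt, sitemap.xml, RSS feed discovery) could be automated over HTTP, assuming the requests library; it is illustrative only, not the checks ArchiveReady actually runs. It also illustrates the Sitemaps.xml and Robots.txt point from the speaker notes above.

        # Sketch of simple accessibility probes (robots.txt, sitemap.xml, RSS feed).
        import requests
        from urllib.parse import urljoin

        def url_exists(url):
            """True if the URL answers with HTTP 200."""
            try:
                return requests.get(url, timeout=10).status_code == 200
            except requests.RequestException:
                return False

        def accessibility_checks(site):
            html = requests.get(site, timeout=10).text.lower()
            return {
                "robots.txt": url_exists(urljoin(site, "/robots.txt")),
                "sitemap.xml": url_exists(urljoin(site, "/sitemap.xml")),
                # Crude RSS discovery: look for an alternate feed link in the markup.
                "rss feed": "application/rss+xml" in html or "application/atom+xml" in html,
            }

        print(accessibility_checks("http://ipres2013.ist.utl.pt/"))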
    11. Cohesion
        • Dependencies are a great issue in digital curation.
        • If a website is dispersed across different web locations (images, JavaScript, CSS, CDNs, etc.), acquisition and ingest are likely to suffer if one or more web locations fail or change.
        • Web bots may have issues accessing a lot of different web locations due to configuration issues.
    12. Cohesion evaluation of http://ipres2013.ist.utl.pt/ (23rd April 2013):
        1 external and no internal scripts: 0%; 4 local and 1 external images: 80%; no proprietary (QuickTime & Flash) files: 100%; 1 local CSS file: 100%.
        Cohesion facet total: 70%.
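    A small sketch of the kind of dispersion signal Cohesion measures: counting how many scripts, images and stylesheets are served from the site's own host versus external hosts. The parsing approach is an illustrative assumption, not the tool's actual logic.

        # Sketch: classify a page's scripts, images and stylesheets as local or external by host.
        from html.parser import HTMLParser
        from urllib.parse import urlparse, urljoin
        import requests

        class ResourceCollector(HTMLParser):
            def __init__(self):
                super().__init__()
                self.resources = []  # (tag, URL) pairs found in the page

            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if tag == "script" and attrs.get("src"):
                    self.resources.append(("script", attrs["src"]))
                elif tag == "img" and attrs.get("src"):
                    self.resources.append(("img", attrs["src"]))
                elif tag == "link" and attrs.get("href"):
                    self.resources.append(("link", attrs["href"]))

        def cohesion_report(site):
            host = urlparse(site).netloc
            parser = ResourceCollector()
            parser.feed(requests.get(site, timeout=10).text)
            report = {"local": 0, "external": 0}
            for _tag, url in parser.resources:
                absolute = urljoin(site, url)
                report["local" if urlparse(absolute).netloc == host else "external"] += 1
            return report

        print(cohesion_report("http://ipres2013.ist.utl.pt/"))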
    13. Metadata
        • Metadata are necessary for digital curation and archiving.
        • Lack of metadata impairs the ability to manage, organise, retrieve and interact with content.
        • Web content metadata may be:
          • syntactic (e.g. content encoding, character set),
          • semantic (e.g. description, keywords, dates),
          • pragmatic (e.g. FOAF, RDF, Dublin Core).
    14. Metadata evaluation of http://ipres2013.ist.utl.pt/ (23rd April 2013):
        meta description found: 100%; HTTP Content-Type found: 100%; HTTP page expiration not found: 50%; HTTP Last-Modified found: 100%.
        Metadata facet total: 87%.
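    A minimal sketch of such metadata probes, checking the HTTP response headers and a meta description tag; the regular-expression check is an illustrative simplification, not the tool's parser.

        # Sketch of the metadata probes above: response headers plus a <meta name="description"> check.
        import re
        import requests

        def metadata_checks(site):
            resp = requests.get(site, timeout=10)
            return {
                "meta description": bool(
                    re.search(r'<meta[^>]+name=["\']description["\']', resp.text, re.I)
                ),
                "Content-Type header": "Content-Type" in resp.headers,
                "Expires header": "Expires" in resp.headers,
                "Last-Modified header": "Last-Modified" in resp.headers,
            }

        print(metadata_checks("http://ipres2013.ist.utl.pt/"))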
    15. Performance
        • Calculate the average network response time for all website content.
        • The throughput of web spider data acquisition affects the number and complexity of the web sources it can process.
        • Performance evaluation of http://ipres2013.ist.utl.pt/ (23rd April 2013): average network response time is 0.546 ms: 100%. Performance facet total: 100%.
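    A small sketch of how average response time could be measured using the elapsed time reported per request; the list of URLs to sample is a placeholder, since in practice every harvested resource would be timed.

        # Sketch: average wall-clock response time, in milliseconds, across a set of URLs.
        import requests

        def average_response_time(urls):
            """Mean response time in milliseconds over the given URLs."""
            times = [requests.get(u, timeout=10).elapsed.total_seconds() * 1000 for u in urls]
            return sum(times) / len(times)

        pages = ["http://ipres2013.ist.utl.pt/"]  # placeholder; in practice, all site resources
        print(f"{average_response_time(pages):.1f} ms")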
    16. Standards Compliance
        • Digital curation best practices recommend that web resources must be represented in known and transparent standards in order to be preserved.
    17. Standards Compliance evaluation of http://ipres2013.ist.utl.pt/ (23rd April 2013):
        1 invalid CSS file: 0%; invalid HTML file: 0%; meta description found: 100%; no HTTP content encoding: 50%; HTTP Content-Type found: 100%; HTTP page expiration found: 100%; HTTP Last-Modified found: 100%; no QuickTime or Flash objects: 100%; 5 images found and validated with JHOVE: 100%.
        Standards Compliance facet total: 87%.
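    A sketch of one way to automate the HTML validation step by posting the page to an external checker. The validator.w3.org/nu endpoint and its JSON "messages" format are assumptions; the slides only state that HTML and CSS are validated and that images are checked with JHOVE.

        # Sketch: count HTML errors reported by an external validation service (assumed endpoint).
        import requests

        def html_errors(site):
            html = requests.get(site, timeout=10).content
            resp = requests.post(
                "https://validator.w3.org/nu/?out=json",   # assumed checker endpoint
                data=html,
                headers={"Content-Type": "text/html; charset=utf-8"},
                timeout=30,
            )
            messages = resp.json().get("messages", [])
            return [m for m in messages if m.get("type") == "error"]

        errors = html_errors("http://ipres2013.ist.utl.pt/")
        print("valid HTML" if not errors else f"{len(errors)} HTML errors")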
    18. iPRES 2013 Website Archivability evaluation:
        Accessibility: 50%; Cohesion: 70%; Standards Compliance: 77%; Metadata: 87%; Performance: 100%.
        Website Archivability: 77%.
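    For reference, the overall score above is consistent with an unweighted average of the five facet scores; the slides do not state the exact combination rule, so the average is an assumption.

        # Unweighted average of the facet scores from slide 18.
        facets = {"Accessibility": 50, "Cohesion": 70, "Standards Compliance": 77,
                  "Metadata": 87, "Performance": 100}
        print(round(sum(facets.values()) / len(facets)))  # -> 77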
    19. ArchiveReady.com demonstration:
        - web application implementing CLEAR,
        - web interface and also a Web API in JSON,
        - running on Linux, Python, Nginx, Redis, MySQL.
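    The slide only states that a JSON Web API exists; the endpoint path and parameter name in the sketch below are hypothetical placeholders, not the documented ArchiveReady interface.

        # Hypothetical call to the ArchiveReady JSON API; the /api/evaluate endpoint and
        # the "url" parameter are placeholders, not a documented interface.
        import requests

        response = requests.get(
            "http://archiveready.com/api/evaluate",       # placeholder endpoint
            params={"url": "http://ipres2013.ist.utl.pt/"},
            timeout=60,
        )
        print(response.json())  # expected shape: per-facet ratings plus an overall score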
    20. Impact
        1. Web professionals:
           - evaluate the archivability of their websites in an easy but thorough way,
           - become aware of web preservation concepts,
           - embrace preservation-friendly practices.
        2. Web archive operators:
           - make informed decisions on archiving websites,
           - perform large-scale website evaluations with ease,
           - automate web archiving Quality Assurance,
           - minimise resources wasted on problematic websites.
    21. Future Work
        1. It is not optimal to treat all Archivability Facets as equal.
        2. Evaluating a single website page relies on the assumption that web pages from the same website share the same components and standards; sampling would be necessary.
        3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others; differential valuing of error classes and types is necessary.
        4. Cross-validation with web archive data is under way.
    22. THANK YOU. Vangelis Banos. Web: http://vbanos.gr/, Email: vbanos@gmail.com. ANY QUESTIONS? The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.
