The Theory and Practice
of Website Archivability
Vangelis Banos¹, Yunhyong Kim², Seamus Ross², Yannis Manolopoulos¹
¹ Department of Informatics, Aristotle University, Thessaloniki, Greece
² University of Glasgow, United Kingdom
FROM CLEAR TO ARCHIVEREADY.COM
Table of Contents
1. Problem definition,
2. CLEAR: A Credible Live Method to
Evaluate Website Archivability,
3. Demo: http://archiveready.com/,
4. Future Work.
Problem definition
• Web content acquisition is a critical step in
the process of web archiving,
• Web bots face increasing difficulties in
harvesting websites,
• After web harvesting, archive administrators
manually review the content and endorse or
reject the harvested material,
• Key Problem: Web harvesting is automated
while Quality Assurance (QA) is manual.
What is Website Archivability?
Website Archivability captures the core aspects
of a website crucial in diagnosing whether it has
the potentiality to be archived with
completeness and accuracy.
Attention! It must not be confused with website dependability,
reliability, availability, safety, security, survivability or maintainability.
CLEAR: A Credible Live Method to Evaluate
Website Archivability
• An approach to producing a credible on-the-fly
measurement of Website Archivability, by:
  • Using standard HTTP to get website elements,
  • Evaluating information such as file types, content
  encoding and transfer errors,
  • Combining this information with an evaluation of the
  website's compliance with recognised practices in
  digital curation:
    • Using adopted standards, validating formats,
    assigning metadata,
  • Calculating a Website Archivability Score (0 – 100%).
[Figure: the five Archivability Facets evaluated by CLEAR: Accessibility, Cohesion, Standards Compliance, Performance, Metadata]
[Figure: Website attributes evaluated using CLEAR]
CLEAR
• The method can be summarised as follows:
1. Perform specific Evaluations on Website
Attributes,
2. Use the results to calculate each Archivability
Facet's score:
• Scores range from 0 to 100%,
• Not all evaluations are equal: if an important
evaluation fails, its score is 0%; if a minor
evaluation fails, its score is 50%,
3. Produce the final Website Archivability score as the
average of all Facets' scores.
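The scoring steps above can be sketched in a few lines of Python. This is only an illustration, not the ArchiveReady implementation. It assumes that a facet score is the plain average of its evaluation ratings and that the final score is the plain average of the facet scores, which is what the figures in the iPRES 2013 evaluation suggest. The function names are hypothetical.

```python
from statistics import mean

def facet_score(ratings):
    """Average the evaluation ratings (each 0-100) of one facet (assumed rule)."""
    return mean(ratings)

def website_archivability(facet_scores):
    """Average the facet scores, treating all facets as equal (assumed rule)."""
    return mean(list(facet_scores.values()))

# Evaluation ratings copied from the iPRES 2013 tables in this deck.
accessibility = facet_score([50, 50, 0, 100])
cohesion = facet_score([0, 80, 100, 100])

# Facet totals as reported on the iPRES 2013 summary slide.
reported = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}
overall = website_archivability(reported)
print(accessibility, cohesion, round(overall))
```

With the reported facet totals, the average is 76.8, which rounds to the 77% shown for the iPRES 2013 website.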
Accessibility
Are web archiving crawlers able to
discover all content using standard
protocols and best practices?
Accessibility evaluation
Facet          Evaluation           Rating  Total
Accessibility  No RSS feed          50%     50%
               No robots.txt        50%
               No sitemap.xml       0%
               6 links, all valid   100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
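A check for the two well-known files in the table could look roughly like the sketch below. The ratings for a missing file are taken from the table (robots.txt 50%, sitemap.xml 0%). The 100% rating for a file that is present, the function names and the injected `fetch_status` callable are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Ratings for a missing file follow the iPRES 2013 table above;
# rating a file that is present at 100% is an assumption.
MISSING_RATING = {"robots.txt": 50, "sitemap.xml": 0}

def rate_wellknown_files(base_url, fetch_status):
    """Rate each well-known file by whether fetch_status(url) returns 200."""
    ratings = {}
    for name, missing in MISSING_RATING.items():
        url = urljoin(base_url, "/" + name)
        ratings[name] = 100 if fetch_status(url) == 200 else missing
    return ratings

def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request: nothing is found.
    return 404

print(rate_wellknown_files("http://ipres2013.ist.utl.pt/", fake_fetch))
```

Passing the HTTP call in as a parameter keeps the rating logic testable without network access; a real crawler would supply a function that performs the request.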
Cohesion
• Dependencies are a major issue in digital curation.
• If a website is dispersed across different web
locations (images, JavaScript, CSS, CDNs, etc.),
acquisition and ingest are at risk if
one or more of those locations fail or change.
• Web bots may have trouble accessing many
different web locations due to configuration issues.
Cohesion evaluation
Facet     Evaluation                                 Rating  Total
Cohesion  1 external and no internal scripts         0%      70%
          4 local and 1 external images              80%
          No proprietary (QuickTime & Flash) files   100%
          1 local CSS file                           100%
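The image and script ratings in the Cohesion table match the share of resources served from the page's own host (4 of 5 images gives 80%, 0 of 1 scripts gives 0%). A minimal sketch of that calculation follows. It assumes that "local" means "same host as the page", and the function name and image paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def local_share(page_url, resource_urls):
    """Return the percentage of resources served from the page's own host."""
    if not resource_urls:
        return 100.0  # assumption: nothing to fetch means nothing external
    page_host = urlparse(page_url).netloc
    local = sum(
        1 for r in resource_urls
        # Relative URLs are resolved against the page before comparing hosts.
        if urlparse(urljoin(page_url, r)).netloc == page_host
    )
    return 100.0 * local / len(resource_urls)

# Hypothetical image list: four local files and one on an external host.
images = [
    "img/logo.png",
    "/img/banner.jpg",
    "img/a.gif",
    "http://ipres2013.ist.utl.pt/img/b.png",
    "http://cdn.example.org/c.png",
]
print(local_share("http://ipres2013.ist.utl.pt/", images))
```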
Metadata
• Metadata are necessary for digital curation and
archiving.
• Lack of metadata impairs the ability to manage,
organise, retrieve and interact with content.
• Web content metadata may be:
• Syntactic: (e.g. content encoding, character set)
• Semantic: (e.g. description, keywords, dates)
• Pragmatic: (e.g. FOAF, RDF, Dublin Core)
Metadata evaluation
Facet     Evaluation                       Rating  Total
Metadata  Meta description found           100%    87%
          HTTP Content type                100%
          HTTP Page expiration not found   50%
          HTTP Last-modified found         100%
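The presence checks behind the Metadata table could be sketched as follows, using only the Python standard library. The sketch assumes that "HTTP Page expiration" refers to the `Expires` response header; the class name, function name and sample values are hypothetical, and no ratings are assigned.

```python
from html.parser import HTMLParser

class MetaDescriptionFinder(HTMLParser):
    """Set .found to True when a <meta name="description"> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.found = True

def metadata_checks(headers, html):
    """Report which of the metadata items in the table are present."""
    names = {k.lower() for k in headers}  # header names are case-insensitive
    finder = MetaDescriptionFinder()
    finder.feed(html)
    return {
        "meta description": finder.found,
        "content-type": "content-type" in names,
        "expires": "expires" in names,  # assumed meaning of "page expiration"
        "last-modified": "last-modified" in names,
    }

# Hypothetical response mirroring the outcome shown in the table.
sample_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "Tue, 23 Apr 2013 10:00:00 GMT",
}
sample_html = '<html><head><meta name="description" content="iPRES 2013"></head></html>'
result = metadata_checks(sample_headers, sample_html)
print(result)
```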
Performance
• Calculate the average network response time for all
website content.
• The throughput of web spider data acquisition
affects the number and complexity of the web
sources it can process.
• Performance evaluation:
Facet        Evaluation                                 Rating  Total
Performance  Average network response time is 0.546ms   100%    100%
Standards Compliance
• Digital curation best practices recommend that web
resources be represented in known and
transparent standards so that they can be preserved.
Standards Compliance evaluation
Facet                 Evaluation                                Rating  Total
Standards Compliance  1 Invalid CSS file                        0%      87%
                      Invalid HTML file                         0%
                      Meta description found                    100%
                      No HTTP Content encoding                  50%
                      HTTP Content Type found                   100%
                      HTTP Page expiration found                100%
                      HTTP Last-modified found                  100%
                      No QuickTime or Flash objects             100%
                      5 images found and validated with JHOVE   100%
iPRES 2013 Website Archivability Evaluation
Facet                 Rating  Website Archivability
Accessibility         50%     77%
Cohesion              70%
Standards Compliance  77%
Metadata              87%
Performance           100%
ArchiveReady.com
Demonstration
- Web application implementing CLEAR,
- Web interface and a Web API returning JSON,
- Running on Linux, Python, Nginx, Redis and MySQL.
Impact
1. Web professionals
- evaluate the archivability of their websites
in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.
2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.
Future Work
1. Treating all Archivability Facets as equal is not optimal.
2. CLEAR currently evaluates a single web page, on the
assumption that web pages from the same website
share the same components and standards.
Sampling would be necessary.
3. Certain classes and specific types of errors create
lesser or greater obstacles to website acquisition
and ingest than others. Differential valuing of error
classes and types is necessary.
4. Cross-validation with web archive data is under way.
THANK YOU
Vangelis Banos
Web: http://vbanos.gr/
Email: vbanos@gmail.com
ANY QUESTIONS?
The research leading to these results has
received funding from the European
Commission Framework Programme 7
(FP7), BlogForever project, grant
agreement No.269963.

More Related Content

Similar to The theory and practice of Website Archivability

Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's pptmak57
 
Case Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformCase Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformMike Taylor
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Anna Perricci
 
What’s Next with Accessibility?
What’s Next with Accessibility?What’s Next with Accessibility?
What’s Next with Accessibility?Keana Lynch
 
Managing Accessibility Compliance
Managing Accessibility ComplianceManaging Accessibility Compliance
Managing Accessibility ComplianceKeana Lynch
 
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...Keys To World-Class Retail Web Performance - Expert tips for holiday web read...
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...SOASTA
 
Research study on content management systems (CMS): issues with the conventio...
Research study on content management systems (CMS): issues with the conventio...Research study on content management systems (CMS): issues with the conventio...
Research study on content management systems (CMS): issues with the conventio...IRJET Journal
 
3 (de 3). Evaluación de Accessibilidad Digital
3 (de 3).  Evaluación de Accessibilidad Digital3 (de 3).  Evaluación de Accessibilidad Digital
3 (de 3). Evaluación de Accessibilidad DigitalDCU_MPIUA
 
7 Section Website Assessment
7 Section Website Assessment 7 Section Website Assessment
7 Section Website Assessment Corey84
 
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital AccessibilityTCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital AccessibilityRaymond Rose
 
Automated Inference of Access Control Policies for Web Applications
Automated Inference of Access Control Policies for Web ApplicationsAutomated Inference of Access Control Policies for Web Applications
Automated Inference of Access Control Policies for Web ApplicationsLionel Briand
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Healthwellunwell
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Healthwellunwell
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Healthwellunwell
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Healthwellunwell
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Healthwellunwell
 

Similar to The theory and practice of Website Archivability (20)

The Accessible Web
The Accessible WebThe Accessible Web
The Accessible Web
 
Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's ppt
 
IR-AUDIT
IR-AUDITIR-AUDIT
IR-AUDIT
 
Case Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformCase Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis Platform
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
 
What’s Next with Accessibility?
What’s Next with Accessibility?What’s Next with Accessibility?
What’s Next with Accessibility?
 
Managing Accessibility Compliance
Managing Accessibility ComplianceManaging Accessibility Compliance
Managing Accessibility Compliance
 
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...Keys To World-Class Retail Web Performance - Expert tips for holiday web read...
Keys To World-Class Retail Web Performance - Expert tips for holiday web read...
 
Research study on content management systems (CMS): issues with the conventio...
Research study on content management systems (CMS): issues with the conventio...Research study on content management systems (CMS): issues with the conventio...
Research study on content management systems (CMS): issues with the conventio...
 
3 (de 3). Evaluación de Accessibilidad Digital
3 (de 3).  Evaluación de Accessibilidad Digital3 (de 3).  Evaluación de Accessibilidad Digital
3 (de 3). Evaluación de Accessibilidad Digital
 
7 Section Website Assessment
7 Section Website Assessment 7 Section Website Assessment
7 Section Website Assessment
 
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital AccessibilityTCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
 
Automated Inference of Access Control Policies for Web Applications
Automated Inference of Access Control Policies for Web ApplicationsAutomated Inference of Access Control Policies for Web Applications
Automated Inference of Access Control Policies for Web Applications
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Health
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Health
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Health
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Health
 
Funcsp Open Health
Funcsp Open HealthFuncsp Open Health
Funcsp Open Health
 

More from Vangelis Banos

ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγειαVangelis Banos
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaVangelis Banos
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςVangelis Banos
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςVangelis Banos
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeVangelis Banos
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...Vangelis Banos
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήVangelis Banos
 

More from Vangelis Banos (7)

ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγεια
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της Μετρολογίας
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challenge
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτή
 

Recently uploaded

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxNikitaBankoti2
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 

Recently uploaded (20)

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 

The theory and practice of Website Archivability

  • 1. The Theory and Practice of Website Archivability Vangelis Banos1, Yunhyong Kim2, Seamus Ross2, Yannis Manolopoulos1 1Department of Informatics, Aristotle University, Thessaloniki , Greece 2University of Glasgow, United Kingdom FROM CLEAR TO ARCHIVEREADY.COM
  • 2. 2 Table of Contents 1. Problem definition, 2. CLEAR: A Credible Live Method to Evaluate Website Archivability, 3. Demo: http://archiveready.com/, 4. Future Work.
  • 3. Problem definition • Web content acquisition is a critical step in the process of web archiving, • Web bots face increasing difficulties in harvesting websites, • After web harvesting, archive administrators review manually the content and endorse or reject the harvested material, • Key Problem: Web harvesting is automated while Quality Assurance (QA) is manual. 3
  • 4. Website Archivability ? What is Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. Attention! it must not be confused with website dependability, reliability, availability, safety, security, survivability, maintainability.
  • 5. CLEAR: A Credible Live Method to Evaluate Website Archivability • An approach to producing a credible on-the-fly measurement of Website Archivability, by: • Using standard HTTP to get website elements, • Evaluating information such as file types, content encoding and transfer errors, • Combining this information with an evaluation of the website's compliance with recognised practices in digital curation, • Using adopted standards, validating formats, assigning metadata • Calculating Website Archivability Score (0 – 100%) 5
  • 6. 6 Accessibility Cohesion Standards Compliance Performance Metadata CLEAR: A Credible Live Method to Evaluate Website Archivability
  • 8. 8 C L E A R • The method can be summarised as follows: 1. Perform specific Evaluations on Website Attributes, 2. In order to calculate each Archivability Facet’s score, • Scores range from (0 – 100%), • Not all evaluations are equal, if an important evaluation fails, score = 0, if a minor evaluation fails, score = 50% 3. Producing the final Website Archivability as the sum all Facets’ scores.
  • 9. Accessibility 9 Are web archiving crawlers able to discover all content using standard protocols and best practices?
  • 10. Accessibility evaluation 10 Facet Evaluation Rating Total Accessibility No RSS feed 50% 50% No robots.txt 50% No sitemap.xml 0% 6 links, all valid 100% http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
  • 11. Cohesion 11 • Dependencies are a great issue in digital curation. • If a website is dispersed across different web locations (images, javascripts, CSS, CDNs, etc), the acquisition and ingest is likely to risk suffering if one or more web locations fail on change. • Web bots may have issues accessing a lot of different web locations due to configuration issues.
  • 12. Cohesion evaluation 12 Facet Evaluation Rating Total Cohesion 1 external and no internal scripts 0% 70% 4 local and 1 external images 80% No proprietary (Quicktime & Flash) files 100% 1 local CSS file 100% http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
  • 13. Metadata 13 • Metadata are necessary for digital curation and archiving. • Lack of metadata impairs the ability to manage, organise, retrieve and interact with content. • Web content metadata may be: • Syntactic: (e.g. content encoding, character set) • Semantic: (e.g. description, keywords, dates) • Pragmatic: (e.g. FOAF, RDF, Dublin Core)
  • 14. Metadata evaluation 14 Facet Evaluation Rating Total Metadata Meta description found 100% 87% HTTP Content type 100% HTTP Page expiration not found 50% HTTP Last-modified found 100% http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
  • 15. Performance 15 • Calculate the average network response time for all website content. • The throughput of web spider data acquisition affects the number and complexity of the web sources it can process. • Performance evaluation: Facet Evaluation Rating Total Performance Average network response time is 0.546ms 100% 100% http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
  • 16. Standards Compliance 16 • Digital curation best practices recommend that web resources must be represented in known and transparent standards, in order to be preserved.
  • 17. Standards Compliance evaluation 17 Facet Evaluation Rating Total Standards Compliance 1 Invalid CSS file 0% 87% Invalid HTML file 0% Meta description found 100% No HTTP Content encoding 50% HTTP Content Type found 100% HTTP Page expiration found 100% HTTP Last-modified found 100% No Quicktime or Flash objects 100% 5 images found and validated with JHOVE 100% http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
  • 18. iPRES 2013 Website Archivability Evaluation 18 Facet Rating Website Archivability Accessibility 50% 77% Cohesion 70% Standards Compliance 77% Metadata 87% Performance 100%
  • 19. ArchiveReady.com Demonstration - Web application implementing CLEAR, - Web interface & also Web API in JSON, - Running on Linux, Python, Nginx, Redis, Mysql. 19
  • 20. Impact 20 1. Web professionals - evaluate the archivability of their websites in an easy but thorough way, - become aware of web preservation concepts, - embrace preservation-friendly practices. 2. Web archive operators - make informed decisions on archiving websites, - perform large scale website evaluations with ease, - automate web archiving Quality Assurance, - minimise wasted resources on problematic websites.
  • 21. 21 Future Work 1. Not optimal to treat all Archivability Facets as equal. 2. Evaluating a single website page, based on the assumption that web pages from the same website share the same components and standards. Sampling would be necessary. 3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. Differential valuing of error classes and types is necessary. 4. Cross validation with web archive data is under way
  • 22. THANK YOU Vangelis Banos Web: http://vbanos.gr/ Email: vbanos@gmail.com ANY QUESTIONS? The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.
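
The overall figure on slide 18 is consistent with an unweighted mean of the five facet ratings, which also matches the Future Work remark that all facets are currently treated as equal. The following is a minimal sketch under that assumption, not the ArchiveReady.com implementation; the function name `archivability_score` and the dictionary layout are hypothetical.

```python
# Facet ratings (percent) as reported on slide 18 for http://ipres2013.ist.utl.pt/
facet_ratings = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}


def archivability_score(ratings):
    """Return the unweighted mean of the facet ratings.

    Assumption: every facet counts equally toward the overall score.
    """
    if not ratings:
        raise ValueError("at least one facet rating is required")
    return sum(ratings.values()) / len(ratings)


score = archivability_score(facet_ratings)
print(f"Website Archivability: {score:.1f}%")  # 76.8%, shown as 77% on the slide
```

Rounding the mean to the nearest integer reproduces the 77% reported for the iPRES 2013 website.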

Editor's Notes

  1. Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that a website could be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
  2. Web content acquisition is a critical step in the process of web archiving. If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless. There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully. Web bots face increasing difficulties in harvesting websites. Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is. Dirty data -> useless system. As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
  3. Website archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1]. Support web archivists in decision making, in order to improve the quality of web archives. Expand and optimize the knowledge and practices of web archivists. Standardize the web aggregation practices of web archives, especially QA. Foster good practices in web development, make sites more amenable to harvesting, ingesting, and preserving. Raise awareness among web professionals regarding preservation.
  4. The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
  5. Already contacted by the following institutions: The Internet Archive, University of Manchester, Columbia University Libraries, Society of California Archivists General Assembly, Old Dominion University (Virginia, USA), Digital Archivists in Netherlands.
  6. For instance, Metadata breadth and depth might be critical for a particular web archiving research task, and therefore in establishing the archivability score for a particular site the user may wish to instantiate this thinking in calculating the overall score. A next step will be to introduce a mechanism to allow the user to weight each Archivability Facet to reflect specific objectives. One way to address these concerns might be to apply an approach similar to normalized discounted cumulative gain (NDCG) in information retrieval: for example, a user can rank the questions/errors to prioritise them for each facet. The basic archivability score can be adjusted to penalise the outcome when the website does not meet the higher ranked criteria. Further experimentation with the tool will lead to a richer understanding of new directions in automation in web archiving.
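
The facet-weighting idea in note 6 can be illustrated with a plain weighted mean. This is only a sketch of one possible mechanism, not the NDCG-style adjustment the note mentions and not part of CLEAR as presented; the function `weighted_archivability` and the example weights are hypothetical, while the ratings are the iPRES 2013 facet ratings from slide 18.

```python
def weighted_archivability(ratings, weights):
    """Return the weighted mean of facet ratings (percent).

    `weights` maps each facet name to a non-negative importance value
    chosen by the user (hypothetical; CLEAR as presented weights facets equally).
    """
    total_weight = sum(weights[facet] for facet in ratings)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(ratings[facet] * weights[facet] for facet in ratings)
    return weighted_sum / total_weight


# Facet ratings reported for http://ipres2013.ist.utl.pt/ on slide 18.
ratings = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}

# Hypothetical preference: Metadata matters three times as much as any other facet.
metadata_heavy = {facet: 1 for facet in ratings}
metadata_heavy["Metadata"] = 3

equal = weighted_archivability(ratings, {facet: 1 for facet in ratings})
adjusted = weighted_archivability(ratings, metadata_heavy)
print(f"equal weights: {equal:.1f}%, metadata-heavy: {adjusted:.1f}%")
```

With equal weights the function reduces to the unweighted mean, so existing scores stay comparable; raising the Metadata weight pulls the overall score toward that facet's rating.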