CLEAR: a Credible Live Evaluation Method of Website Archivability
Vangelis Banos (1), Yunhyong Kim (2), Seamus Ross (2), Yannis Manolopoulos (1)
(1) Department of Informatics, Aristotle University, Thessaloniki, Greece
(2) University of Glasgow, United Kingdom
ARCHIVEREADY.COM
Table of Contents
1. Problem definition and related work,
2. Our contributions,
3. Website Archivability,
4. CLEAR: A Credible Live Method to
Evaluate Website Archivability,
5. Demonstration: http://archiveready.com/,
6. Limitations and Future Work.
Problem definition
• Web content acquisition is a critical step in the
process of web archiving;
• If the initial Submission Information Package lacks
completeness and accuracy for any reason (e.g.
missing or invalid web content), the rest of the
preservation processes are rendered useless;
• There is no guarantee that web bots dedicated to
retrieving website content can access and retrieve
it successfully;
• Web bots face increasing difficulties in
harvesting websites.
Problem definition
• After web harvesting, administrators manually review
the content and endorse or reject the harvested material.
• Web harvesting is automated while Quality
Assurance (QA) is manual.
• Efforts to deploy crowdsourced techniques to
manage QA provide an indication of how significant
the bottleneck is.
There is a need for a method to assess
website archive readiness in order to
support web archiving workflow.
Inspired by our work at BlogForever (http://blogforever.eu),
building a blog preservation software platform.
Our Contributions
1. The introduction of the notion of Website
Archivability,
2. The definition of the Credible Live
Evaluation of Archive Readiness
(CLEAR) method to measure Website
Archivability,
3. ArchiveReady.com, a web application
which implements the proposed method.
Our Aims
1. Provide a mechanism to improve the quality of web archives.
2. Expand and optimize the knowledge and practices of
web archivists, supporting them in their decision
making and risk management.
3. Standardize the web aggregation practices of web
archives, especially QA.
4. Foster good practices in web development, making
sites more amenable to harvesting, ingesting, and
preserving.
5. Raise awareness among web professionals regarding
preservation.
What is Website Archivability?
Website Archivability captures the core aspects
of a website crucial in diagnosing whether it has
the potential to be archived with
completeness and accuracy.
Attention! It must not be confused with website
dependability, reliability, availability, safety, security, survivability,
or maintainability.
CLEAR: A Credible Live Method to Evaluate
Website Archivability
• An approach to producing on-the-fly measurement
of Website Archivability,
• Web archives communicate with target websites via
standard HTTP,
• Information such as file types, content and transfer
errors could be used to support archival decisions,
• We combine this kind of information with an
evaluation of the website's compliance with
recognised practices in digital curation,
• We generate a credible score representing the
archivability of target websites.
Archivability Facets: Accessibility, Cohesion, Standards Compliance,
Performance, Metadata
[Figure: Website attributes evaluated using CLEAR]
CLEAR
• The method can be summarised as follows:
1. Perform specific Evaluations on Website
Attributes.
2. Calculate each Archivability Facet’s score
from its Evaluations:
• Scores range from 0% to 100%,
• Not all evaluations are equal: if an important
evaluation fails, score = 0%; if a minor
evaluation fails, score = 50%.
3. Produce the final Website Archivability as the
equally weighted sum (i.e. the average) of all Facets’ scores.
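The scoring steps above can be sketched in Python. This is only an illustration, not the ArchiveReady.com implementation. It assumes that a Facet's score is the plain average of its Evaluation ratings and that Website Archivability is the unweighted average of the Facet scores, which is consistent with the iPRES 2013 example figures in this deck. The function names are hypothetical.

```python
def facet_score(ratings):
    """Average the evaluation ratings (0-100) of one Archivability Facet."""
    if not ratings:
        raise ValueError("a facet needs at least one evaluation rating")
    return sum(ratings) / len(ratings)


def website_archivability(facet_scores):
    """Average the facet scores, treating every facet as equally important."""
    if not facet_scores:
        raise ValueError("at least one facet score is required")
    return sum(facet_scores.values()) / len(facet_scores)


# Accessibility ratings from the iPRES 2013 example:
# no RSS feed (50), no robots.txt (50), no sitemap.xml (0), all links valid (100).
accessibility = facet_score([50, 50, 0, 100])

# Facet totals as reported in the iPRES 2013 example.
facets = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}
overall = website_archivability(facets)

print(f"Accessibility: {accessibility:.0f}%")
print(f"Website Archivability: {overall:.0f}%")
```

With these inputs the sketch gives 50% for Accessibility and about 77% overall, the values reported for the example website.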
Accessibility
• A website is considered accessible only if web
crawlers are able to visit its home page, traverse its
content and retrieve it via standard HTTP requests.
Facet          Evaluation          Rating  Total
Accessibility  No RSS feed         50%     50%
               No robots.txt       50%
               No sitemap.xml      0%
               6 links, all valid  100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
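One of the evaluations above, the presence of an RSS feed, can be approximated offline by looking for a feed link in the page's HTML. This is a minimal sketch, assuming feed autodiscovery through a link element with rel="alternate" and an RSS or Atom media type. How ArchiveReady.com actually detects feeds is not stated in the slides, and the class and function names are hypothetical. The 50% rating for a missing feed mirrors the example table.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link rel="alternate"> that advertises a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None for valueless attributes.
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        media_type = attributes.get("type", "").lower()
        if "alternate" in rel_values and media_type in FEED_TYPES:
            self.feeds.append(attributes.get("href", ""))


def rss_feed_rating(html):
    """Return 100 if the page advertises a feed, else 50 (assumed minor failure)."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return 100 if finder.feeds else 50


# Hypothetical pages used only to exercise the check.
page_with_feed = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="/feed.xml"></head></html>'
)
page_without_feed = "<html><head><title>No feed</title></head></html>"

print(rss_feed_rating(page_with_feed), rss_feed_rating(page_without_feed))
```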
Cohesion
• Relevant to:
• Efficient operation of web crawlers,
• Management of dependencies within digital
curation.
• If files constituting a single website are dispersed
across different web locations, acquisition and
ingest are likely to suffer if one or more of those
locations fail.
• A website that does not use third-party resources
is not affected by changes that occur outside it.
Facet     Evaluation                                Rating  Total
Cohesion  1 external and no internal scripts        0%      70%
          4 local and 1 external images             80%
          No proprietary (Quicktime & Flash) files  100%
          1 local CSS file                          100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
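The image rating above (4 local and 1 external images, 80%) suggests a rating equal to the share of resources served from the website's own host. The following is a minimal sketch under that assumption. The function name and the resource URLs are hypothetical, and the case of a page with no resources of a given kind is left undefined because the slides do not specify it.

```python
from urllib.parse import urljoin, urlparse


def local_share_rating(page_url, resource_urls):
    """Percentage of resources hosted on the same host as the page itself."""
    if not resource_urls:
        raise ValueError("no resources to evaluate")
    page_host = urlparse(page_url).hostname
    local = 0
    for resource in resource_urls:
        # Resolve relative references against the page URL before comparing hosts.
        absolute = urljoin(page_url, resource)
        if urlparse(absolute).hostname == page_host:
            local += 1
    return 100 * local / len(resource_urls)


page = "http://ipres2013.ist.utl.pt/"

# Hypothetical resource lists: four local images and one external image,
# plus a single external script.
images = [
    "img/logo.png",
    "/img/banner.jpg",
    "http://ipres2013.ist.utl.pt/img/map.png",
    "img/sponsor.png",
    "http://cdn.example.org/badge.png",
]
scripts = ["http://cdn.example.org/analytics.js"]

print(local_share_rating(page, images), local_share_rating(page, scripts))
```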
Metadata
• The adequate provision of metadata has been a
continuing concern within digital curation.
• The lack of metadata impairs the archive’s ability to
manage, organise, retrieve and interact with content
effectively.
Facet     Evaluation                      Rating  Total
Metadata  Meta description found          100%    87%
          HTTP Content type               100%
          HTTP Page expiration not found  50%
          HTTP Last-modified found        100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
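The HTTP-related evaluations above can be illustrated with a small check over response headers. This is a sketch only. It assumes that "page expiration" corresponds to the Expires header and that any missing header counts as a minor failure (50%), as the example table does for page expiration. The function name and the sample header values are hypothetical.

```python
def header_ratings(headers, checked=("Content-Type", "Expires", "Last-Modified")):
    """Rate each checked header: 100 if present, 50 if missing (assumed minor)."""
    # HTTP header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name, value in headers.items() if value}
    return {name: (100 if name.lower() in present else 50) for name in checked}


# Hypothetical response headers: no Expires header was sent.
response_headers = {
    "content-type": "text/html; charset=utf-8",
    "Last-Modified": "Tue, 23 Apr 2013 10:00:00 GMT",
}
ratings = header_ratings(response_headers)

print(ratings)
```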
Performance
Performance is an important aspect of web archiving.
The throughput of data acquisition of a web spider
directly affects the number and complexity of web
resources it is able to process.
Facet        Evaluation                                Rating  Total
Performance  Average network response time is 0.546ms  100%    100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
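An average response time like the one above can be obtained by timing a handful of requests. This is a minimal sketch, assuming the fetch function is supplied by the caller so that the timing logic can be exercised without network access. The names are hypothetical, and the threshold that turns a response time into a rating is not given in the slides, so none is applied.

```python
import time


def average_response_time(fetch, urls):
    """Return the mean wall-clock time, in seconds, that fetch(url) takes."""
    if not urls:
        raise ValueError("at least one URL is required")
    elapsed = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def fake_fetch(url):
    """Stand-in for a real HTTP request; it only simulates a short delay."""
    time.sleep(0.01)


average = average_response_time(fake_fetch, ["http://ipres2013.ist.utl.pt/"] * 3)
print(f"{average * 1000:.1f} ms")
```

For a live measurement, a function that wraps `urllib.request.urlopen` could be passed as `fetch` instead of the stand-in.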
Standards Compliance
• Compliance with standards is a recurring theme in
digital curation practices. For digital resources to be
preserved, it is recommended that they be
represented in known and transparent standards.
Facet       Evaluation                               Rating  Total
Standards   1 Invalid CSS file                       0%      87%
Compliance  Invalid HTML file                        0%
            Meta description found                   100%
            No HTTP Content encoding                 50%
            HTTP Content Type found                  100%
            HTTP Page expiration found               100%
            HTTP Last-modified found                 100%
            No Quicktime or Flash objects            100%
            5 images found and validated with JHOVE  100%
http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
iPRES 2013 Website Archivability Evaluation
Facet                 Rating  Website Archivability
Accessibility         50%     77%
Cohesion              70%
Standards Compliance  77%
Metadata              87%
Performance           100%
ArchiveReady.com
Demonstration
- Web application implementing CLEAR,
- Web interface & also Web API in JSON,
- Running on Linux, Python, Nginx, Redis, MySQL.
Impact
1. Web professionals
- evaluate the archivability of their websites
in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.
2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.
Limitations & Future Work
1. Not optimal to treat all Archivability Facets as equal.
2. Evaluating a single website page, based on the
assumption that web pages from the same website
share the same components and standards.
Sampling would be necessary.
3. Certain classes and specific types of errors create
lesser or greater obstacles to website acquisition
and ingest than others. The method needs to be
enhanced to reflect this differential valuing of error
classes and types.
THANK YOU
Vangelis Banos
Web: http://vbanos.gr/
Email: vbanos@gmail.com
ANY QUESTIONS?
The research leading to these results has
received funding from the European
Commission Framework Programme 7
(FP7), BlogForever project, grant
agreement No.269963.

More Related Content

What's hot

Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsisMayur Garg
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerAkshay Pratap Singh
 
Coding for a wget based Web Crawler
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web CrawlerSanchit Saini
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawlerRishikesh Pathak
 
REST and ASP.NET Web API (Tunisia)
REST and ASP.NET Web API (Tunisia)REST and ASP.NET Web API (Tunisia)
REST and ASP.NET Web API (Tunisia)Jef Claes
 
REST and ASP.NET Web API (Milan)
REST and ASP.NET Web API (Milan)REST and ASP.NET Web API (Milan)
REST and ASP.NET Web API (Milan)Jef Claes
 
Crawler-Friendly Web Servers
Crawler-Friendly Web ServersCrawler-Friendly Web Servers
Crawler-Friendly Web Serverswebhostingguy
 
What is a web crawler and how does it work
What is a web crawler and how does it workWhat is a web crawler and how does it work
What is a web crawler and how does it workSwati Sharma
 
Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler Sanchit Saini
 
REST Methodologies
REST MethodologiesREST Methodologies
REST Methodologiesjrodbx
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web ApplicationAllan Huang
 
The ASP.NET Web API for Beginners
The ASP.NET Web API for BeginnersThe ASP.NET Web API for Beginners
The ASP.NET Web API for BeginnersKevin Hazzard
 
Website optimization with request reduce
Website optimization with request reduceWebsite optimization with request reduce
Website optimization with request reduceMatt Wrock
 
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlusLeveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlusMyka Kennedy Stephens
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...ijwscjournal
 

What's hot (19)

Web API Basics
Web API BasicsWeb API Basics
Web API Basics
 
Web crawler synopsis
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
 
Colloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web Crawler
 
Coding for a wget based Web Crawler
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web Crawler
 
Webscripts
WebscriptsWebscripts
Webscripts
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
 
REST and ASP.NET Web API (Tunisia)
REST and ASP.NET Web API (Tunisia)REST and ASP.NET Web API (Tunisia)
REST and ASP.NET Web API (Tunisia)
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Seminar on crawler
Seminar on crawlerSeminar on crawler
Seminar on crawler
 
REST and ASP.NET Web API (Milan)
REST and ASP.NET Web API (Milan)REST and ASP.NET Web API (Milan)
REST and ASP.NET Web API (Milan)
 
Crawler-Friendly Web Servers
Crawler-Friendly Web ServersCrawler-Friendly Web Servers
Crawler-Friendly Web Servers
 
What is a web crawler and how does it work
What is a web crawler and how does it workWhat is a web crawler and how does it work
What is a web crawler and how does it work
 
Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler
 
REST Methodologies
REST MethodologiesREST Methodologies
REST Methodologies
 
HTML5 Offline Web Application
HTML5 Offline Web ApplicationHTML5 Offline Web Application
HTML5 Offline Web Application
 
The ASP.NET Web API for Beginners
The ASP.NET Web API for BeginnersThe ASP.NET Web API for Beginners
The ASP.NET Web API for Beginners
 
Website optimization with request reduce
Website optimization with request reduceWebsite optimization with request reduce
Website optimization with request reduce
 
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlusLeveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
 
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
AN EXTENDED MODEL FOR EFFECTIVE MIGRATING PARALLEL WEB CRAWLING WITH DOMAIN S...
 

Similar to CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

The theory and practice of Website Archivability
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website ArchivabilityVangelis Banos
 
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Vangelis Banos
 
What’s Next with Accessibility?
What’s Next with Accessibility?What’s Next with Accessibility?
What’s Next with Accessibility?Keana Lynch
 
Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's pptmak57
 
Case Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformCase Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformMike Taylor
 
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...madhurpatidar2
 
introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptxNaglaaFathy42
 
IWMW 2002: QA for the IWMW Web Site
IWMW 2002: QA for the IWMW Web SiteIWMW 2002: QA for the IWMW Web Site
IWMW 2002: QA for the IWMW Web SiteIWMW
 
Capture All the URLs: First Steps in Web Archiving
Capture All the URLs: First Steps in Web ArchivingCapture All the URLs: First Steps in Web Archiving
Capture All the URLs: First Steps in Web ArchivingKristen Yarmey
 
introduction to web engineering.pdf
introduction to web engineering.pdfintroduction to web engineering.pdf
introduction to web engineering.pdfNaglaaFathy42
 
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Anna Perricci
 
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...andinieldananty
 
Techniques for scaling application with security and visibility in cloud
Techniques for scaling application with security and visibility in cloudTechniques for scaling application with security and visibility in cloud
Techniques for scaling application with security and visibility in cloudAkshay Mathur
 
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital AccessibilityTCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital AccessibilityRaymond Rose
 
Content Management Systems: An Executive Review
Content Management Systems: An Executive ReviewContent Management Systems: An Executive Review
Content Management Systems: An Executive ReviewWilliam Price
 
SharePoint 2013 governance model
SharePoint 2013 governance modelSharePoint 2013 governance model
SharePoint 2013 governance modelYash Goley
 
Tips for Keeping Your Website Healthy.pptx
Tips for Keeping Your Website Healthy.pptxTips for Keeping Your Website Healthy.pptx
Tips for Keeping Your Website Healthy.pptxskaditsolutionsdubai
 
Quality management in continuous delivery and dev ops world pm footprints v1
Quality management in continuous delivery and dev ops world  pm footprints v1Quality management in continuous delivery and dev ops world  pm footprints v1
Quality management in continuous delivery and dev ops world pm footprints v1Dr. Anish Cheriyan (PhD)
 

Similar to CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013 (20)

The theory and practice of Website Archivability
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website Archivability
 
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
 
What’s Next with Accessibility?
What’s Next with Accessibility?What’s Next with Accessibility?
What’s Next with Accessibility?
 
Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's ppt
 
Case Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis PlatformCase Study For Service Providers Analysis Platform
Case Study For Service Providers Analysis Platform
 
IR-AUDIT
IR-AUDITIR-AUDIT
IR-AUDIT
 
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...
WINSEM2021-22_ITE2004_ETH_VL2021220500452_Reference_Material_I_26-04-2022_tes...
 
introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptx
 
IWMW 2002: QA for the IWMW Web Site
IWMW 2002: QA for the IWMW Web SiteIWMW 2002: QA for the IWMW Web Site
IWMW 2002: QA for the IWMW Web Site
 
Capture All the URLs: First Steps in Web Archiving
Capture All the URLs: First Steps in Web ArchivingCapture All the URLs: First Steps in Web Archiving
Capture All the URLs: First Steps in Web Archiving
 
introduction to web engineering.pdf
introduction to web engineering.pdfintroduction to web engineering.pdf
introduction to web engineering.pdf
 
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Work...
 
The Accessible Web
The Accessible WebThe Accessible Web
The Accessible Web
 
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...
KMWorld 2010_Building an Intranet Governance Strategy - Busch and Wahl_201011...
 
Techniques for scaling application with security and visibility in cloud
Techniques for scaling application with security and visibility in cloudTechniques for scaling application with security and visibility in cloud
Techniques for scaling application with security and visibility in cloud
 
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital AccessibilityTCEA Virtual Learning SIG  Lunch and Learn: Understanding Digital Accessibility
TCEA Virtual Learning SIG Lunch and Learn: Understanding Digital Accessibility
 
Content Management Systems: An Executive Review
Content Management Systems: An Executive ReviewContent Management Systems: An Executive Review
Content Management Systems: An Executive Review
 
SharePoint 2013 governance model
SharePoint 2013 governance modelSharePoint 2013 governance model
SharePoint 2013 governance model
 
Tips for Keeping Your Website Healthy.pptx
Tips for Keeping Your Website Healthy.pptxTips for Keeping Your Website Healthy.pptx
Tips for Keeping Your Website Healthy.pptx
 
Quality management in continuous delivery and dev ops world pm footprints v1
Quality management in continuous delivery and dev ops world  pm footprints v1Quality management in continuous delivery and dev ops world  pm footprints v1
Quality management in continuous delivery and dev ops world pm footprints v1
 

More from Vangelis Banos

Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναΥπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναVangelis Banos
 
Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Vangelis Banos
 
Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Vangelis Banos
 
ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγειαVangelis Banos
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaVangelis Banos
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςVangelis Banos
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςVangelis Banos
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeVangelis Banos
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...Vangelis Banos
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήVangelis Banos
 

More from Vangelis Banos (10)

Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναΥπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
 
Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Can you save the web? Web Archiving!
Can you save the web? Web Archiving!
 
Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!
 
ΥπερΔιαύγεια
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγεια
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
 
Η Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της Μετρολογίας
 
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
 
Heterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challenge
 
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
 
Η τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτή
 

Recently uploaded

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...Sri Ambati
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsVlad Stirbu
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaRTTS
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 

CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

  • 1. CLEAR: a Credible Live Evaluation Method of Website Archivability. Vangelis Banos (1), Yunhyong Kim (2), Seamus Ross (2), Yannis Manolopoulos (1). (1) Department of Informatics, Aristotle University, Thessaloniki, Greece; (2) University of Glasgow, United Kingdom. ARCHIVEREADY.COM
  • 2. Table of Contents: 1. Problem definition and related work, 2. Our contributions, 3. Website Archivability, 4. CLEAR: A Credible Live Method to Evaluate Website Archivability, 5. Demonstration: http://archiveready.com/, 6. Limitations and Future Work.
  • 3. Problem definition • Web content acquisition is a critical step in the process of web archiving; • If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless; • There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully; • Web bots face increasing difficulties in harvesting websites.
  • 4. Problem definition • After web harvesting, administrators manually review the content and endorse or reject the harvested material. • Web harvesting is automated while Quality Assurance (QA) is manual. • Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is.
  • 5. Inspired by our work at BlogForever (http://blogforever.eu), building a blog preservation software platform. There is a need for a method to assess website archive readiness in order to support web archiving workflow.
  • 6. Our Contributions 1. The introduction of the notion of Website Archivability, 2. The definition of the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure Website Archivability, 3. ArchiveReady.com, a web application which implements the proposed method.
  • 7. Our Aims 1. Provide a mechanism to improve the quality of web archives. 2. Expand and optimize the knowledge and practices of web archivists, supporting them in their decision making and risk management. 3. Standardize the web aggregation practices of web archives, especially QA. 4. Foster good practices in web development, making sites more amenable to harvesting, ingesting, and preserving. 5. Raise awareness among web professionals regarding preservation.
  • 8. What is Website Archivability? Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability or maintainability.
  • 9. CLEAR: A Credible Live Method to Evaluate Website Archivability • An approach to producing on-the-fly measurement of Website Archivability, • Web archives communicate with target websites via standard HTTP, • Information such as file types, content and transfer errors could be used to support archival decisions, • We combine this kind of information with an evaluation of the website's compliance with recognised practices in digital curation, • We generate a credible score representing the archivability of target websites.
  • 10. CLEAR: A Credible Live Method to Evaluate Website Archivability. Archivability Facets: Accessibility, Cohesion, Standards Compliance, Performance, Metadata.
  • 12. CLEAR • The method can be summarised as follows: 1. Perform specific Evaluations on Website Attributes. 2. Calculate each Archivability Facet's score from its evaluations: • scores range from 0 to 100%, • not all evaluations are equal: if an important evaluation fails, its score is 0%; if a minor evaluation fails, its score is 50%. 3. Produce the final Website Archivability score by combining all Facets' scores, each weighted equally.
  • 14. Accessibility • A website is considered accessible only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP requests.
  • 15. Accessibility. Website Archivability evaluation of http://ipres2013.ist.utl.pt/ on 23rd April 2013. Evaluations and ratings: No RSS feed (50%); No robots.txt (50%); No sitemap.xml (0%); 6 links, all valid (100%). Facet total: 50%.
  • 17. Cohesion • Relevant to: • Efficient operation of web crawlers, • Management of dependencies within digital curation. • If files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more web locations fail. • Changes that occur outside the website are not going to affect it if it does not use 3rd party resources.
  • 18. Cohesion. Website Archivability evaluation of http://ipres2013.ist.utl.pt/ on 23rd April 2013. Evaluations and ratings: 1 external and no internal scripts (0%); 4 local and 1 external images (80%); No proprietary (Quicktime & Flash) files (100%); 1 local CSS file (100%). Facet total: 70%.
  • 20. Metadata • The adequate provision of metadata has been a continuing concern within digital curation. • The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively.
  • 21. Metadata. Website Archivability evaluation of http://ipres2013.ist.utl.pt/ on 23rd April 2013. Evaluations and ratings: Meta description found (100%); HTTP Content type (100%); HTTP Page expiration not found (50%); HTTP Last-modified found (100%). Facet total: 87%.
  • 23. Performance. Performance is an important aspect of web archiving. The throughput of data acquisition of a web spider directly affects the number and complexity of web resources it is able to process. Website Archivability evaluation of http://ipres2013.ist.utl.pt/ on 23rd April 2013. Evaluations and ratings: Average network response time is 0.546ms (100%). Facet total: 100%.
  • 25. Standards Compliance • Compliance with standards is a recurring theme in digital curation practices. Digital resources that are to be preserved should be represented in known and transparent standards.
  • 26. Standards Compliance. Website Archivability evaluation of http://ipres2013.ist.utl.pt/ on 23rd April 2013. Evaluations and ratings: 1 Invalid CSS file (0%); Invalid HTML file (0%); Meta description found (100%); No HTTP Content encoding (50%); HTTP Content Type found (100%); HTTP Page expiration found (100%); HTTP Last-modified found (100%); No Quicktime or Flash objects (100%); 5 images found and validated with JHOVE (100%). Facet total: 87%.
  • 27. iPRES 2013 Website Archivability Evaluation. Facet ratings: Accessibility (50%); Cohesion (70%); Standards Compliance (77%); Metadata (87%); Performance (100%). Website Archivability: 77%.
  • 28. ArchiveReady.com Demonstration - Web application implementing CLEAR, - Web interface & also Web API in JSON, - Running on Linux, Python, Nginx, Redis, MySQL.
  • 30. Impact 1. Web professionals - evaluate the archivability of their websites in an easy but thorough way, - become aware of web preservation concepts, - embrace preservation-friendly practices. 2. Web archive operators - make informed decisions on archiving websites, - perform large scale website evaluations with ease, - automate web archiving Quality Assurance, - minimise wasted resources on problematic websites.
  • 31. Limitations & Future Work 1. It is not optimal to treat all Archivability Facets as equal. 2. The method evaluates a single page of a website, based on the assumption that web pages from the same website share the same components and standards. Sampling would be necessary. 3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. The method needs to be enhanced to reflect this differential valuing of error classes and types.
  • 32. THANK YOU. Vangelis Banos, Web: http://vbanos.gr/, Email: vbanos@gmail.com. ANY QUESTIONS? The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No.269963.
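
The scoring scheme on slides 12 to 27 can be sketched in a few lines of Python. This is only a minimal sketch, assuming that a Facet's score is the unweighted mean of its evaluation ratings and that the Website Archivability score is the unweighted mean of the Facet scores. The slides do not spell out the formula; the assumption is chosen because it reproduces the Accessibility (50%), Cohesion (70%) and overall (77%) figures reported for the iPRES 2013 website. The function names are hypothetical and are not taken from ArchiveReady.com.

```python
from statistics import mean

# Evaluation ratings (in percent) copied from the iPRES 2013 slides.
ACCESSIBILITY = {
    "No RSS feed": 50,
    "No robots.txt": 50,
    "No sitemap.xml": 0,
    "6 links, all valid": 100,
}
COHESION = {
    "1 external and no internal scripts": 0,
    "4 local and 1 external images": 80,
    "No proprietary (Quicktime & Flash) files": 100,
    "1 local CSS file": 100,
}

# Facet ratings as reported on slide 27.
FACETS = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}


def facet_score(ratings):
    """Assumed rule: a Facet's score is the plain mean of its evaluation ratings."""
    return mean(ratings.values())


def website_archivability(facet_scores):
    """Assumed rule: every Facet contributes equally to the final score."""
    return mean(facet_scores.values())


if __name__ == "__main__":
    print("Accessibility:", facet_score(ACCESSIBILITY))
    print("Cohesion:", facet_score(COHESION))
    print("Website Archivability:", round(website_archivability(FACETS)))
```

Keeping the per-evaluation ratings in plain dictionaries makes it easy to see which failed evaluation pulls a Facet down, which is the information a web professional would act on.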

Editor's Notes

  1. Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for such reasons as website complexity, the plethora of underlying technologies and ultimately the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. An appreciation of the archivability of a web site should provide archivists with a valuable tool when assessing the possibilities of archiving material and influence web design professionals to consider the implications of their design decisions on the likelihood that it could be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
  2. Dirty data -> useless system. As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
  3. According to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository. WCT supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Focusing on quality review, when a harvest is complete, the harvest result is saved in the digital asset store, and the Target Instance is saved in the Harvested state. The next step is for the Target Instance Owner to Quality Review the harvest. WCT operators perform this task manually. For example, the IIPC has organized a Crowdsourcing workshop which included a QA task.
  4. Website archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1].
  5. The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
  6. Cohesion is tested on three levels: • examining how many hosts are employed in relation to the location of referenced media content, • examining how many hosts are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, and javascripts), • examining the number of times proprietary software or plugins are referenced.
  7. Already contacted by the following institutions: The Internet Archive, University of Manchester, Columbia University Libraries, Society of California Archivists General Assembly, Old Dominion University, Virginia, USA, Digital Archivists in Netherlands.
  8. For instance, Metadata breadth and depth might be critical for a particular web archiving research task, and therefore in establishing the archivability score for a particular site the user may wish to instantiate this thinking in calculating the overall score. A next step will be to introduce a mechanism to allow the user to weight each Archivability Facet to reflect specific objectives. One way to address these concerns might be to apply an approach similar to normalized discounted cumulative gain (NDCG) in information retrieval49: for example, a user can rank the questions/errors to prioritise them for each facet. The basic archivability score can be adjusted to penalise the outcome when the website does not meet the higher ranked criteria. Further experimentation with the tool will lead to a richer understanding of new directions in automation in web archiving.
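
The per-Facet weighting proposed in note 8 could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' design: the weights are user-supplied non-negative numbers, the score is a weighted mean normalised by the total weight, and the function name and the example weights are hypothetical. It does not attempt the NDCG-style ranking of individual errors that the note also mentions.

```python
# Facet ratings (in percent) as reported for the iPRES 2013 website.
FACETS = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}


def weighted_archivability(facet_scores, weights):
    """Weighted mean of Facet scores; weights express the user's priorities."""
    if set(facet_scores) != set(weights):
        raise ValueError("every Facet needs exactly one weight")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(facet_scores[name] * weights[name] for name in facet_scores)
    return weighted_sum / total_weight


if __name__ == "__main__":
    equal = {name: 1 for name in FACETS}
    # Hypothetical priorities for a metadata-centred research task.
    metadata_first = dict(equal, Metadata=3)
    print(round(weighted_archivability(FACETS, equal), 1))
    print(round(weighted_archivability(FACETS, metadata_first), 1))
```

With all weights equal the function reduces to the unweighted mean, so a weighting mechanism of this kind would leave existing scores unchanged for users who express no preference.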