CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for such reasons as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and should influence web design professionals to consider the implications of their design decisions on the likelihood that a website can be archived.
A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.


  • Dirty data -> useless system. As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
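A minimal sketch of how a check for these two protocols could look, assuming only the requests library; the function name, timeout and scoring are illustrative assumptions, not the ArchiveReady implementation.

```python
import requests
from urllib.parse import urljoin

def check_crawler_support(site_url, timeout=10):
    """Illustrative check: does the site expose robots.txt and sitemap.xml?"""
    results = {}
    for name in ("robots.txt", "sitemap.xml"):
        try:
            resp = requests.get(urljoin(site_url, "/" + name), timeout=timeout)
            results[name] = resp.status_code == 200
        except requests.RequestException:
            results[name] = False
    return results

print(check_crawler_support("http://example.com/"))
# e.g. {'robots.txt': True, 'sitemap.xml': False}
```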
  • According to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository. WCT supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Focusing on quality review: when a harvest is complete, the harvest result is saved in the digital asset store, and the Target Instance is saved in the Harvested state. The next step is for the Target Instance Owner to quality review the harvest. WCT operators perform this task manually. For example, the IIPC has organised a crowdsourcing workshop which included a QA task.
  • Website archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1].
  • The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
  • Cohesion is tested on three levels: (1) examining how many hosts are employed in relation to the location of referenced media content; (2) examining how many hosts are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, and JavaScript files); and (3) examining the number of times proprietary software or plugins are referenced.
  • Already contacted by the following institutions: The Internet Archive, University of Manchester, Columbia University Libraries, Society of California Archivists General Assembly, Old Dominion University (Virginia, USA), and digital archivists in the Netherlands.
  • For instance, metadata breadth and depth might be critical for a particular web archiving research task, and therefore, in establishing the archivability score for a particular site, the user may wish to instantiate this thinking in calculating the overall score. A next step will be to introduce a mechanism to allow the user to weight each Archivability Facet to reflect specific objectives. One way to address these concerns might be to apply an approach similar to normalised discounted cumulative gain (NDCG) in information retrieval: for example, a user can rank the questions/errors to prioritise them for each facet. The basic archivability score can be adjusted to penalise the outcome when the website does not meet the higher-ranked criteria. Further experimentation with the tool will lead to a richer understanding of new directions in automation in web archiving.
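One possible reading of this weighting idea, as a rough sketch only: the facet weights, the penalty values and the rank-based discount formula below are assumptions for illustration, not values or formulas from the paper.

```python
import math

def weighted_archivability(facet_scores, weights):
    """facet_scores/weights: dicts keyed by facet name, scores in 0..100."""
    total_weight = sum(weights.values())
    return sum(facet_scores[f] * weights[f] for f in facet_scores) / total_weight

def rank_discounted_score(failed_checks_by_rank):
    """Penalise failures more heavily when the user ranked the check higher.
    failed_checks_by_rank: list of (rank, penalty), rank 1 = most important."""
    score = 100.0
    for rank, penalty in failed_checks_by_rank:
        score -= penalty / math.log2(rank + 1)  # DCG-style discount (assumption)
    return max(score, 0.0)

facets = {"Accessibility": 50, "Cohesion": 70, "Standards Compliance": 77,
          "Metadata": 87, "Performance": 100}
weights = {"Accessibility": 2, "Cohesion": 1, "Standards Compliance": 1,
           "Metadata": 3, "Performance": 1}  # e.g. a metadata-focused task
print(round(weighted_archivability(facets, weights), 1))
```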

CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013 Presentation Transcript

  • 1. CLEAR: a Credible Live Evaluation Method of Website Archivability Vangelis Banos1, Yunhyong Kim2, Seamus Ross2, Yannis Manolopoulos1 1Department of Informatics, Aristotle University, Thessaloniki, Greece 2University of Glasgow, United Kingdom ARCHIVEREADY.COM
  • 2. 2 Table of Contents 1. Problem definition and related work, 2. Our contributions, 3. Website Archivability, 4. CLEAR: A Credible Live Method to Evaluate Website Archivability, 5. Demonstration: http://archiveready.com/, 6. Limitations and Future Work.
  • 3. Problem definition • Web content acquisition is a critical step in the process of web archiving; • If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless; • There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully; • Web bots face increasing difficulties in harvesting websites. 3
  • 4. Problem definition • After web harvesting, administrators manually review the content and endorse or reject the harvested material. • Web harvesting is automated while Quality Assurance (QA) is manual. • Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is.
  • 5. Inspired by our work at building a blog preservation software platform (http://blogforever.eu): there is a need for a method to assess website archive readiness in order to support web archiving workflows.
  • 6. Our Contributions 1. the introduction of the notion of Website Archivability, 2. the definition of the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure Website Archivability, 3. ArchiveReady.com, a web application which implements the proposed method.
  • 7. 7 1. Mechanism to improve the quality of web archives. 2. Expand and optimize the knowledge and practices of web archivists, supporting them in their decision making, and risk management. 3. Standardize the web aggregation practices of web archives, especially QA. 4. Foster good practices in web development, make sites more amenable to harvesting, ingesting, and preserving. 5. Raise awareness among web professionals regarding preservation. Our Aims
  • 8. What is Website Archivability? Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. Attention! It must not be confused with website dependability (reliability, availability, safety, security, survivability, maintainability).
  • 9. CLEAR: A Credible Live Method to Evaluate Website Archivability • An approach to producing on-the-fly measurement of Website Archivability, • Web archives communicate with target websites via standard HTTP, • Information such as file types, content and transfer errors could be used to support archival decisions, • We combine this kind of information with an evaluation of the website's compliance with recognised practices in digital curation, • We generate a credible score representing the archivability of target websites. 9
  • 10. CLEAR: A Credible Live Method to Evaluate Website Archivability. The five Archivability Facets: Accessibility, Cohesion, Standards Compliance, Performance, Metadata.
  • 11. Website attributes evaluated using CLEAR
  • 12. C L E A R • The method can be summarised as follows: 1. Perform specific Evaluations on Website Attributes; 2. Calculate each Archivability Facet's score from those evaluations: scores range from 0 to 100%, and not all evaluations are equal: if an important evaluation fails, score = 0%; if a minor evaluation fails, score = 50%; 3. Produce the final Website Archivability score as the sum of all Facets' scores (a sketch follows below).
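A minimal sketch of this aggregation, assuming the rules stated on the slide; the equal-weight combination of facets at the end is an interpretation inferred from the example scores on the following slides, not a verbatim transcription of the algorithm.

```python
def facet_score(evaluations):
    """evaluations: list of (passed, importance), importance 'high' or 'low'.
    A failed high-importance evaluation rates 0%, a failed low-importance
    one 50%, a passed evaluation 100% (as described on this slide)."""
    ratings = []
    for passed, importance in evaluations:
        if passed:
            ratings.append(100)
        else:
            ratings.append(0 if importance == "high" else 50)
    return sum(ratings) / len(ratings)

def website_archivability(facet_scores):
    # Equal-weight combination; the iPRES example (77% overall) suggests a
    # plain average of the five facet scores (assumption).
    return sum(facet_scores.values()) / len(facet_scores)

accessibility = facet_score([(False, "low"),   # no RSS feed      -> 50
                             (False, "low"),   # no robots.txt    -> 50
                             (False, "high"),  # no sitemap.xml   -> 0
                             (True, "high")])  # all links valid  -> 100
print(accessibility)  # 50.0, matching the Accessibility facet on slide 15
```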
  • 13. Accessibility 13
  • 14. Accessibility • A website is considered accessible only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP requests. 14
  • 15. Accessibility, http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013. Evaluations and ratings: No RSS feed: 50%; No robots.txt: 50%; No sitemap.xml: 0%; 6 links, all valid: 100%. Facet total: 50%.
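One of the evaluations above is link validity. A hedged sketch of how such a check could be implemented with requests and beautifulsoup4; the function name and the use of HEAD requests are illustrative assumptions, not the ArchiveReady implementation.

```python
import requests
from bs4 import BeautifulSoup  # third-party: beautifulsoup4
from urllib.parse import urljoin

def check_links(page_url, timeout=10):
    """Fetch a page, resolve its <a href> targets and report how many respond."""
    html = requests.get(page_url, timeout=timeout).text
    soup = BeautifulSoup(html, "html.parser")
    links = {urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)}
    valid = 0
    for link in links:
        try:
            resp = requests.head(link, allow_redirects=True, timeout=timeout)
            valid += resp.status_code < 400
        except requests.RequestException:
            pass
    return valid, len(links)

print(check_links("http://example.com/"))  # (valid links, total links)
```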
  • 16. Cohesion 16
  • 17. Cohesion • Relevant to: • the efficient operation of web crawlers, • the management of dependencies in digital curation. • If the files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more of those locations fail. • Changes that occur outside the website will not affect it if it does not use 3rd-party resources.
  • 18. Cohesion, http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013. Evaluations and ratings: 1 external and no internal scripts: 0%; 4 local and 1 external images: 80%; No proprietary (QuickTime & Flash) files: 100%; 1 local CSS file: 100%. Facet total: 70%.
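A rough sketch of counting local versus third-party resources, the kind of signal the Cohesion evaluations above rely on; the tag/attribute list and the host comparison are simplifying assumptions for illustration.

```python
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup

def cohesion_counts(page_url):
    """Count referenced scripts, images and stylesheets hosted on the page's
    own host versus on third-party hosts (illustrative approximation only)."""
    host = urlparse(page_url).netloc
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    counts = {"local": 0, "external": 0}
    for tag, attr in (("script", "src"), ("img", "src"), ("link", "href")):
        for el in soup.find_all(tag):
            ref = el.get(attr)
            if not ref:
                continue
            ref_host = urlparse(urljoin(page_url, ref)).netloc
            counts["local" if ref_host == host else "external"] += 1
    return counts

print(cohesion_counts("http://example.com/"))
```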
  • 19. Metadata 19
  • 20. Metadata • The adequate provision of metadata has been a continuing concern within digital curation. • The lack of metadata impairs the archive's ability to manage, organise, retrieve and interact with content effectively.
  • 21. Metadata, http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013. Evaluations and ratings: Meta description found: 100%; HTTP Content-Type: 100%; HTTP page expiration not found: 50%; HTTP Last-Modified found: 100%. Facet total: 87%.
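A hedged sketch of gathering the HTTP- and HTML-level metadata signals this slide scores; the naive substring test for the meta description and the choice of headers are illustrative assumptions.

```python
import requests

def metadata_signals(page_url):
    """Collect simple metadata signals from one page (illustrative only)."""
    resp = requests.get(page_url, timeout=10)
    headers = resp.headers  # case-insensitive mapping
    # Naive substring check for a <meta name="description"> tag (assumption).
    has_meta_description = '<meta name="description"' in resp.text.lower()
    return {
        "meta_description": has_meta_description,
        "content_type": "Content-Type" in headers,
        "expires": "Expires" in headers or "Cache-Control" in headers,
        "last_modified": "Last-Modified" in headers,
    }

print(metadata_signals("http://example.com/"))
```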
  • 22. Performance 22
  • 23. Performance • Performance is an important aspect of web archiving. The throughput of data acquisition of a web spider directly affects the number and complexity of web resources it is able to process. Evaluation for http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013: Average network response time is 0.546 ms: 100%. Facet total: 100%.
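Measuring response time from a client is straightforward; a minimal sketch, noting that the sample size is arbitrary and a real crawler would measure throughput over many resources rather than one page.

```python
import time
import requests

def average_response_time(page_url, samples=5):
    """Average wall-clock time of repeated GET requests (illustrative only)."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        requests.get(page_url, timeout=30)
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings)

print(f"{average_response_time('http://example.com/'):.3f} s")
```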
  • 24. Standards Compliance 24
  • 25. Standards Compliance • Compliance with standards is a recurring theme in digital curation practices. It is recommended that for digital resources to be preserved they need to be represented in known and transparent standards. 25
  • 26. Standards Compliance, http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013. Evaluations and ratings: 1 invalid CSS file: 0%; Invalid HTML file: 0%; Meta description found: 100%; No HTTP Content-Encoding: 50%; HTTP Content-Type found: 100%; HTTP page expiration found: 100%; HTTP Last-Modified found: 100%; No QuickTime or Flash objects: 100%; 5 images found and validated with JHOVE: 100%. Facet total: 87%.
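The slide combines format validation (HTML, CSS, images via JHOVE) with HTTP-level checks. A rough stand-in sketch: Pillow is used here purely in place of JHOVE for image well-formedness, and the header checks are a small illustrative subset, not the actual CLEAR evaluation set.

```python
import io
import requests
from PIL import Image  # third-party: Pillow, a stand-in for JHOVE here

def validate_image(image_url):
    """Fetch an image and verify it parses as a well-formed image file."""
    data = requests.get(image_url, timeout=10).content
    try:
        Image.open(io.BytesIO(data)).verify()
        return True
    except Exception:
        return False

def http_compliance(page_url):
    """Check a few of the HTTP headers this slide rates (illustrative subset)."""
    headers = requests.head(page_url, timeout=10).headers
    return {
        "content_encoding": "Content-Encoding" in headers,
        "content_type": "Content-Type" in headers,
        "expires": "Expires" in headers,
        "last_modified": "Last-Modified" in headers,
    }
```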
  • 27. iPRES 2013 Website Archivability Evaluation: Accessibility: 50%; Cohesion: 70%; Standards Compliance: 77%; Metadata: 87%; Performance: 100%. Overall Website Archivability: 77%.
  • 28. ArchiveReady.com Demonstration - Web application implementing CLEAR, - Web interface and a JSON web API, - Running on Linux, Python, Nginx, Redis, MySQL.
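Since the slide mentions a JSON web API, a short consumer sketch may help; the endpoint path, query parameter and response fields below are hypothetical placeholders, so consult archiveready.com for the real interface.

```python
import requests

# Hypothetical endpoint and parameter names; not taken from the presentation.
API_URL = "http://archiveready.com/api"

def evaluate(website_url):
    """Ask the (assumed) evaluation API for a JSON report on one website."""
    resp = requests.get(API_URL, params={"url": website_url}, timeout=60)
    resp.raise_for_status()
    return resp.json()  # expected to contain per-facet scores (assumption)

if __name__ == "__main__":
    print(evaluate("http://ipres2013.ist.utl.pt/"))
```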
  • 29. 29
  • 30. Impact 30 1. Web professionals - evaluate the archivability of their websites in an easy but thorough way, - become aware of web preservation concepts, - embrace preservation-friendly practices. 2. Web archive operators - make informed decisions on archiving websites, - perform large scale website evaluations with ease, - automate web archiving Quality Assurance, - minimise wasted resources on problematic websites.
  • 31. 31 Limitations & Future Work 1. Not optimal to treat all Archivability Facets as equal. 2. Evaluating a single website page, based on the assumption that web pages from the same website share the same components and standards. Sampling would be necessary. 3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. The method needs to be enhanced to reflect this differential valuing of error classes and types.
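The sampling point above could be approached, for example, by drawing page URLs from the site's sitemap and evaluating each; a hedged sketch in which the sample size and random selection strategy are arbitrary choices, not part of the method.

```python
import random
import requests
import xml.etree.ElementTree as ET

def sample_sitemap_urls(sitemap_url, k=10):
    """Pick up to k page URLs from a sitemap.xml for repeated evaluation."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
    return random.sample(urls, min(k, len(urls)))

print(sample_sitemap_urls("http://example.com/sitemap.xml"))
```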
  • 32. THANK YOU Vangelis Banos Web: http://vbanos.gr/ Email: vbanos@gmail.com ANY QUESTIONS? 32 The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No.269963.