Scaling up to archive the UK Web. Helen Hockx-Yu


Presented at the Jornada Internacional sobre Archivos Web y Depósito Legal Electrónico (International Conference on Web Archives and Electronic Legal Deposit), held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

Published in: Technology, Business

  1. Scaling up to archive the UK Web. Helen Hockx-Yu
  2. Web Archiving Timeline (2001-2013)
     • Explore (2001-2002)
       – Launch Domain.UK project
       – No public access
     • Collaborate (2003-2008)
       – Establish Web Archiving Programme
       – Lead UK Web Archiving Consortium
       – Launch UK Web Archive
     • Build capacity (2008-2011)
       – People, systems and processes
       – Curatorial expertise
       – Technical know-how
     • BAU, business as usual (2011 onwards)
       – Web Archiving as operational unit
       – Implement non-print Legal Deposit since April 2013
  3. Before (6 April 2013)
     • Selective archiving of websites that
       – reflect the diversity of lives, interests and activities throughout the UK
       – contain research value or are of research interest
       – feature political, cultural, social and economic events of national interest
       – demonstrate innovative use of the web
       – Also prioritise websites at risk and web-only content
     • Permission based
       – Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
       – 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
     • Online access through UK Web Archive
  4. Toolset
     • Selection and Permission Tool
       – Selection and permission management
       – Integrated with the Web Curator Tool
     • Web Curator Tool
       – Job scheduling
       – Metadata
       – Access control
       – Harvesting (uses Heritrix)
       – QA
     • Indexing and SIP generation
       – Scripts and Solr (for full-text index)
     • Wayback
       – Rendering tool for WARCs
     • UK Web Archive
       – Web-based end user interface
  5. Access
     • Currently 3 ways to access the web archive
       – Online through the UK Web Archive
       – Catalogue records (of special collections)
       – Keyword search through Primo (corporate resource discovery system)
     • Conduct researcher survey / research projects to understand requirements
  6. A catalogue record for a collection
  7. Keyword search through Primo
  8. UK Web Archive
     • 14,118 websites, 60,482 instances, 17.6TB WARCs
     • Over 182,761 unique visits 1st April ‘12 – 31st March ‘13
     • Key websites include videos
     • Full-text, N-gram, title and URL search
     • Browse by subject / special collection, visual browsing
     • Analytical access
  9. Analytical access
     • Shift of focus from the level of single webpages or websites to the entire web archive collection
     • Use web archives as datasets, with access to metadata and knowledge about websites
     • Support survey, annotation, contextualisation and visualisation
     • Allows discovery of patterns, trends and relationships in inter-linked web pages
     • Helps address a number of challenging issues
       – Scalability
       – Accessibility of individual websites
       – Components missed by crawlers
  10. After (6 April 2013)
      • Government introduced Non-print Legal Deposit Regulations 2013
      • Apply to material published digitally and online, including articles, books and websites
      • 6 UK Legal Deposit Libraries
      • Deposited content accessible “on library premises controlled by the deposit library”
        – after 7 days of collection or deposit
        – Single concurrent access
        – Catalogue records allowed to be searchable online
        – Digital copying not permitted
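The access conditions on this slide can be read as a simple rule check. The sketch below is illustrative only: the function name `may_open` and its parameters are hypothetical, "after 7 days" is assumed to mean at least seven full days since collection or deposit, and "single concurrent access" is assumed to mean that only one reader may view a given item at a time.

```python
from datetime import date, timedelta

# Assumption: "after 7 days" means at least 7 full days have elapsed.
EMBARGO = timedelta(days=7)


def may_open(deposited_on: date, today: date,
             on_premises: bool, readers_now: int) -> bool:
    """Return True if a reader may open a deposited item (hypothetical rule check)."""
    # Deposited content is only accessible on library premises.
    if not on_premises:
        return False
    # Content becomes accessible only once the 7-day period has passed.
    if today - deposited_on < EMBARGO:
        return False
    # Assumed reading of "single concurrent access": nobody else has it open.
    return readers_now == 0


if __name__ == "__main__":
    print(may_open(date(2013, 4, 8), date(2013, 4, 20), True, 0))
```

The slide does not say how the libraries enforce these conditions, so this only restates them in executable form.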
  11. Legal Deposit of UK websites
      • In scope
        – Sites that use a .uk or other UK geographic top-level domain
        – Sites where part of the publishing process takes place in the UK
      • Will not archive
        – Sites concerning film and recorded sound where the audio-visual content predominates
        – Private intranets and emails
      • Over 10 million .uk registered domains
        – 4th largest TLD after .com, .de and .net
        – UK organisations also use non .uk domain names (e.g. .com or .org), scale unknown
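The scope rules on this slide can be sketched as a small predicate. This is a minimal illustration, not the British Library's actual implementation: the function `in_scope`, its flag parameters and the `UK_SUFFIXES` tuple are all hypothetical names. Only the domain test can be derived from a URL; whether publishing takes place in the UK, whether audio-visual content predominates and whether a site is private are assumed to be supplied by a curator or another process.

```python
from urllib.parse import urlsplit

# Assumption: only ".uk" is listed here; any other UK geographic
# top-level domains would have to be added by the operator.
UK_SUFFIXES = (".uk",)


def in_scope(url: str, published_in_uk: bool = False,
             av_predominant: bool = False, private: bool = False,
             uk_suffixes: tuple = UK_SUFFIXES) -> bool:
    """Hypothetical Legal Deposit scope check based on the slide's rules."""
    # Exclusions: audio-visual content predominates, or private intranet/email.
    if av_predominant or private:
        return False
    # hostname is None for URLs without a scheme, so fall back to "".
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if any(host.endswith(suffix) for suffix in uk_suffixes):
        return True
    # Non-.uk sites qualify only if part of the publishing process is in the UK.
    return published_in_uk


if __name__ == "__main__":
    for candidate in ("http://www.bl.uk/", "http://example.com/"):
        print(candidate, in_scope(candidate))
```

The `published_in_uk` flag makes the "scale unknown" problem explicit: for .com or .org sites the decision cannot be made from the URL alone.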
  12. Collecting strategy
      (Diagram labels: Domain Crawl, News, Key sites, Events, Special collection)
      • Domain crawl
        – Broad sweep of UK domain
        – Once or twice a year
      • Events & key sites and news
        – Events of UK interest
        – High value, high impact sites
        – National & regional news
      • Special Collection
        – Focused, thematic collections
        – Support priority subjects
  13. Access strategy
      • Deposited content cannot be accessed outside the reading rooms
      • Online access can be provided to metadata and selected content to showcase the Legal Deposit web archive of the UK
        – Bibliographic metadata
        – Analysis and visualisation of aggregated content
        – Statistical and contextual data
        – Copy of deposited content with direct permission
      • For sites from outside the UK, permission both to harvest and for public access will be required
  14. Before and after: what has changed
      • Everything!
      • Scale. Before: 14,000. After: 4 – 5 million
      • Purpose. Before: Advocacy, demonstrating benefits. After: Legal Deposit
      • Workflow (and tools). Before: Selection prior to harvesting. After: Selection / curation can happen post harvesting
      • Permission to archive. Before: Required. After: Can collect in-scope material without permission
      • Access. Before: Online. After: Reading rooms only (unless with direct permission for online access)
      • Nature of QA. Before: Quality control leading to deselection. After: Flagging up quality issues
      • Ownership. Before: British Library. After: Legal Deposit Libraries
  15. Progress
      • Experimental domain crawl in August-December 2012, no access
        – Started with 4.8 million seeds
        – Collected 27TB data + 1TB of crawl logs
      • 1st Legal Deposit domain crawl started in April
        – Started with 3.8 million seeds
        – Ran between 8th April - 21st June and collected over 31TB data
      • Focused collection on National Health Service Reform
        – Showcase end-to-end processes including ingest and access in reading room in early July
      • Selecting key sites, news sites and events
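As a rough sense of scale, the figures for the first Legal Deposit domain crawl can be turned into average rates. The sketch below assumes the crawl ran in 2013 (the slide gives no year), counts both the start and end dates, treats "over 31TB" as exactly 31 TB, and uses decimal units (1 TB = 1,000,000 MB); the function name `crawl_stats` is hypothetical.

```python
from datetime import date


def crawl_stats(start: date, end: date, terabytes: float, seeds: int):
    """Return (days, TB per day, MB per seed) for a crawl; averages only."""
    days = (end - start).days + 1          # inclusive of both endpoints
    tb_per_day = terabytes / days
    mb_per_seed = terabytes * 1_000_000 / seeds  # decimal: 1 TB = 1e6 MB
    return days, tb_per_day, mb_per_seed


if __name__ == "__main__":
    # Assumption: year 2013; 31 TB is a lower bound ("over 31TB").
    days, tb_per_day, mb_per_seed = crawl_stats(
        date(2013, 4, 8), date(2013, 6, 21), 31.0, 3_800_000
    )
    print(f"{days} days, {tb_per_day:.2f} TB/day, {mb_per_seed:.1f} MB/seed")
```

Under these assumptions the crawl averaged about 0.41 TB per day and roughly 8 MB per seed. Both are lower bounds, since the slide reports "over 31TB".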
  16. Collection: National Health Service (NHS) Reform
  17. Challenges
      • Legal deposit territoriality and scope
      • Advanced content
      • User experience
      • Monitoring software, rendering engine
      • Change of business processes
  18. Thank you