Building a Collection of the Historical UK Web for scholarly use 
Helen Hockx-Yu 
Head of Web Archiving, British Library
Helen Hockx-Yu
Head of Web Archiving, British Library. Presentation at ALISS 2014 event.

Published in: Education

  1. Building a Collection of the Historical UK Web for scholarly use. Helen Hockx-Yu, Head of Web Archiving, British Library
  2. The UK Web Domain
     • 4th TLD after .com, .de and .net
     • Over 10 million registered .uk domains
     • UK organisations also use non-.uk domain names (eg .com or .org) – scale unknown
     • Non-print Legal Deposit (since April 2013) applies to the open (freely available) web: .uk and other UK-published (non-.uk) websites, such as .com, .org…
     • It also applies to e-journals, e-books, news web pages and other digital publications, either by harvesting or by mutual agreement on other delivery methods
  3. Web Archiving at the British Library
     • Collect UK digital heritage and provide continued access to archived web resources
     • Started web archiving in 2003: Open UK Web Archive
       • Selective, topical collections and key sites
       • Consortium sharing infrastructure and development effort; agreement on who collects what
       • Curating collections with organisations and researchers
     • Archiving UK Web for non-print Legal Deposit since April 2013: Legal Deposit UK Web Archive
       • Comprehensive national archive with on-site access only
       • Joint responsibility of six Legal Deposit Libraries (LDLs)
  4. Collecting strategy for websites
     Domain crawl:
     • Broad sweep of UK domain
     • Once or twice a year
     Events & key sites and news:
     • Events of UK interest
     • High value, high impact sites
     • National & regional news
     Special Collection:
     • Focused, thematic collections
     • Support priority subjects
  5. UK websites – territoriality explained
     An online work is considered as “published in the UK”, and therefore in scope for Legal Deposit, if it meets either of the following criteria:
     (a) it is made available to the public from a website with a domain name which relates to the United Kingdom or to a place within the United Kingdom; or
     (b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom
     The Legal Deposit Libraries (Non-Print Works) Regulations, 2013
  6. Territoriality - implementation
     • All websites with a .uk domain name
       • Including embedded content (eg CSS, images) regardless of where it is hosted
     • Non-.uk websites have to meet at least one criterion:
       • UK hosting: check external IP geo-location database and add in-scope URLs to the fetch-chain
       • UK postal address
       • Correspondence
       • Professional judgement
  7. UK Domain Crawl
     2013 domain crawl stats
     • 3.86 million seeds
     • 1.9 billion URLs (web pages, docs, images)
     • ~31TB
     • Duration: 70 days
     2014 domain crawl
     • 90 million seeds (starting URLs)
     • Started on 19th June 2014
     • Collected 52TB of data by 9th December (incl. 4.4GB of viruses & 3TB of homepage screenshots)
     • Nearly 2 million non-.uk domains
  8. The “access” paradoxes
     • Completeness versus openness of web archives
       • Legal Deposit national collections have restricted access
     • Document-centred versus data-driven
       • Essentially a scale issue
       • Pre-selected or defined collections are not relevant to all researchers; relevant content is difficult to find in a large-scale web archive
     • Arbitrary (national) boundaries are often irrelevant to the research question, but most heritage institutions operate within certain geographical areas …
  9. Web archive as historical document
  10. Collaboration with researchers
      • Building collections
        • Researchers’ involvement in scoping collections, selecting and describing websites
        • Creation of specific (narrow) topical collections
      • Formulating research question
        • Brain-storm sessions, workshops, discussion, surveys etc.
        • Lack of awareness & baseline knowledge
        • Challenging: you don’t know what you don’t know
      • Co-development of access services
        • This is changing how we collect and store data
  11. JISC UK Web Domain dataset (1996-2013)
      • Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library
      • Extracted copies of UK websites from the Internet Archive’s collection
        • 1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs
        • 2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)
      • Research agreement between JISC and IA, upholding IA’s Terms of Use
      • Access via IA’s Wayback Machine
      • Allows replication / extraction of derivative or secondary datasets
      • BL hosts the dataset on behalf of JISC
      • Data used by research projects
        • Institute of Historical Research project: Analytical Access to the Domain Dark Archive (AADDA)
        • Oxford Internet Institute project: Big data for political science
  12. Completed work
      • Analytical Access to the Domain Dark Archive Project
        • Use cases & experimental UI
      • Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
        • Analysis of link graph
        • Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web
      • MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing
      • Secondary datasets under open licence
        • Format profile, Geoindex, Host Link Graph
  13. Exploring Host Link Graph (courtesy of Peter Webster, Rainer Simon and Jules Mataly)
  14. Visualising links (to and from bl.uk): Interactive version; How it is done
  15. Visualising links (to and from bl.uk): Interactive version; How it is done
  16. Evolution of the UK web (2004-2013)
  17. Memento service
  18. Big UK Domain Data for Arts and Humanities
      • Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects
      • Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University
      • Develop theoretical and methodological framework for the study of web archives
      • Build on AADDA: researchers and the BL co-produce access tools
      • A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines
      • Also an online training course and peer-reviewed journal articles.
  19. Web archiving researcher bursaries
  20. Query building; Corpus formation and handling; Annotation and curation; In-corpus analysis; Whole-dataset analysis; Shine
  21. What’s in it for us?
      • Helps researchers understand the value of web archives and explore new ways of using these for scholarly research
      • Allows BL to obtain hands-on experience with indexing and processing large scale web archive datasets
      • (Prototypes) analytics and visualisations can be applied to our own Legal Deposit collection
      • Enables BL to participate in various UK, European and international projects
      • Helps curators understand characteristics of large scale digital corpora
      • Improve the way we collect and store web archives
  22. Web archives for reference AND for analytics
      • Base-line knowledge self-explanatory
      • Focus on national events for curated collections; provide means to assemble research corpora
      • Link to what we do not have
      • Offer a bag of tools to support scholarly use
      The go-to state
      • Exploit open licences, changes to copyright law
      • Online access to selected websites, metadata and secondary datasets
      • The British Library Collection Development Policy for websites
      • Lobbying – review of Non-print Legal Deposit Regulations in 2018
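
The territoriality rules on slide 6 can be read as a small decision function. The sketch below is only an illustration of that reading, not the British Library's crawler logic: `Evidence`, its fields, the `embedded_in_uk_page` flag and `in_scope` are all hypothetical names, and the IP geo-location lookup is assumed to have been done elsewhere and passed in as a boolean.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Evidence:
    """Hypothetical evidence a crawler or curator holds about a non-.uk site."""
    hosted_in_uk: bool = False        # result of an IP geo-location lookup (assumed done elsewhere)
    uk_postal_address: bool = False   # a UK postal address is associated with the site
    correspondence: bool = False      # confirmed through correspondence
    curator_judgement: bool = False   # professional judgement by a curator


def in_scope(url: str,
             evidence: Optional[Evidence] = None,
             embedded_in_uk_page: bool = False) -> bool:
    """Return True if an absolute URL would be in scope under the slide 6 rules."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Rule 1: every website with a .uk domain name is in scope.
    if host.endswith(".uk"):
        return True
    # Rule 1a: embedded content (CSS, images) of an in-scope page,
    # regardless of where it is hosted.
    if embedded_in_uk_page:
        return True
    # Rule 2: a non-.uk website must meet at least one criterion.
    ev = evidence or Evidence()
    return any((ev.hosted_in_uk, ev.uk_postal_address,
                ev.correspondence, ev.curator_judgement))
```

For instance, `in_scope("http://www.bl.uk/")` is true on the domain name alone, while a `.com` URL is in scope only when some `Evidence` field is set or the resource is embedded in an in-scope page.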

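
A host link graph of the kind named on slides 12 to 15 can be approximated by collapsing page-to-page links into host-to-host counts. The slides do not show the real dataset's format or how the visualisations were built, so the following is a minimal sketch under assumptions: the input is an iterable of hypothetical `(source_url, target_url)` pairs, and `host_of`, `host_link_graph` and `links_for_host` are illustrative names.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of an absolute URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def host_link_graph(links):
    """Count links between distinct hosts from (source_url, target_url) pairs."""
    graph = Counter()
    for source_url, target_url in links:
        src, dst = host_of(source_url), host_of(target_url)
        if src and dst and src != dst:  # ignore malformed URLs and same-host links
            graph[(src, dst)] += 1
    return graph


def links_for_host(graph, host):
    """Split the graph into links pointing to and from one host."""
    incoming = {src: n for (src, dst), n in graph.items() if dst == host}
    outgoing = {dst: n for (src, dst), n in graph.items() if src == host}
    return incoming, outgoing


# Made-up sample links, for illustration only.
sample_links = [
    ("http://www.bl.uk/a.html", "http://www.example.ac.uk/"),
    ("http://www.bl.uk/b.html", "http://www.example.ac.uk/lib"),
    ("http://www.example.co.uk/", "http://www.bl.uk/"),
    ("http://www.bl.uk/a.html", "http://www.bl.uk/b.html"),
]
graph = host_link_graph(sample_links)
incoming, outgoing = links_for_host(graph, "www.bl.uk")
```

Counting per host rather than per page keeps the graph small enough to explore, at the cost of losing which individual pages carry the links.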