This document summarizes a project that analyzes the JISC UK Web Domain Dataset (1996-2013) to understand the development of UK web space over time. The project aims to establish a theoretical and methodological framework for analyzing web archives and to explore the ethical implications of big data research. It will produce tools to support analysis, case studies across arts and humanities disciplines, and training materials. The dataset contains around 300 million resources from the UK web captured by the Internet Archive, but it lacks metadata such as subject and date of publication. The project also aims to highlight the value of web archives as historical sources.
2. www.bl.uk 2
Big UK Domain Data for the Arts & Humanities
• Led by Dr Jane Winters (IHR) (@jfwinters)
• In partnership with the British Library and the Oxford Internet Institute
• With the help of Niels Brügger (Aarhus University, Denmark) @NielsBr
• Co-investigators: Ralph Schroeder, Eric Meyer (@etmeyer), Helen
Hockx-Yu (@hhockx)
• The team includes: Jonathan Blaney, @JoshCowls, @anjacks0n
• Funded by the AHRC, Jan 2014 – March 2015
• http://buddah.projects.history.ac.uk/
Project aims
• To highlight the value of web archives as a source for the arts and humanities (A&H), and to
transform the way in which researchers interact with the data
• To establish a theoretical and methodological framework for the analysis
of web archives, focusing on the JISC UK Web Domain Dataset
• To explore the ethical implications of big data research, particularly
as they relate to the web
• To inform collection development and access arrangements for the UK
web archive at the British Library
Project outputs
• a suite of tools to support analysis of web archives by A&H researchers
• an enhanced interface through which researchers access the archived
material held by the British Library
• a history of the development of UK web space from 1996 to 2013,
analysing technical, social, organisational and cultural developments
and trends in the dataset
• a series of case studies across a range of A&H disciplines
• two project workshops, bringing together researchers, archivists,
technologists, and digital preservation professionals
• a free online training module illustrating the use of web archives and the
application of big data techniques and methods
Forthcoming event
Web archives as big data
Wednesday, 3 December 2014 from 09:45 to 17:30
IHR, Senate House, United Kingdom
Booking at: http://tinyurl.com/webarchives
A new class of primary source?
Deswarte and Webster, “Web Archives: A New Class of Primary Source for Historians?”,
IHR Digital History seminar, 2013, reporting on the predecessor project (AADDA)
http://tinyurl.com/qca3yy5
The UK Web Archive: three archives in one
Open UK Web Archive (2004-)
• c.14,000 sites
• Curated, selective, permission-based
• webarchive.org.uk
Legal Deposit UK Web Archive (2013-)
• legal framework
• c.4-5 million hosts per year
• onsite only
JISC UK Web Domain Dataset
JISC UK Web Domain Dataset 1996-2013
• Funded by JISC to create a research collection of UK
websites
• Collaboration between the Internet Archive, JISC and the
British Library
• A copy of the subset of the Internet Archive’s web collection that
relates to the UK
• c.300 million resources, 60TB in total
• No local access; access is possible through the Internet Archive
• Can be used to generate secondary datasets
Use cases (generalised)
• Full-text/facet search -> individual resource
• Full-text/facet search -> analysis/visualisation
• Search -> corpus creation -> annotation/curation
• Corpus creation -> full-text search -> individual resource
• Corpus -> search -> analysis/visualisation
• Derived datasets -> take-away
• Direct access to WARC -> take-away
What do we know about each resource?
From the crawl data
• crawl date
• URL (/page.html, host.domain.co.uk, domain.co.uk, .co.uk)
• file format
• file size
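The URL levels listed above (path, host, domain, suffix) can be illustrated with a minimal Python sketch. It is not the project's tooling: the function name `url_levels` and the short `UK_SUFFIXES` tuple are assumptions made for illustration, and a real analysis would need a complete public-suffix list.

```python
from urllib.parse import urlsplit

# Assumption: a small illustrative set of UK second-level suffixes.
# A real analysis would need a complete public-suffix list.
UK_SUFFIXES = ("co.uk", "org.uk", "ac.uk", "gov.uk")


def url_levels(url):
    """Split a URL into the four levels: path, host, domain, suffix."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    # Use the first known suffix the host ends with, else plain "uk".
    suffix = next((s for s in UK_SUFFIXES if host.endswith("." + s)), "uk")
    # The domain is the suffix plus the single label to its left.
    prefix = host[: -(len(suffix) + 1)]
    domain = prefix.split(".")[-1] + "." + suffix if prefix else host
    return {
        "path": parts.path or "/",
        "host": host,
        "domain": domain,
        "suffix": "." + suffix,
    }


levels = url_levels("http://host.domain.co.uk/page.html")
```

Grouping resources by any one of these keys gives page-level, host-level, domain-level or suffix-level views of the same crawl data.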
What do we know about each resource?
From the full-text index
• page title
• link destinations (host.domain.co.uk, domain.co.uk, .co.uk)
• author (sometimes)
• language (sometimes)
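As a hedged illustration of how the link destinations above might be summarised, the sketch below counts destination hosts for a list of link URLs. The function name `count_link_hosts` and the sample input are hypothetical; the actual format of the full-text index is not described in these slides.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_link_hosts(link_urls):
    """Count how often each destination host appears in a list of link URLs."""
    hosts = (urlsplit(u).hostname for u in link_urls)
    # Relative links have no hostname (None) and are skipped.
    return Counter(h for h in hosts if h)


# Hypothetical sample of link destinations extracted from one page.
sample_links = [
    "http://www.bl.uk/a",
    "http://www.bl.uk/b",
    "http://webarchive.org.uk/",
    "/page.html",
]
host_counts = count_link_hosts(sample_links)
```

The same counting could be repeated at the domain or suffix level once each host has been reduced to that level.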
What *don’t* we know?
• subject
• geographic scope
• publisher
• date of publication
• date of last amendment