Peter Webster - Digital History - 11 June 2013



  1. Web archives: a new class of primary source for historians?
     Peter Webster (British Library), @pj_webster / @UKWebArchive
  2. Scarcity or abundance?
       • Rosenzweig, ‘Scarcity or Abundance? Preserving the Past in a Digital Era’, American Historical Review 108, 3 (June 2003)
  3. Web archiving DIY
       • BootCat ( )
       • Wget, by @ianmilligan1 ( )
       • The Historian’s WARC Toolkit ( )
  4. UK Web Archive
       • Selective archiving since 2004
       • 13,000 sites, 60,000 instances, 20TB of data
       • British Library, National Library of Wales, JISC
       • Plus many collaborators: Women’s Library, Live Art Development Agency, NHS
  5. An archived (archived 24/5/05) at UK Web Archive
  6. An archived website in (archived 24/5/05) at UK Web Archive
  7. Non-Print Legal Deposit (web): what may we collect?
     Web resources that:
       • are issued from a .uk or other UK geographic top-level domain, or
       • where part of the publishing process takes place in the UK;
       • but excluding any which are only accessible to audiences outside the UK.
  8. JISC UK Web Domain Dataset 1996-2010
       • Funded by JISC to create a research collection of UK websites
       • Collaboration between the Internet Archive, JISC and the British Library
       • Copy of subset of the Internet Archive’s web collection that relates to the UK
       • 470,466 files (arc.gz & warc.gz), 32TB in total
       • No local access – possible through the Internet Archive
       • Can be used to generate secondary datasets
  9. Big Data project (Oxford Internet Institute)
       • “Demonstrating the value of the UK Web Domain Dataset for social science research”
       • Led by Professor Helen Margetts
       • Link analysis of structure of UK government web estate
       • Funded by the JISC
  10. Datasets available for download
       • Link data 1996 | | 119GB, available at: (compressed) at:
       • format analysis
  11. HTML version analysis
  12. Analytical Access to the Domain Dark Archive (AADDA)
       • Led by Dr Jane Winters (IHR)
       • In partnership with the British Library and the University of Cambridge
       • Funded by the JISC
       • Bringing together HSS researchers to help the Library develop a web user interface.
       • Feb 2012 – Oct 2013
  13. Ngram: Prime Ministers
  14. Simple search
  15. Search facets
       • crawl date (year)
       • host, private suffix and public suffix
       • outward links (to host, private suffix or public suffix)
       • content type (HTML, PDF, image)
       • language
       • postcode district (M20, WC1E)
       • sentiment
  16. Simple search with facets
  17. Proximity search with facets
  18. Archived site in Internet Archive
  19. Methodological challenges: what is in the archive?
       • National web archives: some selective, some legal deposit
       • When is comprehensive not comprehensive?
       • Defining the national ( )
  20. Methodological challenges: when was it in the archive?
       • Understanding the crawl profile
       • Crawl date NOT publication date
       • Citation standard: what, when archived
  21. Thank you!
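
As an illustration of the host, private suffix and public suffix facets listed on slide 15, here is a minimal Python sketch. It is not the UK Web Archive's implementation: the function name `domain_facets`, the small sample suffix set, and the reading of "private suffix" as the public suffix plus one further label are all assumptions made for this example. A real index would consult the full Public Suffix List rather than a hand-picked sample.

```python
from urllib.parse import urlsplit

# Assumption: a tiny hand-picked sample of public suffixes, for illustration only.
SAMPLE_PUBLIC_SUFFIXES = {"uk", "co.uk", "ac.uk", "gov.uk", "org.uk", "com", "org"}


def domain_facets(url):
    """Split a URL's host into host / private suffix / public suffix facets.

    Hypothetical helper: "private suffix" is taken here to mean the public
    suffix plus the one label to its left (e.g. history.ac.uk).
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    public_suffix = None
    private_suffix = None
    # Walk from the longest candidate to the shortest so that "ac.uk"
    # is preferred over the bare "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SAMPLE_PUBLIC_SUFFIXES:
            public_suffix = candidate
            if i > 0:
                private_suffix = ".".join(labels[i - 1:])
            break
    return {
        "host": host,
        "private_suffix": private_suffix,
        "public_suffix": public_suffix,
    }


if __name__ == "__main__":
    print(domain_facets("http://www.history.ac.uk/projects"))
```

Under these assumptions, the example URL yields the host `www.history.ac.uk`, the private suffix `history.ac.uk` and the public suffix `ac.uk`, which is the kind of grouping that lets a researcher facet results by institution or by sector of the UK web.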