Prospects and pitfalls in using web archives for research

A lecture given at the Moore Institute at the National University of Ireland Galway. It lays out the case for archiving the web as a source for future scholarly enquiry; examines the state of play of web archiving in Ireland; outlines the broad use cases for the archived web; and presents results from research into creationism on the web in the UK and in Ireland.

Published in: Internet

  1. 1. A new class of primary source? Prospects and pitfalls in using web archives for research. Dr Peter Webster, Webster Research and Consulting, @pj_webster
  2. 2. A lost archive?
  3. 3. A lost archive?
  4. 4. A lost archive?
  5. 5. The web its own archive? Open UK Web Archive 2004-13 comparison. @anjacks0n http://britishlibrary.typepad.co.uk/webarchive/2014/10/what-is-still-on-the-web-after-10-years-of-archiving-.html
  6. 6. Disappearing predictably
  7. 7. Disappearing unpredictably
  8. 8. ... but safe and sound in the archive
  9. 9. Reasons to care about web archiving • education and research • enforcement of the law • public accountability
  10. 10. Three archives for the UK
      Archive | Temporal scope | Content scope | Access
      Open UKWA | 2004-present | Selective (14.7k) | Online
      Legal Deposit UKWA | 2013-present | Comprehensive (for UK) | Onsite
      JISC UK Domain Dataset | 1996-2013 | Comprehensive (for .uk) | Index only
  11. 11. JISC UK Web Domain Dataset (1996-2013) • copy of Internet Archive holdings for .uk • bought by JISC, held by British Library • 60TB of data • no direct access to content • prototype search at webarchive.org.uk/shine • derived datasets in public domain
  12. 12. Web archives for NI and RoI
      Archive | Temporal scope | Content scope | Access
      NLI Web Archive | 2011-present | Selective (542) | Online
      PRONI Web Archive | 2010-present | Selective (115) | Online
      Legal Deposit UKWA | 2013-present | Comprehensive (for UK!) | Onsite (TCD)
  13. 13. Ways to use the archived web • URL search -> single page • Full-text search -> single page • Visualisation -> trend -> page
  14. 14. Changing aesthetics gov.ie, captured by archive.org, 15 August 2000
  15. 15. Vanished content southtippcoco.ie, captured by archive.org, 4 Jan 2014
  16. 16. Visualising trends: Ngram http://www.webarchive.org.uk/shine/graph
  17. 17. Ways to use the archived web • URL search -> single page • Full-text search -> single page • Visualisation -> trend -> page • Direct access to WARC • Derived datasets • API access
  18. 18. Derived datasets from the BL. From JISC UK Web Domain Dataset (1996-2010): • File format profile • Geo-index • Crawled URL Index (CDX) • Host Link Graph. Public domain at data.webarchive.org.uk
  19. 19. Creationism ? • non-evolutionary account of human origins • modern • a long history • a feature of some parts of evangelicalism • (anti-evolutionism, Intelligent Design)
  20. 20. The creationist web: three questions • A justified conspiracy theory about marginalisation of creationist voices? • A real danger or a moral panic (Truth in Science)? • The web as friend of the marginalised opinion? http://peterwebster.me/2014/11/18/reading-creationism-in-the-web-archive/
  21. 21. UK Host Link Graph (1996-2010)
      2008 | newsimg.bbc.co.uk | youtube.com | 45
      2008 | archbishopofyork.org.uk | flickr.com | 1
      2002 | secularism.org.uk | geocities.com | 1
      Public domain at: data.webarchive.org.uk
  22. 22. Approach • selection of key UK creationist sites • extraction of all unique inbound referring hosts for 1996-2010 • inspection and classification
  23. 23. Caveats on method • partial nature of the dataset • benchmarking of absolute numbers • selective sample • what does a link mean, anyway ? • not looking at number of linking resources per host
  24. 24. Truth in Science: how significant? • only 46 unique inbound hosts • … of which many were other creationists or secularist sites • two churches, one school • fewer in 2010 than 2007
  25. 25. Conclusions • a utopian dream unfulfilled • a genuine moral panic • a justified conspiracy theory
  26. 26. Next steps (1)
      1. NI the 'creationism capital of Europe'? Analysis of: • links from GB organisations to NI creationists • links from NI to RoW
      2. What about creationism in .ie?
  27. 27. Next steps (2) Project: EU National Web Spheres • part of resaw.eu • investigating the nature of a national web domain • ... including the interlinking between them • case study I: Anglican & Presbyterian churches in Ireland, north and south
  28. 28. Web Archives for Historians: @HistWebArchives, http://webarchivehistorians.org/
  29. 29. Questions ? Peter Webster peter@websterresearchconsulting.com @pj_webster peterwebster.me websterresearchconsulting.com
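
The approach on slides 21 to 23 (pick a set of target sites, then extract every unique inbound referring host for 1996-2010) could be sketched roughly as below. This is an illustration only, not the code used in the research: it assumes each record is laid out as `year | source host | target host | count`, as shown on slide 21, and the published files at data.webarchive.org.uk may use a different delimiter. The function names and the choice of `youtube.com` as a target are hypothetical; the sample rows are the three shown on the slide.

```python
from collections import defaultdict

# The three example rows from slide 21; a real run would stream the full file.
SAMPLE = [
    "2008 | newsimg.bbc.co.uk | youtube.com | 45",
    "2008 | archbishopofyork.org.uk | flickr.com | 1",
    "2002 | secularism.org.uk | geocities.com | 1",
]


def parse_link_line(line):
    """Parse 'year | source | target | count' (assumed layout) into a tuple.

    Returns None for lines that do not fit that layout.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        return None
    year, source, target, count = parts
    try:
        return int(year), source, target, int(count)
    except ValueError:
        return None


def inbound_hosts(lines, targets, first_year=1996, last_year=2010):
    """Collect unique inbound referring hosts for each target host.

    Returns two mappings:
      by_target[target]        -> set of hosts linking in, over the whole period
      by_year[(target, year)]  -> set of hosts linking in during that year
    The link count is ignored on purpose: as in the caveats on slide 23,
    only the presence of a linking host is recorded, not how many of its
    resources link.
    """
    by_target = defaultdict(set)
    by_year = defaultdict(set)
    for line in lines:
        record = parse_link_line(line)
        if record is None:
            continue
        year, source, target, _count = record
        if target not in targets or not first_year <= year <= last_year:
            continue
        if source == target:
            continue  # skip self-links
        by_target[target].add(source)
        by_year[(target, year)].add(source)
    return by_target, by_year


if __name__ == "__main__":
    by_target, by_year = inbound_hosts(SAMPLE, {"youtube.com"})
    for target, hosts in sorted(by_target.items()):
        print(target, len(hosts), sorted(hosts))
```

Comparing the sizes of the per-year sets for two years gives the kind of comparison made on slide 24 (fewer inbound hosts in 2010 than in 2007); the hosts themselves still need manual inspection and classification, as slide 22 notes.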
