The web is a mess: how I learnt to stop worrying and love web archiving. Kristine Hanna



Presented at the International Conference on Web Archives and Electronic Legal Deposit (Jornada Internacional sobre Archivos Web y Depósito Legal Electrónico), held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

Published in: Technology, Education


  1. 1. The Web is a Mess How I learned to stop worrying and love web archiving
  2. 2. About Internet Archive. We are a Digital Library. Mission Statement: Universal access to all knowledge • Founded by Brewster Kahle in San Francisco, California in 1996 • Officially designated a Library by the State of California in 2007
  3. 3. 500,000 Books
  4. 4. 500,000 Books • 500,000 Moving Images
  5. 5. 500,000 Books • 500,000 Moving Images • 1,000,000 Audio Recordings
  6. 6. 500,000 Books • 500,000 Moving Images • 1,000,000 Audio Recordings • 2,000,000 Hours of TV
  7. 7. 500,000 Books • 500,000 Moving Images • 1,000,000 Audio Recordings • 2,000,000 Hours of TV • 3,600,000 eBooks
  8. 8. Access to General Web Archive. The Archive is accessible to the public via the website: • Started collecting content in 1996 • First web pages publicly available in 2001 • 347+ billion web pages • 200+ million websites • Almost every domain • Content in 140+ languages • Collects a broad summary of the web every 30-60 days, approximately 10 billion pages per snapshot
  9. 9. What is Web Archiving? Web archiving is the process of collecting portions of web content, preserving the collections, and then providing access to the archives for use and reuse.
  10. 10. What is a Web Archive? A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address. A web archive contains as much as possible from the original resources and documents the change over time. It is a priority to recreate the same experience a user would have had if they had visited the live site on the day it was archived.
  11. 11. Who is archiving the web?
  12. 12. Why are We Doing This? • Web archives preserve the web. They act as the web equivalent of the archive or library, with a mission to acquire and preserve the web and ensure its continued survival for future generations. • Billions of people around the world have grown accustomed to using the web as their primary resource to acquire information. • The availability of this electronic information is taken for granted, and it is a fallacy that if something is on the web it will be there forever. • There’s an essential need for people to understand that the web represents who we are. It’s our culture and our social fabric, and we don’t want to lose it.
  13. 13. Why should we archive the web?
  14. 14. How long does a website live? • A 1997 report in Scientific American claims 44 days. • A subsequent 2001 academic study in IEEE suggests 75 days. • A 2003 Washington Post article indicates the number is 100 days. • A 2013 study by Old Dominion University says that after the first year of publishing, nearly 11% of social media will be lost, and after that we will continue to lose 0.02% per day.
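The Old Dominion figures on slide 14 can be turned into a rough estimate. The sketch below is only an illustration and assumes a linear reading of the slide: 11% lost at the one-year mark, plus 0.02 percentage points for each further day. The function name `percent_lost` is hypothetical, and the slide gives no figure for loss within the first year, so the function refuses to guess there.

```python
def percent_lost(days_since_publication):
    """Estimate the percentage of content lost, per the slide's figures.

    Assumption: loss is linear after the first year (11% at day 365,
    then 0.02 percentage points per additional day), capped at 100%.
    """
    if days_since_publication < 365:
        # The slide states no rate for the first year, so none is invented here.
        raise ValueError("no estimate available for the first year")
    extra_days = days_since_publication - 365
    return min(100.0, 11.0 + 0.02 * extra_days)


two_years = percent_lost(2 * 365)
print(f"Estimated loss after two years: {two_years:.1f}%")
```

Under this reading, the loss after two years is 11% plus 365 × 0.02%, or roughly 18%.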
  15. 15. Web Archiving Use Cases • Create a thematic/topical web archive on a specific subject • Capture ‘at risk’ content during a spontaneous event • Fulfill organizational mandate to preserve institutional memory & history • Archive state/local agency publications no longer deposited in print form • Archive records to meet university and/or government retention policies • Collect content to act as a research service for scholars to turn to • Capture social media sites as part of organizational records • Collect web-based information to augment physical holdings • Archive online art ephemera • End of Life/Closure
  16. 16. How does web archiving work?
  17. 17. What is a crawler? A crawler is the software that captures and archives web pages. A crawler visits a page and indexes the content included therein.
  18. 18. Some technical challenges in capturing content • Dynamic content relies on scripting technologies (Flash and JavaScript). The web is a hodgepodge of technologies, some old and outdated, others at the cutting edge. • Capturing social media sites has become necessary as the web moves away from HTML and towards applications. • Capture mechanisms beyond the traditional crawler need to be explored: hybrid architectures, APIs, headless browsers.
  19. 19. Challenge: a lot of data • Amount of content that is being archived • Amount of data being created by content providers
  20. 20. Challenge: How much to archive? There are limits…
  21. 21. Challenge: What to archive? …What is important to you? What do you want people to know about? What are your organization’s collecting activities? Vision?
  22. 22. Participant Poll • Does any of this make any sense?
  23. 23. Managing Collections
  24. 24. Starting a Collection Collection: A group of URLs crawled and organized around a common theme, topic or domain Ask Yourself: • What is the topic of this collection? • What websites would you like to archive as part of this collection?
  25. 25. Collections Start with Seeds • Seed: starting point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are ‘in scope’. • Document: any file with a distinct URL (HTML, image, PDF, video, etc.).
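The seed-and-scope idea on slide 25 can be sketched in a few lines of Python. This is a toy illustration, not the crawler the Internet Archive runs. It assumes one simple scope rule (same host as the seed, path under the seed's path), and the names `crawl`, `in_scope`, `LinkExtractor` and the in-memory `FAKE_WEB` are all hypothetical, chosen so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, seed):
    """Assumed scope rule: same host as the seed, path under the seed's path."""
    u, s = urlparse(url), urlparse(seed)
    return u.netloc == s.netloc and u.path.startswith(s.path)


def crawl(seed, fetch, max_documents=100):
    """Breadth-first crawl from one seed; returns {url: content} of archived documents."""
    archived = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archived) < max_documents:
        url = queue.popleft()
        body = fetch(url)  # fetch returns None when the document is unavailable
        if body is None:
            continue
        archived[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            # Resolve relative links and drop #fragments before the scope check.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(link, seed):
                seen.add(link)
                queue.append(link)
    return archived


# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.org/blog/": '<a href="post1.html">1</a> <a href="/about.html">about</a>',
    "http://example.org/blog/post1.html": '<a href="/blog/">home</a> <a href="http://other.example/">x</a>',
    "http://example.org/about.html": "About",
    "http://other.example/": "elsewhere",
}

archive = crawl("http://example.org/blog/", FAKE_WEB.get)
print(sorted(archive))
```

Here the seed and the post linked beneath it are archived, while `/about.html` and the external site are skipped as out of scope. A real crawl would also need politeness delays, robots handling and a storage format, none of which are shown.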
  26. 26. Some of our Partners’ Digital Collections • Stanford University (Palo Alto, California) • American University in Cairo • Biblioteca Nacional de España
  27. 27. Stanford University, Islamic & Middle Eastern Collection Use Case: harvest and preserve Iranian Blogs • Archiving over 300 blogs written by and for Iran and the Iranian people • Includes coverage of 2009 Iranian elections and the current Middle East unrest
  28. 28. Stanford and New York Universities Islamic and Middle Eastern Collection
  29. 29. American University in Cairo Use Case: The American University in Cairo Web Archive collects, preserves, and provides access to the web content published by students, faculty, departments, and offices at AUC. The archive also collects Web documents that have long-term research or historical value.
  30. 30. January 25th Revolution and University on the Square Demonstrators in Tahrir Square. Image courtesy of Ahmad and the American University in Cairo Rare Books and Special Collections Library.
  31. 31. Archivist Driven Captures Thank you to Egypt's youth and Facebook. Image courtesy of Martin and Amy Rowe and the American University in Cairo Rare Books and Special Collections Library.
  32. 32. Patron Driven Captures Screenshot of the University on the Square Contribution form. In addition to soliciting photos and videos, we asked content providers to suggest websites, blogs, Twitter feeds, etc.
  33. 33. Archivist as Advocate Protester documenting the demonstrations in Tahrir Square. Image courtesy of Robeir Rasmy and the American University in Cairo Rare Books and Special Collections Library.
  34. 34. Biblioteca Nacional de España: Breaking down the life cycle • One of its top priorities as a memory institution is to consolidate whichever strategies lead to the integral preservation of Spanish Internet-published contents, in accordance with the library's mission as keeper and disseminator of Spanish culture. • Commitment to its patrons, who expect the web archive to become a publicly and freely accessible key information source for the study of the 21st century.
  35. 35. Biblioteca Nacional de España: Breaking down the life cycle Use cases: • 2011 Election crawl • 2012 Humanities crawl • 2009-present .es domain crawls • 2013 .es Broad Survey Crawl, visited the top level page of every web site registered to .es (in partnership with • 2011-2013 Thematic curation (World Cups, Olympics, Global Hunger)
  36. 36. http://www.udatleticoisleñ
  37. 37.
  38. 38. • Archived web page from Facebook and/or Flickr
  39. 39. irata_(España)
  40. 40.
  41. 41.
  42. 42.
  43. 43. Not available on the live web
  44. 44.
  45. 45. Not available on the live web
  46. 46. Making sense of it all • Web Archiving Life Cycle Model • Internet Archive future objectives – Social Media – Distributed Content – Visualization and analytical tools for more useful interaction – Search – Mobile platforms – Enhanced Researcher Access
  47. 47. Web Archiving Life Cycle Model Web Archiving Life Cycle Model white paper available:
  48. 48. Breaking down the life cycle Outer layer: • Vision and Objectives • Resources and Workflow • Access / Use / Reuse • Preservation • Risk Management Inner circle: • Appraisal and Selection • Scoping • Data Capture • Storage and Organization • Quality Assurance and Analysis
  49. 49. Participant Poll • Are you confused yet? I hope not. Happy to answer questions!
  50. 50. The importance of web archiving “As our digital world continues to grow at a breathtaking pace and more and more of our daily lives occur within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.” Kalev H. Leetaru, University of Illinois
  51. 51. Thank you! Kristine Hanna, Director, Archiving Services, Internet Archive