The web is a mess: how I learnt to stop worrying and love web archiving. Kristine Hanna

Presented at the International Conference on Web Archives and Electronic Legal Deposit (Jornada Internacional sobre Archivos Web y Depósito Legal Electrónico), held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

Transcript

  • 1. The Web is a Mess: How I learned to stop worrying and love web archiving
  • 2. About Internet Archive. We are a Digital Library. Mission Statement: Universal access to all knowledge o Founded by Brewster Kahle in San Francisco, California in 1996 o Officially designated a Library by the State of California in 2007
  • 3. 500,000 Books
  • 4. 500,000 Books; 500,000 Moving Images
  • 5. 500,000 Books; 500,000 Moving Images; 1,000,000 Audio Recordings (image: http://flickr.com/photos/marfis75/)
  • 6. 500,000 Books; 500,000 Moving Images; 1,000,000 Audio Recordings; 2,000,000 Hours of TV
  • 7. 500,000 Books; 500,000 Moving Images; 1,000,000 Audio Recordings; 2,000,000 Hours of TV; 3,600,000 eBooks
  • 8. Access to General Web Archive. The Archive is accessible to the public via the website: www.archive.org o Started collecting content in 1996 o First web pages publicly available in 2001 o 347+ billion web pages o 200+ million websites o Almost every domain o Content in 140+ languages o Collect a broad summary of the web every 30-60 days (approximately 10 billion pages per snapshot)
  • 9. What is Web Archiving? Web archiving is the process of collecting portions of web content, preserving the collections, and then providing access to the archives for use and reuse.
  • 10. What is a Web Archive? A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address. A web archive contains as much as possible from the original resources and documents the change over time. It is a priority to recreate the same experience a user would have had if they had visited the live site on the day it was archived.
  • 11. Who is archiving the web?
  • 12. Why are We Doing This? • Web archives preserve the web. They act as the web equivalent of the archive or library. In this role, their mission is to acquire and preserve the web, ensuring its continued survival for future generations. • Billions of people around the world have grown accustomed to using the web as their primary resource for acquiring information. • The availability of this electronic information is taken for granted, and it is a fallacy that if something is on the web it will be there forever. • People need to understand that the web represents who we are. It’s our culture and our social fabric, and we don’t want to lose it.
  • 13. Why should we archive the web?
  • 14. How long does a website live? • A 1997 report in Scientific American claims 44 days. • A subsequent 2001 academic study in IEEE suggests 75 days. • A 2003 Washington Post article indicates the number is 100 days. • A 2013 study by Old Dominion University says that after the first year of publishing, nearly 11% of social media will be lost, and after that we will continue to lose 0.02% per day.
  • 15. Web Archiving Use Cases • Create a thematic/topical web archive on a specific subject • Capture ‘at risk’ content during a spontaneous event • Fulfill organizational mandate to preserve institutional memory & history • Archive state/local agency publications no longer deposited in print form • Archive records to meet university and/or government retention policies • Collect content to act as a research service for scholars to turn to • Capture social media sites as part of organizational records • Collect web-based information to augment physical holdings • Archive online art ephemera • End of Life/Closure
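The Old Dominion University figures quoted on this slide can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the function name is hypothetical, and the linear model (a steady climb to 11% during the first year, then 0.02 percentage points per day, capped at 100%) is an assumption made here, not something the slide or the study states.

```python
def estimated_fraction_lost(days_since_publication: int) -> float:
    """Estimate the share of social media resources lost after a number of days.

    Assumptions (not from the slide): loss grows linearly to 11% over the
    first 365 days, then by 0.02 percentage points per day, capped at 100%.
    """
    if days_since_publication < 0:
        raise ValueError("days_since_publication must be non-negative")
    first_year_loss = 0.11
    daily_loss_after_first_year = 0.0002  # 0.02% per day
    if days_since_publication <= 365:
        # Assumed linear ramp during the first year.
        return first_year_loss * days_since_publication / 365
    extra_days = days_since_publication - 365
    return min(1.0, first_year_loss + daily_loss_after_first_year * extra_days)


if __name__ == "__main__":
    for years in (1, 2, 5):
        share = estimated_fraction_lost(365 * years)
        print(f"after {years} year(s): about {share:.1%} lost")
```

Under these assumptions, roughly 18% of the material would be gone after two years and about 40% after five.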
  • 16. How does web archiving work?
  • 17. What is a crawler? A crawler is the software that captures and archives web pages. A crawler visits a page and indexes the content it contains.
  • 18. Some technical challenges in capturing content • Dynamic content relies on scripting technologies (Flash and JavaScript). The web is a hodgepodge of technologies, some old and outdated, others at the cutting edge. • Capturing social media sites has become necessary as the web moves away from HTML and towards applications. • Explore capture mechanisms other than a traditional crawler: hybrid architecture, APIs, headless browsers.
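The idea of a crawler visiting a page, capturing it, and following its links can be sketched in a few lines. This is a minimal illustration, not the crawler the Internet Archive runs: `crawl`, `LinkExtractor`, and the in-memory `FAKE_WEB` are hypothetical names introduced here, and the `fetch` function is supplied by the caller so the example runs without network access. A production crawler would also need politeness delays, robots.txt handling, and durable storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch is a caller-supplied function (hypothetical here) that returns the
    HTML for a URL, or None if the page cannot be retrieved.
    Returns a dict mapping each captured URL to its content.
    """
    captured = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html  # "archive" the page content
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/about">About</a> <a href="/news#top">News</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/news": "<p>No links here.</p>",
}

if __name__ == "__main__":
    pages = crawl("http://example.org/", FAKE_WEB.get)
    print(sorted(pages))
```

Passing `fetch` in as a parameter keeps the traversal logic separate from how pages are retrieved, so the same loop could be pointed at a real HTTP client or a headless browser.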
  • 19. Challenge: a lot of data. [Images contrasting the amount of content that is being archived with the amount of data being created by content providers. Image sources: http://www.chaitalag.com/new/s/tubig and http://www.helenbrowngroup.com/2011/02/rescue-from-the-digital-firehose/gushing-firehose-by-joseph-robertson/]
  • 20. Challenge: How much to archive? There Are Limits…
  • 21. Challenge: What to archive? …What is important to you? What do you want people to know about? What are your organization’s collecting activities? Vision?
  • 22. Participant Poll • Does any of this make any sense?
  • 23. Managing Collections
  • 24. Starting a Collection. Collection: a group of URLs crawled and organized around a common theme, topic, or domain. Ask yourself: • What is the topic of this collection? • What websites would you like to archive as part of this collection?
  • 25. Collections Start with Seeds • Seed: the starting-point URL for the crawler. The crawler will follow linked pages from your seed URL and archive them if they are ‘in scope’. • Document: any file with a distinct URL (HTML, image, PDF, video, etc.).
  • 26. Some of our Partners’ Digital Collections • Stanford University (Palo Alto, California) • American University in Cairo • Biblioteca Nacional de España
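How a crawler might decide whether a linked page is ‘in scope’ for a seed can be sketched as follows. The scoping rule used here (same host as the seed, and a path that begins with the seed's path) is an assumption chosen for illustration; the slide does not say which rules are actually applied, and `in_scope` and `filter_links` are hypothetical names.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed: str) -> bool:
    """Return True if url falls under the seed, by an assumed scoping rule.

    Assumed rule (for illustration only): same host as the seed, and a path
    that starts with the seed's path.
    """
    candidate, origin = urlsplit(url), urlsplit(seed)
    if candidate.hostname != origin.hostname:
        return False
    seed_path = origin.path or "/"
    candidate_path = candidate.path or "/"
    return candidate_path.startswith(seed_path)


def filter_links(links, seed):
    """Keep only the discovered links that the crawler should archive."""
    return [link for link in links if in_scope(link, seed)]


if __name__ == "__main__":
    seed = "http://example.org/blog/"
    discovered = [
        "http://example.org/blog/2013/07/post.html",
        "http://example.org/blog/images/photo.jpg",
        "http://example.org/shop/",
        "http://other.example.com/blog/",
    ]
    print(filter_links(discovered, seed))
```

With the seed `http://example.org/blog/`, only the two documents under `/blog/` on the same host are kept; the shop page and the page on another host are treated as out of scope.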
  • 27. Stanford University, Islamic & Middle Eastern Collection Use Case: harvest and preserve Iranian Blogs • Archiving over 300 blogs written by and for Iran and the Iranian people • Includes coverage of 2009 Iranian elections and the current Middle East unrest
  • 28. Stanford and New York Universities Islamic and Middle Eastern Collection
  • 29. American University in Cairo. Use Case: The American University in Cairo Web Archive collects, preserves, and provides access to the web content published by students, faculty, departments, and offices at AUC. The archive also collects Web documents that have long-term research or historical value.
  • 30. January 25th Revolution and University on the Square Demonstrators in Tahrir Square. Image courtesy of Ahmad and the American University in Cairo Rare Books and Special Collections Library.
  • 31. Archivist Driven Captures. Thank you to Egypt's youth and Facebook. Image courtesy of Martin and Amy Rowe and the American University in Cairo Rare Books and Special Collections Library.
  • 32. Patron Driven Captures. Screenshot of the University on the Square contribution form. In addition to soliciting photos and videos, we asked content providers to submit websites, blogs, Twitter feeds, etc.
  • 33. Archivist as Advocate. Protester documenting the demonstrations in Tahrir Square. Image courtesy of Robeir Rasmy and the American University in Cairo Rare Books and Special Collections Library.
  • 34. Breaking down the life cycle: Biblioteca Nacional de España • One of the library's top priorities as a memory institution is to consolidate the strategies that lead to the complete preservation of content published on the Spanish Internet, in accordance with its mission as keeper and disseminator of Spanish culture. • A commitment to its patrons, who expect the web archive to become a publicly and freely accessible key information source for the study of the 21st century.
  • 35. Breaking down the life cycle: Biblioteca Nacional de España. Use cases: • 2011 Election crawl • 2012 Humanities crawl • 2009-present .es domain crawls • 2013 .es Broad Survey Crawl, which visited the top-level page of every website registered under .es (in partnership with Red.es) • 2011-2013 Thematic curation (World Cups, Olympics, Global Hunger)
  • 36. http://www.udatleticoisleño.es
  • 37. http://www.facebook.com/eajpnv
  • 38. http://twitter.com/xalmar • Archived web page from Facebook and/or Flickr
  • 39. http://es.wikipedia.org/wiki/Partido_Pirata_(España)
  • 40. http://www.estrelladigital.es
  • 41. http://leer.es
  • 42. http://iuabierta.blogspot.com
  • 43. Not available on the live web
  • 44. http://www.piratamadrid.es
  • 45. Not available on the live web
  • 46. Making sense of it all • Web Archiving Life Cycle Model • Internet Archive future objectives – Social Media – Distributed Content – Visualization and analytical tools for more useful interaction – Search – Mobile platforms – Enhanced Researcher Access
  • 47. Web Archiving Life Cycle Model. White paper available: http://www.archive-it.org/publications
  • 48. Breaking down the life cycle. Outer layer: • Vision and Objectives • Resources and Workflow • Access / Use / Reuse • Preservation • Risk Management. Inner circle: • Appraisal and Selection • Scoping • Data Capture • Storage and Organization • Quality Assurance and Analysis
  • 49. Participant Poll • Are you confused yet? I hope not. Happy to answer questions!
  • 50. The importance of web archiving: “As our digital world continues to grow at a breathtaking pace and more and more of our daily life occurs within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.” Kalev H. Leetaru, University of Illinois
  • 51. Thank you! Kristine Hanna, Director, Archiving Services, Internet Archive, kristine@archive.org