Web Archiving: A Brief Introduction

An invited talk on web archiving, given to the SPiRIT Research Group at Magdeburg-Stendal University, Germany.

Published in: Internet

  1. 1. Web Archiving A Brief Introduction Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA)
  2. 2. About Me Sawood Alam Lexical Signature Web, Digital Library, Web Archiving, Ruby on Rails, PHP, XHTML, CSS, JavaScript, ExtJS, Urdu, RTL and Linux. ● BTech, Jamia Millia Islamia, India, 2008 ● MSc, Old Dominion University, USA, 2013 ● PhD, Old Dominion University, USA, Current
  3. 3. She Calls Me Dad!
  4. 4. Agenda ● Archiving and Web archiving ● Purpose and importance ● Scope of Web archiving ● Issues and challenges ● Tools and techniques ● Memento: Time Travel for the Web ● Archive X-Ray ● Research opportunities in Web archiving ● Our WSDL Research Group
  5. 5. What is an Archive? ● Accumulation of historical records ● Long-term storage and preservation ● Less frequently used ● Physical or digital
  6. 6. What is Web Archiving? ● Periodic snapshots of web pages ● Preserving important events on the Web ● Making archived content accessible
  7. 7. Why Do We Care About Archiving? Web content decays rapidly! ● To preserve history ● To tell a story ● For evidence ● For backup ● For personal satisfaction
  8. 8. Issues and Challenges ● Crawling ● Storage ● Retrieval ● Replay ● Accessibility ● Completeness ● Accuracy ● Credibility
  9. 9. Web Archiving Efforts ● Internet Archive ● Archive-It ● Wikipedia ● UK Web Archive ● Various national and non-profit archives ● Film, music and other multimedia archives ● Scholarly archives ● Personal archiving
  10. 10. Tools and Techniques ● Heritrix, PhantomJS, Wget, cURL ● OpenWayback, PyWB ● TimeTravel, MemGator ● CarbonDate, Warrick, Synchronicity ● Preserve Me! ● WARCreate, WAIL, Mink ● Browsertrix ● And many more...
  11. 11. Memento <http://example.com>; rel="original", <http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT", <http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT", <http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT", <http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
  12. 12. Archive X-Ray! ● How much of the Web is archived? ● Profiling various archive services ● Predicting what they contain ● Routing Memento aggregator queries
  13. 13. Memento Aggregator
  14. 14. Memento Aggregator
  15. 15. Memento Aggregator
  16. 16. Memento Aggregator
  17. 17. Memento Aggregator
  18. 18. Memento Aggregator
  19. 19. Long Tail of Archives
  20. 20. Archive Profile ● High-level summary of an archive ● Predicts presence of mementos ● Provides statistics about the holdings ● Small in size and publicly available ● Easy to update and partially patch ● Useful for Memento query routing and other things com,cnn)/ {"frequency": 40, "spread": 2} uk,co,bbc)/ {"frequency": 20, "spread": 1} com,usatoday)/ {"frequency": 5, "spread": 1}
  21. 21. Research Opportunities ● Information retrieval ● Information visualization ● Client and server side archiving ● Archiving dynamic content ● Distributed archiving ● Discovering alternative long-term archiving techniques ● Predicting “important” events on the Web and archiving them in a timely manner
  22. 22. Web Science and Digital Libraries Research Group ws-dl.cs.odu.edu ws-dl.blogspot.com @WebSciDL github.com/oduwsdl flickr.com/photos/124419986@N07
  23. 23. WSDL Research Group
  24. 24. WSDL Research Group
  25. 25. WSDL Research Group
  26. 26. WSDL Research Group
  27. 27. WSDL Research Group
  28. 28. Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA) salam@cs.odu.edu ibnesayeed@gmail.com @ibnesayeed www.cs.odu.edu/~salam
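
The Memento slide (slide 11) shows a TimeMap in link format: one original resource followed by its mementos, each with a datetime. As a rough sketch of how such a list could be consumed, the Python below parses the five links from that slide and picks the memento closest to a requested date. The function names `parse_timemap` and `closest_memento` are hypothetical, and the regex handles only the simple quoted-attribute form shown on the slide, not every variant of the link format.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# The five links shown on slide 11, one per line for readability.
TIMEMAP = '''<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
'''

# <uri> followed by zero or more ; name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return one dict per link, with the URI and its quoted attributes."""
    links = []
    for uri, attrs in LINK_RE.findall(text):
        links.append({"uri": uri, **dict(ATTR_RE.findall(attrs))})
    return links


def closest_memento(links, target):
    """Return the memento whose datetime is nearest to `target` (an aware datetime)."""
    mementos = [l for l in links if "memento" in l.get("rel", "").split()]
    return min(
        mementos,
        key=lambda l: abs(parsedate_to_datetime(l["datetime"]) - target),
    )


if __name__ == "__main__":
    links = parse_timemap(TIMEMAP)
    wanted = datetime(2002, 3, 1, tzinfo=timezone.utc)
    print(closest_memento(links, wanted)["uri"])
```

Picking the nearest capture client-side like this is only an illustration; the talk's TimeTravel and MemGator tools provide this kind of lookup as a service.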

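The Archive Profile slide (slide 20) lists keys such as `com,cnn)/` and says profiles are useful for Memento query routing. A minimal sketch of that idea follows, assuming a profile is a plain mapping from a reversed-host key to its statistics and that a query is routed to every archive whose profile contains the key. The archive names, the second profile, the `www.` stripping rule, and the functions `surt_host_key` and `route` are all assumptions made for illustration, not the format or logic used by the author's tools.

```python
from urllib.parse import urlsplit

PROFILES = {
    # Entries copied from slide 20; the archive name is hypothetical.
    "archive-a": {
        "com,cnn)/": {"frequency": 40, "spread": 2},
        "uk,co,bbc)/": {"frequency": 20, "spread": 1},
        "com,usatoday)/": {"frequency": 5, "spread": 1},
    },
    # Invented second profile, present only to show routing between archives.
    "archive-b": {
        "com,cnn)/": {"frequency": 3, "spread": 1},
    },
}


def surt_host_key(uri):
    """Build a host-level key like 'com,cnn)/' from a URI.

    Assumption: a leading 'www.' is dropped and the path is ignored.
    """
    host = urlsplit(uri).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")/"


def route(uri, profiles=PROFILES):
    """Return the names of archives whose profile contains the URI's key."""
    key = surt_host_key(uri)
    return sorted(name for name, profile in profiles.items() if key in profile)


if __name__ == "__main__":
    for uri in ("http://www.cnn.com/world", "http://bbc.co.uk/news", "http://example.org/"):
        print(uri, "->", route(uri))
```

An aggregator using such a lookup would skip archives that are unlikely to hold a URI instead of querying every archive for every request.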