Introducing Web Archiving and WSDL Research Group

My talk introducing Web Archiving and the Web Science and Digital Libraries Research Group to invited students from India attending a summer workshop at Old Dominion University, Norfolk, VA

Published in: Internet

  1. 1. Introducing Web Archiving and WSDL Research Group Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA)
  2. 2. About Me Sawood Alam Lexical Signature: Web, Digital Library, Web Archiving, Ruby on Rails, PHP, HTML, CSS, JavaScript, ExtJS, Go, Urdu, RTL, Docker, and Linux. ● BTech, Jamia Millia Islamia, India, 2008 ● MSc, Old Dominion University, USA, 2013 ● PhD, Old Dominion University, USA, Current
  3. 3. She Calls Me Dad!
  4. 4. Agenda ● Archiving and Web archiving ● Purpose and importance ● Scope of Web archiving ● Issues and challenges ● Tools and techniques ● Memento: Time Travel for the Web ● Archive X-Ray ● Research opportunities in Web archiving ● Our WSDL Research Group
  5. 5. What is an Archive? ● Accumulation of historical records ● Long-term storage and preservation ● Less frequently used ● Physical or digital
  6. 6. What is Web Archiving? ● Periodic snapshots of web pages ● Preserving important events on the Web ● Making archived content accessible
  7. 7. Why Do We Care About Archiving? Web content decays rapidly! ● To preserve history ● To tell a story ● For evidence ● For backup ● For personal satisfaction
  8. 8. Issues and Challenges ● Crawling ● Storage ● Retrieval ● Replay ● Accessibility ● Completeness ● Accuracy ● Credibility
  9. 9. Web Archiving Efforts ● Internet Archive ● Archive-It ● Wikipedia ● UK Web Archive ● Various national and non-profit archives ● Film, music and other multimedia archives ● Scholarly archives ● Personal archiving
  10. 10. Tools and Techniques ● Heritrix, PhantomJS, Wget, cURL ● OpenWayback, PyWB ● TimeTravel, MemGator ● CarbonDate, Warrick, Synchronicity ● Preserve Me! ● WARCreate, WAIL, Mink ● Browsertrix ● And many more...
  11. 11. Memento <http://example.com>; rel="original", <http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT", <http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; datetime="Thu, 28 Mar 2002 01:28:21 GMT", <http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; datetime="Sat, 03 Aug 2002 08:05:44 GMT", <http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; datetime="Sun, 13 Dec 2009 01:50:14 GMT",
  12. 12. Archive X-Ray! ● How much of the Web is archived? ● Profiling various archive services ● Predicting what they contain ● Routing Memento aggregator queries
  13. 13. MemGator https://github.com/oduwsdl/memgator
  14. 14. MemGator http://memgator.cs.odu.edu:1208/
  15. 15. Memento Aggregator
  16. 16. Memento Aggregator
  17. 17. Memento Aggregator
  18. 18. Memento Aggregator
  19. 19. Memento Aggregator
  20. 20. Memento Aggregator
  21. 21. From: Michael Nelson [mailto:mln@cs.odu.edu] Sent: Wednesday, December 02, 2015 12:33 PM To: Jones, Gina Cc: Rourke, Patrick; Grotke, Abigail Subject: Re: WebSciDL Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages. regards, Michael On Wed, 2 Dec 2015, Jones, Gina wrote: > Hi Michael, we have a slight configuration issue with the current OW > set up for our webarchives. I think, from looking at the logs, that > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback. > Do you know who is running this scraper? Itʼs not part of memento is it? > > Gina Jones > Web Archiving Team > Library of Congress From: Ilya Kreymer <ikreymer@gmail.com> Date: Wed, 2 Dec 2015 10:33:56 -0800 Subject: high traffic on oldweb! To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com> Hi Herbert, Sawood, Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator. Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;) Ilya Broadcasting is Bad
  22. 22. Memento Routing
  23. 23. Long Tail of Archives
  24. 24. While the IA was Down... $ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c Output (count, year): 2 2002; 1 2005; 1 2008; 6 2009; 67 2010; 17 2011; 64 2012; 108 2013; 108 2014; 186 2015; 51 2016
  25. 25. Archive Profile ● High-level summary of an archive ● Predicts presence of mementos ● Provides statistics about the holdings ● Small in size and publicly available ● Easy to update and partially patch ● Useful for Memento query routing and other things Sample profile entries: com,cnn)/ {"frequency": 40, "spread": 2}; uk,co,bbc)/ {"frequency": 20, "spread": 1}; com,usatoday)/ {"frequency": 5, "spread": 1}
  26. 26. Research Opportunities ● Information retrieval ● Information visualization ● Client and server side archiving ● Archiving dynamic content ● Distributed archiving ● Discovering alternative long-term archiving techniques ● Predicting "important" events on the Web and archiving them in a timely manner
  27. 27. Web Science and Digital Libraries Research Group ws-dl.cs.odu.edu ws-dl.blogspot.com @WebSciDL github.com/oduwsdl flickr.com/photos/124419986@N07
  28. 28. ODU Sailing Center
  29. 29. WSDL Feast
  30. 30. WSDL Whiteboards
  31. 31. WSDL Surprise
  32. 32. WSDL Ping Pong Table
  33. 33. WSDL Travels
  34. 34. Sawood Alam Department of Computer Science Old Dominion University Norfolk, Virginia - 23529 (USA) salam@cs.odu.edu ibnesayeed@gmail.com @ibnesayeed www.cs.odu.edu/~salam
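
The link-format listing on slide 11 and the per-year counts on slide 24 can be tied together with a small script. The following is a minimal sketch in Python, not MemGator's own code: it parses a link-format TimeMap with a simplified regular expression (assuming every attribute value is double-quoted, as on the slide) and counts mementos per year. The names `parse_timemap` and `mementos_per_year` are made up for this illustration, and the sample input is the listing from slide 11.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Sample TimeMap fragment copied from slide 11.
SAMPLE = (
    '<http://example.com>; rel="original", '
    '<http://web.archive.org/web/20020120142510/http://example.com/>; rel="memento"; '
    'datetime="Sun, 20 Jan 2002 14:25:10 GMT", '
    '<http://web.archive.org/web/20020328012821/http://www.example.com/>; rel="memento"; '
    'datetime="Thu, 28 Mar 2002 01:28:21 GMT", '
    '<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>; rel="memento"; '
    'datetime="Sat, 03 Aug 2002 08:05:44 GMT", '
    '<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>; rel="memento"; '
    'datetime="Sun, 13 Dec 2009 01:50:14 GMT",'
)

# Simplifying assumption: each link is <URI> followed by ;name="value" pairs.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return one dict per link, holding the URI and its attributes."""
    links = []
    for uri, params in LINK_RE.findall(text):
        links.append({"uri": uri, **dict(PARAM_RE.findall(params))})
    return links


def mementos_per_year(links):
    """Count links whose rel includes "memento", grouped by datetime year."""
    years = Counter()
    for link in links:
        if "memento" in link.get("rel", "").split() and "datetime" in link:
            years[parsedate_to_datetime(link["datetime"]).year] += 1
    return dict(sorted(years.items()))


if __name__ == "__main__":
    for year, count in mementos_per_year(parse_timemap(SAMPLE)).items():
        print(count, year)
```

Run against the slide 11 sample, this prints one count-and-year pair per line, in the same layout as the `uniq -c` output on slide 24.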

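
The archive profile entries on slide 25 suggest how a Memento aggregator could route a lookup only to archives likely to hold a URI instead of broadcasting to all of them. The following is a minimal sketch of that idea, under several assumptions: the key is a simplified host-only SURT-like form (reversed host labels followed by `)/`, with a leading `www.` dropped), each profile line is a key followed by a JSON object, and ranking by `frequency` is an illustrative choice. The archive names `archive_a` and `archive_b` and the second profile are hypothetical.

```python
import json
from urllib.parse import urlsplit

# Entries from slide 25, used here as the profile of a hypothetical archive.
PROFILE_A = """\
com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
"""

# Invented profile for a second hypothetical archive.
PROFILE_B = """\
uk,co,bbc)/ {"frequency": 3, "spread": 1}
"""


def surt_host_key(uri):
    """Build a simplified host-only key, e.g. www.bbc.co.uk -> uk,co,bbc)/"""
    host = urlsplit(uri).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")/"


def parse_profile(text):
    """Map each key to its JSON statistics, skipping blank and @-prefixed lines."""
    profile = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("@"):
            continue
        key, _, value = line.partition(" ")
        profile[key] = json.loads(value)
    return profile


def route(uri, profiles):
    """Return names of archives whose profile contains the URI's key,
    ordered by descending frequency (an assumption made for this sketch)."""
    key = surt_host_key(uri)
    hits = [(name, p[key]["frequency"]) for name, p in profiles.items() if key in p]
    return [name for name, _ in sorted(hits, key=lambda hit: -hit[1])]


PROFILES = {
    "archive_a": parse_profile(PROFILE_A),
    "archive_b": parse_profile(PROFILE_B),
}

if __name__ == "__main__":
    print(route("http://www.bbc.co.uk/news", PROFILES))
    print(route("http://example.org/", PROFILES))
```

A URI whose key appears in no profile yields an empty list, so the aggregator can skip all of those archives for that request.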