Site story wadl2013


  1. SiteStory: Archiving Done Differently
     WADL 2013, July 25-26, Indianapolis, IN
     Martin Klein, @mart1nkle1n, martinklein0815@gmail.com
     Justin F. Brunelle, jbrunelle@cs.odu.edu
     http://mementoweb.github.io/SiteStory/
  2. LANL SiteStory Team (team photos; caption: lead developer)
  3. Archiving - the traditional way
     • Actively crawl the web
     • For example, using Heritrix
  4. Archiving - the traditional way
     • Issues with crawler-based archiving:
       • Requests can be rejected (robots.txt, user-agent, IP)
       • The crawler can be deceived (geo-location, user-agent)
       • The crawler can be trapped ("crawl my calendar!")
       • Requires constant and massive bandwidth
       • Implied timing problem: when to crawl?
  5. Archiving - the traditional way
     Timing problem:
     • t1: R created
     • t2: browser visit 1
     • t3: crawler visit 1
     • t4: R update 1
     • t5: browser visit 2
     • t6: R update 2
     • Result: update 1 is viewed but not archived
  6. Archiving - the SiteStory way
     • Transactional Web archiving
     • The archive accepts the HTTP transaction between browser and server
  7. Archiving - the SiteStory way
     Timing problem:
     • t1: R created
     • t2: browser visit 1
     • t3: crawler visit 1
     • t4: R update 1
     • t5: browser visit 2
     • t6: R update 2
     • Result: update 1 is viewed and archived
  8. (figure only)
  9. Archiving - the SiteStory way
     • Challenges with transactional archiving:
       • The server has to cooperate in order to be archived
       • Data must be transferred to the archive, in batch mode or in real time
       • The archive must trust that the transmission is authentic
       • Resources from external servers have to be archived out-of-band
       • Deduplication challenges:
         • Alias: different URI, same response
         • Conneg (content negotiation): same URI, different response
       • Determining “significant” content change
  10. SiteStory Status Quo
      • mod_sitestory sends an HTTP PUT to the SiteStory Web Archive upon a client’s GET request
        • not for POST, DELETE, etc.
        • for HTTP response codes 200, 302, 303
      • Client IP can be included in the stored headers (configurable)
      • Header info is stored in BerkeleyDB, response bodies in the file system (FS)
      • Dedup via hash(body)
      • Offloading content as WARC files is possible (read: recommended)
  11. To Appear: TPDL 2013
      • SiteStory benchmark with ab and wget
        o ApacheBench (ab): server stress test tool
        o wget: Web page download (all content: -p)
      • Local network
      • Negligible difference between SiteStory and no SiteStory
  12. Re-executed on testbed (diagram labels: ws-dl-03.cs.odu.edu, x99, …, megalodon.lanl.gov, @AWS)
  13. Testing with ab (figure)
  14. Testing with wget (figure)
  15. Round Trip Time - Distributed (figure)
  16. Results
      • Distributed: higher variance
      • Increased delay due to the network
      • Performance with SiteStory on vs. off is still comparable
      • Viable solution without crippling the service
  17. SiteStory Installation
      • Apache module mod_sitestory
        • Option to exclude a list of directories
      • SiteStory Web Archive
        • Trivial for existing Tomcat environments
        • Tanuki Java wrapper (stand-alone) available
      • Configure, open ports, go! Or…
  18. SiteStory Testbed
      We have a SiteStory Web Archive installed for you!
      1. Install and configure mod_sitestory
      2. Send an email containing:
         1. Your contact info
         2. Web server IP address
         3. Server domain name used
      3. Happy SiteStory’ing!
      mailto: SiteStory-Testbed@googlegroups.com
      http://mementoweb.github.io/SiteStory/
  19. SiteStory: Archiving Done Differently
      Martin Klein, @mart1nkle1n, martinklein0815@gmail.com
      Justin F. Brunelle, jbrunelle@cs.odu.edu
      http://mementoweb.github.io/SiteStory/
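
Note on the timing problem (timeline t1 to t6): the small Python sketch below replays that timeline to show why a crawler that visits only at t3 misses update 1, while an archive fed by the browser transactions at t2 and t5 captures it. The function name and the integer timestamps are illustrative assumptions, not part of SiteStory.

```python
def versions_captured(updates, observations):
    """Return the set of resource versions visible at the observation times.

    updates: list of (time, version) pairs, sorted by time.
    observations: times at which some archiver sees the current version.
    """
    captured = set()
    for t in observations:
        current = None
        for update_time, version in updates:
            if update_time <= t:
                current = version  # latest version published at or before t
        if current is not None:
            captured.add(current)
    return captured


# Timeline from the slides: t1 R created, t4 R update 1, t6 R update 2.
updates = [(1, "created"), (4, "update1"), (6, "update2")]
browser_visits = [2, 5]   # t2, t5
crawler_visits = [3]      # t3

# Crawler-based archiving only sees what exists when the crawler visits.
crawler_archive = versions_captured(updates, crawler_visits)

# Transactional archiving sees every version that a browser was served.
transactional_archive = versions_captured(updates, browser_visits)
```

In this replay the crawler holds only the version created at t1, while the transactional archive also holds update 1. Update 2 is in neither archive because nothing requests R after t6, so a transactional archive only records versions that were actually served.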
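
Note on the "SiteStory Status Quo" slide: the archiving rule (GET only; response codes 200, 302, 303) and "dedup via hash(body)" can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SiteStory implementation: `ToyArchive`, its in-memory dictionaries (standing in for BerkeleyDB and the file system), and the choice of SHA-256 as the hash are all hypothetical; the slides do not say which hash is used or whether every transaction gets its own header record.

```python
import hashlib

# Response codes the slide lists as archived.
ARCHIVED_STATUS_CODES = {200, 302, 303}


def should_archive(method: str, status: int) -> bool:
    """Apply the slide's rule: only GET requests with status 200, 302 or 303."""
    return method.upper() == "GET" and status in ARCHIVED_STATUS_CODES


class ToyArchive:
    """Hypothetical in-memory stand-in for the header store and body store."""

    def __init__(self):
        self.bodies = {}    # digest -> body bytes (stands in for the file system)
        self.records = []   # (uri, timestamp, headers, digest) (stands in for BerkeleyDB)

    def put(self, uri, timestamp, headers, body: bytes) -> bool:
        """Record one transaction; return True if the body was not stored before."""
        digest = hashlib.sha256(body).hexdigest()  # assumption: SHA-256 as hash(body)
        is_new = digest not in self.bodies
        if is_new:
            self.bodies[digest] = body  # store each distinct body only once
        # Assumption: every transaction keeps its own header record.
        self.records.append((uri, timestamp, dict(headers), digest))
        return is_new
```

Because the key is the body hash rather than the URI, the "alias" case (different URI, same response) stores the body once, and the "conneg" case (same URI, different response) stores two bodies. A byte-level hash cannot tell a "significant" change from a trivial one, which is the remaining challenge named on the transactional-archiving slide.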
