WADL 2013
July 25-26th Indianapolis, IN
Martin Klein
@mart1nkle1n
martinklein0815@gmail.com
SiteStory
Archiving Done Differently
Site story wadl2013

  1. SiteStory: Archiving Done Differently
     Martin Klein, @mart1nkle1n, martinklein0815@gmail.com
     Justin F. Brunelle, jbrunelle@cs.odu.edu
     http://mementoweb.github.io/SiteStory/
  2. LANL SiteStory Teamlead developer
  3. Archiving - the traditional way
     • Actively crawl the web
     • For example, using Heritrix
  4. Archiving - the traditional way
     • Issues with crawler-based archiving:
       • Request can be rejected (robots.txt, user-agent, IP)
       • Can be deceived (geo-location, user-agent)
       • Can be trapped (crawl my calendar!)
       • Requires constant and massive bandwidth
       • Implied timing problem: when to crawl?
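The first issue on this slide (rejection via robots.txt and user-agent) can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `archive-crawler` agent name, and the example.org URLs are made up for illustration; they do not come from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is shut out entirely,
# everyone else is only kept out of the calendar.
ROBOTS_TXT = """User-agent: archive-crawler
Disallow: /

User-agent: *
Disallow: /calendar/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# The named crawler is rejected for every path.
print(rules.can_fetch("archive-crawler", "http://example.org/index.html"))
# Other agents may fetch the page, but not the calendar.
print(rules.can_fetch("SomeBrowser", "http://example.org/index.html"))
print(rules.can_fetch("SomeBrowser", "http://example.org/calendar/2013/07"))
```

A polite crawler that honors these rules never sees the page at all, while a browser request for the same page is served normally.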
  5. Archiving - the traditional way
     Timing problem:
     • Update 1 viewed but not archived
     Timeline: t1 R created; t2 browser visit1; t3 crawler visit1; t4 R update1; t5 browser visit2; t6 R update2
  6. Archiving - the SiteStory way
     • Transactional Web archiving
     • Archive accepts HTTP transaction between browser and server
  7. Archiving - the traditional way
     Timing problem:
     • Update 1 viewed and archived
     Timeline: t1 R created; t2 browser visit1; t3 crawler visit1; t4 R update1; t5 browser visit2; t6 R update2
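The timeline on the two timing slides can be replayed in a few lines. This is a minimal sketch, not SiteStory code: the event encoding and the `replay` helper are assumptions, and the transactional archive is modeled simply as "store whatever a browser was served".

```python
# Timeline from the slides: (time step, actor, new version or None).
EVENTS = [
    (1, "server", "created"),
    (2, "browser", None),
    (3, "crawler", None),
    (4, "server", "update1"),
    (5, "browser", None),
    (6, "server", "update2"),
]


def replay(events):
    """Return the versions that were viewed, crawled, and transactionally archived."""
    current = None
    viewed, crawled, transactional = [], [], []
    for _, actor, version in sorted(events, key=lambda event: event[0]):
        if actor == "server":
            current = version  # resource R changes on the server
        elif actor == "crawler":
            crawled.append(current)  # crawler stores what it sees at crawl time
        elif actor == "browser":
            viewed.append(current)
            if current not in transactional:
                transactional.append(current)  # archived as a side effect of the visit
    return viewed, crawled, transactional


viewed, crawled, transactional = replay(EVENTS)
print("viewed:       ", viewed)
print("crawled:      ", crawled)
print("transactional:", transactional)
```

In this replay the crawler holds only the version from t3, so update1 is viewed at t5 but never crawled, whereas the transactional model holds every version a browser was actually served. update2 is not captured in either model because no one requests it.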
  8. (Slide contains no extractable text.)
  9. Archiving - the SiteStory way
     • Challenges with transactional archiving:
       • To be archived, the server has to cooperate
       • Transfer data to archive, batch mode or real-time
       • Archive must trust transmission to be authentic
       • Resources from external servers have to be archived out-of-band
       • Deduplication challenges
         • Alias: different URI, same response
         • Conneg: same URI, different response
         • Determine “significant” content change
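The alias and conneg cases from this slide can be told apart by comparing body digests per URI. The sketch below is illustrative only: the `classify` helper, the SHA-256 choice, and the example.org observations are assumptions, and it does not attempt the harder question of what counts as a "significant" change.

```python
import hashlib
from collections import defaultdict


def digest(body):
    """Hash a response body (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(body).hexdigest()


def classify(observations):
    """Split (uri, body) observations into alias groups and negotiated URIs."""
    uris_by_digest = defaultdict(set)
    digests_by_uri = defaultdict(set)
    for uri, body in observations:
        d = digest(body)
        uris_by_digest[d].add(uri)
        digests_by_uri[uri].add(d)
    # Alias: several URIs returned the same bytes.
    aliases = [sorted(uris) for uris in uris_by_digest.values() if len(uris) > 1]
    # Conneg (or a real update): one URI returned different bytes.
    negotiated = sorted(uri for uri, ds in digests_by_uri.items() if len(ds) > 1)
    return aliases, negotiated


# Hypothetical observations of browser/server transactions.
OBSERVATIONS = [
    ("http://example.org/", b"<h1>Home</h1>"),
    ("http://example.org/index.html", b"<h1>Home</h1>"),
    ("http://example.org/about", b"<h1>About</h1>"),
    ("http://example.org/about", b"<h1>Ueber uns</h1>"),
]

aliases, negotiated = classify(OBSERVATIONS)
print("aliases:   ", aliases)
print("negotiated:", negotiated)
```

A digest alone cannot distinguish content negotiation from a genuine update of the same URI; an archive would also need the request headers that drove the negotiation to separate the two.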
  10. SiteStory Status Quo
      • mod_sitestory sends HTTP PUT to SiteStory Web Archive upon client’s GET request
        • not for POST, DELETE, etc.
        • for HTTP response codes 200, 302, 303
      • Client IP can be included in stored headers (configurable)
      • Header info stored in BerkeleyDB, response body in the file system (FS)
      • Dedup via hash(body)
      • Offloading content as WARC files possible (read: recommended)
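The rules on this slide can be summarized as a toy model. This is a sketch under stated assumptions, not the mod_sitestory or SiteStory Web Archive implementation: `should_archive` and `ToyArchive` are hypothetical names, SHA-256 stands in for the unspecified hash, and plain Python containers stand in for BerkeleyDB and the file system.

```python
import hashlib

# From the slide: only GET requests with these response codes are archived.
ARCHIVED_STATUS_CODES = {200, 302, 303}


def should_archive(method, status):
    """Decide whether a transaction would be pushed to the archive."""
    return method == "GET" and status in ARCHIVED_STATUS_CODES


class ToyArchive:
    """In-memory stand-in for the archive's header store and body store."""

    def __init__(self):
        self.bodies = {}    # digest -> body bytes (stands in for the file system)
        self.headers = []   # (uri, timestamp, digest) (stands in for BerkeleyDB)

    def put(self, uri, timestamp, body):
        """Record a transaction; return True if the body was not stored before."""
        body_hash = hashlib.sha256(body).hexdigest()  # hash choice is an assumption
        is_new = body_hash not in self.bodies
        if is_new:
            self.bodies[body_hash] = body
        self.headers.append((uri, timestamp, body_hash))
        return is_new


archive = ToyArchive()
first = archive.put("http://example.org/", "2013-07-25T10:00:00Z", b"<h1>Home</h1>")
second = archive.put("http://example.org/", "2013-07-25T10:05:00Z", b"<h1>Home</h1>")
print(first, second, len(archive.bodies), len(archive.headers))
```

Whether the real archive records a header entry for every repeated transaction is not stated on the slide; the toy model does so only to make the deduplication of bodies visible.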
  11. To Appear: TPDL 2013
      • SiteStory benchmark with ab & wget
        o ApacheBench (ab): server stress test tool
        o wget: Web page download
          - All content: -p
      • Local network
      • Negligible difference between SiteStory and No SiteStory
  12. Re-executed on testbed
      (Diagram labels: ws-dl-03.cs.odu.edu, x99, megalodon.lanl.gov, @AWS)
  13. Testing with ab
  14. Testing with wget
  15. Round Trip Time -- Distributed
  16. Results
      • Distributed: Higher variance
      • Increased delay due to network
      • SiteStory on vs. off: performance still comparable
      • Viable solution without crippling service
  17. SiteStory Installation
      • Apache module mod_sitestory
        • Option to exclude a list of directories
      • SiteStory Web Archive
        • Trivial for existing Tomcat environments
        • Tanuki Java wrapper (stand-alone) available
      • Configure, open ports, go! Or…
  18. SiteStoryTestbed
      We have a SiteStory Web Archive installed for you!
      1. Install and configure mod_sitestory
      2. Send an email containing:
         1. Your contact info
         2. Web server IP address
         3. Server domain name used
      3. Happy Sitestory’ing!
      mailto: SiteStory-Testbed@googlegroups.com
      http://mementoweb.github.io/SiteStory/
  19. SiteStory: Archiving Done Differently
      Martin Klein, @mart1nkle1n, martinklein0815@gmail.com
      Justin F. Brunelle, jbrunelle@cs.odu.edu
      http://mementoweb.github.io/SiteStory/