TPDL 2015 - Profiling Web Archives

Web Archive Profiling project funded by IIPC, presented at TPDL 2015 in Poznań, Poland.

Published in: Internet

  1. 1. Profiling Web Archives. Sawood Alam and Michael L. Nelson (Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529); Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar (Los Alamos National Laboratory, Los Alamos, NM); David S. H. Rosenthal (Stanford University Libraries, Stanford, CA). Supported in part by the International Internet Preservation Consortium (IIPC).
  2. 2. Memento Aggregator
  3. 3. Memento Aggregator
  4. 4. Memento Aggregator
  5. 5. Memento Aggregator
  6. 6. Memento Aggregator
  7. 7. Memento Aggregator
  8. 8. Long Tail of Archives
  9. 9. Long Tail of Archives ● 400B+ web pages at IA do not cover everything ● Top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013) ● Targeted crawls ● Special focus archives ● Restricted resources ● Private archives
  10. 10. Archive Profile ● High-level summary of an archive ● Predicts presence of mementos of a URI-R in an archive ● Provides various statistics about the holdings ● Small in size ● Publicly available ● Easy to update and partially patch ● Useful for Memento query routing and other things
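
The routing use of a profile can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each archive's profile is reduced to a set of URI-Keys, and the archive names, the `route` function, and the sample keys are all hypothetical.

```python
def route(uri_keys, profiles):
    """Return the archives whose profile contains any candidate key of a URI.

    uri_keys: candidate keys for one URI-R (assumed already generated).
    profiles: mapping of archive name -> set of URI-Keys (hypothetical shape).
    """
    return [name for name, keys in profiles.items()
            if any(key in keys for key in uri_keys)]


# Hypothetical profiles; the key syntax follows the examples in the slides.
profiles = {
    "archive_a": {"uk,co)/", "com)/"},
    "archive_b": {"uk,co)/"},
    "archive_c": {"org)/"},
}

# Only archives that might hold mementos of the URI are queried.
selected = route(["com)/"], profiles)
```

An aggregator using this kind of lookup would skip `archive_b` and `archive_c` for a `.com` URI, at the cost of some false positives when the profile keys are coarse.
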
  11. 11. Available Profiling Resources ● Client request ● Archive response ● Archive index (CDX files)
  12. 12. A Client Request
  13. 13. An Archive Response
  14. 14. A CDX Snippet
  15. 15. Profiling Strategies ● Complete URI-R Profiling (1 URI-R = 1 Profile Key) ○ bbc.co.uk/images/logo.png?w=90 ○ cnn.com/2014/03/15/?id=128734 ● TLD-only Profiling (1 TLD = 1 Profile Key) ○ com)/ ○ uk)/ ● Middle Ground ○ uk,co)/ ○ uk,co,bbc)/images ○ uk,co,bbc)/0/2/1 ○ com,cnn)/ 201309 ar
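
The host/path style of key shown above (for example `uk,co)/` or `uk,co,bbc)/images`) can be approximated with a short sketch. The function name `uri_key`, its parameters, and the simplified canonicalization (lowercasing, dropping a leading `www.`, ignoring the query string) are assumptions for illustration and may differ from the policies used in the talk.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_host=None, max_path=None):
    """Build a SURT-like key with limited host and path segments.

    max_host / max_path: maximum segments to keep (None keeps all).
    Assumption: when the host is truncated, the path is dropped entirely.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    host_segs = host.split(".")[::-1]  # reverse: bbc.co.uk -> uk, co, bbc
    path_segs = [seg for seg in parts.path.split("/") if seg]
    if max_host is not None and len(host_segs) > max_host:
        host_segs = host_segs[:max_host]
        path_segs = []
    elif max_path is not None:
        path_segs = path_segs[:max_path]
    return ",".join(host_segs) + ")/" + "/".join(path_segs)


tld_key = uri_key("cnn.com/2014/03/15/?id=128734", max_host=1, max_path=0)
mid_key = uri_key("bbc.co.uk/images/logo.png?w=90", max_host=3, max_path=1)
```

Smaller limits yield fewer distinct keys (a smaller profile) but less routing precision, which is the tradeoff the strategies above span.
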
  16. 16. Frequency Measurements
  17. 17. CDXJ Serialization
  18. 18. URI-Key Generation
  19. 19. Profile Merging Base profile New profile Merged profile
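
One plausible way to merge a base profile with a new one is to add the per-key frequencies. The sketch below assumes a profile is a mapping from URI-Key to a single count and that the two profiles were built from disjoint sets of captures (otherwise counts would be double counted). The function name and the numbers are made up for illustration.

```python
from collections import Counter


def merge_profiles(base, new):
    """Merge two key -> count profiles by summing counts per key."""
    merged = Counter(base)
    merged.update(new)  # Counter.update adds counts instead of replacing
    return dict(merged)


# Illustrative counts only, not data from the talk.
base = {"uk,co)/": 120, "com)/": 40}
new = {"uk,co)/": 5, "org)/": 2}
merged = merge_profiles(base, new)
```
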
  20. 20. Dataset ● Three archives ● Four sample query sets ● 23 profiles for each archive and sample set
  21. 21. Archives
      Archive     URI-Rs  URI-Ms  Size
      Archive-It  1.9B    5.3B    1.8TB
      UKWA        0.7B    1.7B    0.5TB
      Stanford    12M     25M     8.3GB
  22. 22. Sample Query Sets (sample size: 1M URIs each)
      Sample        In Archive-It  In UKWA  In Stanford
      DMOZ          4.097%         1.912%   0.034%
      MementoProxy  4.182%         0.179%   0.046%
      IAWayback     3.716%         0.231%   0.039%
      UKWayback     0.108%         0.034%   0.002%
  23. 23. Evaluation ● Relate CDX Size, URI-M, URI-R, and URI-Key ● Analyze profile growth ● Estimate Relative Cost ● Evaluate Routing Precision vs. Relative Cost
  24. 24. CDX Size vs URI-M (UKWA 10 Years) Alpha: 175 bytes per CDX line
  25. 25. URI-M vs URI-R (UKWA 10 Years) Gamma: 2.46, K: 2.686, Beta: 0.911
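
The slide gives only the fitted constants, not the model. The sketch below assumes K and Beta are the parameters of a Heaps'-law-style fit, URI-R ≈ K × (URI-M)^Beta, and that Gamma is the average number of URI-Ms per URI-R. Both readings are assumptions, and `estimate_uri_r` is a hypothetical helper.

```python
def estimate_uri_r(uri_m, k=2.686, beta=0.911):
    """Estimate unique URI-Rs from a URI-M count.

    Assumes a Heaps'-law form: URI-R = K * URI-M ** Beta,
    with K and Beta taken from the slide (UKWA, 10 years).
    """
    return k * uri_m ** beta


# UKWA is listed with 1.7B URI-Ms in the Archives table.
estimate = estimate_uri_r(1.7e9)
```

Under these assumptions the estimate for 1.7B URI-Ms comes out close to the 0.7B URI-Rs listed for UKWA, which suggests the reading is at least consistent with the reported figures.
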
  26. 26. Space Cost (UKWA 7 Years) Phi: 8.5e-07 -- 0.70583
  27. 27. Time Cost (UKWA 7 Years) Tau: 5.7e-05 -- 6.2e-05; CDX: 45GB, URI-Ms: 181M, URI-Rs: 96M, Time: 3 hours
  28. 28. Resource Requirement
  29. 29. Archive-It
  30. 30. UKWA
  31. 31. Stanford
  32. 32. Cost vs Precision
      Group                                 Cost                 Precision
      G1 (H1P0/TLD)                         Bound by # of TLDs   < 0.05
      G2 (H3P0, DDom, DSub, DPth, DQry)     < 0.01               ≈ 2 * G1
      G3 (DIni)                             ≈ 2 * G2             ≈ (3--4) * G1
      G4 (HxP1)                             ≈ 5 * G3             ≈ (5--7) * G1
      G5 (Higher HmPn)                      0.4 -- 0.7           Not Explored
      G6 (URIR)                             1.0                  1.0
  33. 33. Future Work ● Generating sample URI sets ● Profiling via sampling ● Language profiles ● Evaluation of combination profiles such as URI-Key along with Datetime ● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
  34. 34. Conclusions ● Generated profiles with different policies for two archives ● Examined cost-precision tradeoffs of various policies ● Related CDX Size, URI-M, URI-R, and URI-Key ● Gained up to 22% routing precision with <5% relative cost without any false negatives ● Code @ GitHub:/oduwsdl/archive_profiler
