
JCDL 2016 Doctoral Consortium - Web Archive Profiling


Web Archive Profiling presentation at JCDL 2016 Doctoral Consortium

Published in: Science

  1. 1. Web Archive Profiling For Efficient Memento Aggregation
      Sawood Alam
      Old Dominion University, Norfolk, Virginia - 23529
      Advisor: Michael L. Nelson
      Doctoral Consortium, JCDL’16, June 19, 2016
      Supported in part by the International Internet Preservation Consortium (IIPC)
  2. 2. Motivation
  3. 3. Motivation
  4. 4. Motivation
  5. 5. Memento Aggregator
  6. 6. Memento Aggregator
  7. 7. Memento Aggregator
  8. 8. Memento Aggregator
  9. 9. Memento Aggregator
  10. 10. Memento Aggregator
  11. 11. From: Michael Nelson [mailto:mln@cs.odu.edu]
      Sent: Wednesday, December 02, 2015 12:33 PM
      To: Jones, Gina
      Cc: Rourke, Patrick; Grotke, Abigail
      Subject: Re: WebSciDL

      Hi Gina,

      I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/

      can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.

      regards,
      Michael

      On Wed, 2 Dec 2015, Jones, Gina wrote:
      > Hi Michael, we have a slight configuration issue with the current OW
      > set up for our webarchives. I think, from looking at the logs, that
      > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
      > Do you know who is running this scraper? It's not part of memento is it?
      >
      > Gina Jones
      > Web Archiving Team
      > Library of Congress

      From: Ilya Kreymer <ikreymer@gmail.com>
      Date: Wed, 2 Dec 2015 10:33:56 -0800
      Subject: high traffic on oldweb!
      To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com>

      Hi Herbert, Sawood,

      Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.

      Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)

      Ilya

      Broadcasting is Bad
  12. 12. Availability and Overlap
      ● Archives are sparse
      ● Broadcasting is wasteful; both clients and archives suffer
  13. 13. Memento Routing
  14. 14. Routing Pros & Cons
      ● Pros
        ○ Minimizes traffic and resource consumption
        ○ Improves throughput
      ● Cons
        ○ Upfront profile maintenance cost
        ○ May miss Mementos (false negatives)
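The trade-off on the routing slide can be illustrated with a minimal sketch. It assumes a toy profile that maps each archive to a set of host names it is believed to hold; the archive names, the `PROFILES` table, and the `route` and `broadcast` helpers are hypothetical and are not taken from MemGator or from the talk.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: archive name -> host names the archive is believed to hold.
PROFILES = {
    "archive-a": {"bbc.co.uk", "cnn.com"},
    "archive-b": {"bbc.co.uk"},
    "archive-c": {"example.org"},
}


def host_key(uri):
    """Reduce a URI-R to its lower-cased host name (an assumed, very coarse key)."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit only finds the host when a scheme is present
    return (urlsplit(uri).hostname or "").lower()


def broadcast(profiles):
    """Broadcasting: every archive is queried for every lookup."""
    return sorted(profiles)


def route(uri, profiles):
    """Routing: query only the archives whose profile contains the key."""
    key = host_key(uri)
    return sorted(name for name, keys in profiles.items() if key in keys)


print(broadcast(PROFILES))                 # all three archives
print(route("cnn.com/2014/03/15/?id=128734", PROFILES))
print(route("http://unknown.test/", PROFILES))
```

With these made-up profiles, the second call contacts one archive instead of three. The third call contacts none, so if any archive did hold that URI-R without its profile saying so, the memento would be missed; this is the false-negative risk listed under Cons.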
  15. 15. Why Do Small Archives Matter?
  16. 16. Why Do Small Archives Matter?
      ● 400B+ web pages at IA do not cover everything
      ● Top three archives after IA produce full TimeMap 52% of the time (AlSum et al., TPDL 2013)
      ● Targeted crawls
      ● Special focus archives
      ● Restricted resources
      ● Private archives
      ● Censorship
  17. 17. While the IA was Down...
      $ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
          2 2002
          1 2005
          1 2008
          6 2009
         67 2010
         17 2011
         64 2012
        108 2013
        108 2014
        186 2015
         51 2016
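The per-year tally produced by the pipeline on the "While the IA was Down..." slide can also be sketched in Python. This is an illustration only: it assumes each memento line starts with a 14-digit datetime and that metadata lines start with `@`, as the `grep -v "^@"` step suggests. The sample lines and the `mementos_per_year` helper are hypothetical.

```python
from collections import Counter


def mementos_per_year(cdxj_lines):
    """Count memento lines per year, skipping blank and '@' metadata lines."""
    counts = Counter()
    for line in cdxj_lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # same filter as grep -v "^@"
        counts[line[:4]] += 1  # same slice as cut -c-4
    return dict(sorted(counts.items()))


# Hypothetical input that only mimics the leading datetime of a CDXJ TimeMap line.
sample = [
    '@meta {"note": "hypothetical header line"}',
    '20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}',
    '20021129032112 {"uri": "http://archive.example/20021129032112/http://example.org/"}',
    '20050401000000 {"uri": "http://archive.example/20050401000000/http://example.org/"}',
]

print(mementos_per_year(sample))
```

Unlike `uniq -c`, which only merges adjacent equal lines, the `Counter` gives per-year totals even when the input is not in chronological order.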
  18. 18. Research Questions
      ● What do individual web archives hold?
      ● How much do we need to know about an archive’s holdings?
      ● What is the optimal level of summarization for better accuracy and increased freshness?
      ● What are the various ways to learn about archives’ holdings?
      ● How can archives’ profiles be stored and updated so that they scale efficiently?
  19. 19. Archive Profile
      ● High-level summary of an archive
      ● Predicts presence of mementos of a URI-R in an archive
      ● Provides various statistics about the holdings
      ● Small in size
      ● Publicly available
      ● Easy to update and partially patch
      ● Useful for Memento query routing and other applications
  20. 20. Profiling Policies
      ● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
        ○ bbc.co.uk/images/logo.png?w=90
        ○ cnn.com/2014/03/15/?id=128734
      ● TLD-only Profiling (1 TLD = 1 Profile Key)
        ○ com)/
        ○ uk)/
      ● Middle Ground
        ○ uk,co)/
        ○ uk,co,bbc)/images
        ○ uk,co,bbc)/0/2/1
        ○ com,cnn)/ 201309 ar
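The key shapes on the Profiling Policies slide (such as `uk)/`, `uk,co)/`, and `uk,co,bbc)/images`) can be reproduced with a small sketch. It assumes a key is built by reversing the host labels and keeping a fixed number of host and path segments; the `profile_key` function and its parameters are hypothetical and are not the author's implementation. It does not cover the last two Middle Ground examples on the slide.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments=None, path_segments=0):
    """Build a reversed-host key, keeping the given number of host and path segments.

    host_segments=None keeps the whole host. The query string is always dropped.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit only finds the host when a scheme is present
    parts = urlsplit(uri)
    labels = (parts.hostname or "").lower().split(".")[::-1]  # bbc.co.uk -> uk, co, bbc
    if host_segments is not None:
        labels = labels[:host_segments]
    path = [segment for segment in parts.path.split("/") if segment][:path_segments]
    return ",".join(labels) + ")/" + "/".join(path)


uri = "bbc.co.uk/images/logo.png?w=90"
print(profile_key(uri, host_segments=1))   # TLD-only key
print(profile_key(uri, host_segments=2))   # two host segments, no path
print(profile_key(uri, path_segments=1))   # full host plus first path segment
```

Fewer segments mean fewer distinct keys and a smaller profile, at the price of a coarser prediction; keeping every segment approaches the complete URI-R policy.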
  21. 21. Available Profiling Resources Client request Archive Response CDX Records
  22. 22. Profiling Strategies
      ● CDX Profiling
      ● Fulltext Search Profiling
      ● Sample URI Profiling
      ● Response Cache Profiling
  23. 23. Sample Profile
  24. 24. Probability Rank
  25. 25. Archives
      Archive      URI-Rs   URI-Ms   Index Size
      Archive-It   1.9B     5.3B     1.8TB
      UKWA         0.7B     1.7B     0.5TB
      Stanford     12M      25M      8.3GB
  26. 26. Sample Query Sets
      Sample (1M URIs Each)   In Archive-It   In UKWA   In Stanford   Union {AIT, UK, SU}
      DMOZ                    4.097%          3.594%    0.034%        7.575%
      MementoProxy            4.182%          0.408%    0.046%        4.527%
      IAWayback               3.716%          0.519%    0.039%        4.165%
      UKWayback               0.108%          0.034%    0.002%        0.134%
  27. 27. Evaluation
      ● Generate profiles with 23 policies
      ● Relate CDX Size, URI-M, URI-R, and URI-Key
      ● Analyze profile growth
      ● Estimate Relative Cost
      ● Evaluate Routing Efficiency
  28. 28. Resource Requirement
  29. 29. CDX Size vs URI-M (UKWA 10 Years) Alpha: 175 bytes per CDX line
  30. 30. URI-M vs URI-R (UKWA 10 Years) Gamma: 2.46 K : 2.686 Beta: 0.911
  31. 31. Space Cost (UKWA 7 Years) Phi: 8.5e-07 -- 0.70583
  32. 32. Time Cost (UKWA 7 Years)
      Tau: 5.7e-05 -- 6.2e-05
      CDX: 45GB
      URI-Ms: 181M
      URI-Rs: 96M
      Time: 3 hours
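The figures on the Time Cost slide fit together if Tau is read as profiling time in seconds per URI-M. The slide does not state the unit, so that reading is an assumption made here. Under it, a short worked calculation puts the 181M URI-Ms at about three hours; the `estimated_hours` helper is hypothetical.

```python
def estimated_hours(uri_m_count, seconds_per_uri_m):
    """Estimate profiling time, assuming cost grows linearly with the URI-M count."""
    return uri_m_count * seconds_per_uri_m / 3600


URI_MS = 181_000_000  # URI-M count from the slide

# Assumed interpretation: Tau is seconds per URI-M (unit not given on the slide).
low = estimated_hours(URI_MS, 5.7e-05)
high = estimated_hours(URI_MS, 6.2e-05)

print(f"{low:.2f} to {high:.2f} hours")
```

Both ends of the Tau range land close to the "Time: 3 hours" reported on the slide, which is what makes the seconds-per-URI-M reading plausible.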
  33. 33. Archive-It
  34. 34. Fulltext Search Cost
  35. 35. Partial Knowledge
  36. 36. Cost vs Accuracy
      Group   Policies                        Cost                  Accuracy
      G1      H1P0/TLD                        Bound by # of TLDs    ≈ 0.01
      G2      H3P0, DDom, DSub, DPth, DQry    < 0.01                ≈ 0.78
      G3      DIni                            ≈ 2 * G2              ≈ 0.88
      G4      HxP1                            ≈ 5 * G3              ≈ 0.94
      G5      Higher HmPn                     0.4 -- 0.7            Not Explored
      G6      URIR                            1.0                   1.0
  37. 37. Work Plan
      ✓ Baseline Profiling Through CDX Files
      ✓ Profile Serialization
      ✓ Fulltext Search Profiling
      ✓ Sample URI Dataset
      ➢ Instrumenting Memento Aggregator
      ➢ Multidimensional Profiling
  38. 38. Publications
      TPDL15    Web Archive Profiling Through CDX Summarization
      TCDL15    Profiling Web Archives - For Efficient Memento Query Routing
      IJDL16    Web Archive Profiling Through CDX Summarization
      JCDL16    Poster: MemGator - A Portable Concurrent Memento Aggregator
      TPDL16    Web Archive Profiling Through Fulltext Search
      RFC       Object Resource Stream (ORS) and CDX-JSON (CDXJ) Formats
      C4LJ      MemGator - A Portable Concurrent Memento Aggregator Architecture
      JCDL17    Scalable, Maintainable, and Extensible Web Archive Profile Serialization for Efficient Lookup
      JCDL17    URI, Time, and Language Profiling from Live Archives via URI Sampling and Fulltext Search
      SIGIR17   Memento Aggregator Routing Based on Probability Distribution of Memento Availability with Archive Profiles
      IJDL17    Archive X-Ray - Web Archive Profiling for Efficient Memento Aggregation
  39. 39. Future Work
      ● Language profiles
      ● Evaluation of combination profiles, such as URI-Key along with Datetime
      ● Use archive profiles to generate a rank-ordered list of archives
      ● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
  40. 40. Conclusions
      ● Generated profiles with different policies for three archives
      ● Examined cost-precision tradeoffs of various policies
      ● Related CDX Size, URI-M, URI-R, and URI-Key
      ● Gained up to 80% routing accuracy with <1% relative cost while maintaining 0.9 recall
