Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the International Internet Preservation Consortium (IIPC)
Memento Aggregator
Long Tail of Archives
● 400B+ web pages at the Internet Archive (IA) do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other applications
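As a rough illustration of profile-based query routing, the sketch below has an aggregator send a lookup only to archives whose profile contains the key of the requested URI-R. The archive names, the in-memory `profiles` mapping, and the helper names `tld_key` and `route` are hypothetical, and a TLD-only key is used purely for brevity.

```python
from urllib.parse import urlsplit


def tld_key(uri):
    """Return a TLD-only key such as 'uk)/' (illustrative helper, not from the slides)."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1] + ")/"


def route(uri, profiles):
    """List the archives whose profile contains the key for this URI-R."""
    key = tld_key(uri)
    return sorted(name for name, keys in profiles.items() if key in keys)


# Hypothetical profiles: archive name -> set of keys it is known to hold.
profiles = {
    "archive-a": {"com)/", "uk)/"},
    "archive-b": {"uk)/"},
}

candidates = route("http://bbc.co.uk/news", profiles)
```

An archive is skipped only when its profile lacks the key, so a coarse key can cause unnecessary requests, but a complete profile should not cause missed archives.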
Available Profiling Resources
● Client request
● Archive response
● Archive index (CDX files)
A Client Request
An Archive Response
A CDX Snippet
Profiling Strategies
● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
○ bbc.co.uk/images/logo.png?w=90
○ cnn.com/2014/03/15/?id=128734
● TLD-only Profiling (1 TLD = 1 Profile Key)
○ com)/
○ uk)/
● Middle Ground
○ uk,co)/
○ uk,co,bbc)/images
○ uk,co,bbc)/0/2/1
○ com,cnn)/ 201309 ar
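The middle-ground keys above can be approximated by reversing the host into comma-separated segments and keeping only a limited number of host and path segments. The following is a minimal sketch under that assumption; the function name `uri_key` and its parameters `max_host` and `max_path` are hypothetical, and real canonicalization involves more normalization than is shown here.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_host=None, max_path=None):
    """Build a reversed-host key, keeping at most max_host host segments
    and max_path path segments (None keeps all segments)."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = parts.hostname or ""
    host_segs = list(reversed(host.split("."))) if host else []
    path_segs = [seg for seg in parts.path.split("/") if seg]
    if max_host is not None:
        host_segs = host_segs[:max_host]
    if max_path is not None:
        path_segs = path_segs[:max_path]
    # The query string is dropped in this simplified sketch.
    return ",".join(host_segs) + ")/" + "/".join(path_segs)


tld_only = uri_key("bbc.co.uk/images/logo.png?w=90", max_host=1, max_path=0)
middle = uri_key("bbc.co.uk/images/logo.png?w=90", max_host=3, max_path=1)
```

Fewer retained segments give a smaller profile with coarser predictions; more segments move toward complete URI-R profiling.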
Frequency Measurements
CDXJ Serialization
URI-Key Generation
Profile Merging
Base profile
New profile
Merged profile
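One way to picture merging is as adding the per-key counts of a new profile into a base profile. The sketch below assumes each profile is a mapping from URI-Key to a frequency count, and that the new profile describes holdings not already counted in the base; both the data layout and the name `merge_profiles` are illustrative assumptions, not the authors' stated procedure.

```python
from collections import Counter


def merge_profiles(base, new):
    """Add the counts of `new` into `base`, key by key, and return a new dict."""
    merged = Counter(base)
    merged.update(new)  # Counter.update adds counts rather than replacing them
    return dict(merged)


# Hypothetical counts, for illustration only.
base_profile = {"uk,co)/": 10, "com)/": 5}
new_profile = {"com)/": 2, "org)/": 1}
merged_profile = merge_profiles(base_profile, new_profile)
```

Because only the affected keys change, a small new profile can patch a large base profile without regenerating it.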
Dataset
● Three archives
● Four sample query sets
● 23 profiles for each archive and sample set
Archives
Archive      URI-Rs   URI-Ms   Size
Archive-It   1.9B     5.3B     1.8TB
UKWA         0.7B     1.7B     0.5TB
Stanford     12M      25M      8.3GB
Sample Query Sets
Sample         In Archive-It   In UKWA   In Stanford
DMOZ           4.097%          1.912%    0.034%
MementoProxy   4.182%          0.179%    0.046%
IAWayback      3.716%          0.231%    0.039%
UKWayback      0.108%          0.034%    0.002%
Sample size: 1M URIs each
Evaluation
● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Precision vs. Relative Cost
CDX Size vs URI-M (UKWA 10 Years)
Alpha: 175 bytes per CDX line
URI-M vs URI-R (UKWA 10 Years)
Gamma: 2.46
K : 2.686
Beta: 0.911
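The slides give only the fitted constants, not the formulas. As a sketch, suppose Alpha is a linear factor (CDX bytes per URI-M) and that K and Beta parameterize a Heaps'-law-style power relation, URI-R ≈ K × URI-M^Beta. Both functional forms, and the function names, are assumptions made for illustration.

```python
ALPHA = 175.0  # bytes per CDX line, from the slide
K = 2.686      # fitted constant from the slide
BETA = 0.911   # fitted exponent from the slide


def estimate_cdx_bytes(uri_m_count):
    """Assumed linear model: CDX size grows with the number of URI-Ms."""
    return ALPHA * uri_m_count


def estimate_uri_rs(uri_m_count):
    """Assumed power-law model: URI-R count as K * URI-M ** BETA."""
    return K * uri_m_count ** BETA


# Example input: the 181M URI-Ms quoted for the UKWA 7-year subset.
cdx_bytes = estimate_cdx_bytes(181e6)
uri_rs = estimate_uri_rs(181e6)
```

With Beta below 1, the estimated number of distinct URI-Rs grows more slowly than the number of URI-Ms, which matches the intuition that archives increasingly recrawl resources they already hold.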
Space Cost (UKWA 7 Years)
Phi: 8.5e-07 -- 0.70583
Time Cost (UKWA 7 Years)
Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours
Resource Requirement
Archive-It
UKWA
Stanford
Cost vs Precision
Group                               Cost                 Precision
G1 (H1P0/TLD)                       Bound by # of TLDs   < 0.05
G2 (H3P0, DDom, DSub, DPth, DQry)   < 0.01               ≈ 2 * G1
G3 (DIni)                           ≈ 2 * G2             ≈ (3--4) * G1
G4 (HxP1)                           ≈ 5 * G3             ≈ (5--7) * G1
G5 (Higher HmPn)                    0.4 -- 0.7           Not Explored
G6 (URIR)                           1.0                  1.0
Future Work
● Generating sample URI sets
● Profiling via sampling
● Language profiles
● Evaluation of combination profiles, such as URI-Key along with Datetime
● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
Conclusions
● Generated profiles with different policies for two archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 22% routing precision with <5% relative cost without any false negatives
● Code @ GitHub:/oduwsdl/archive_profiler

TPDL 2015 - Profiling Web Archives
