Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Herbert Van de Sompel
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Memento Aggregator
Aggregates ~20 archives and counting
Only a few archives return good results for any query
Time, network, and resource wastage
Query routing can be helpful
Long Tail Matters
400B+ web pages at IA do not cover everything
Top three archives after IA produce full TimeMap 52% of the time
Targeted crawls
Special focus archives
Restricted resources
Private archives
The Portuguese Web Archive and Memento unveil the first homepage of the Smithsonian Institution from May 1995... fb.me/3VAo6gEba
PortugueseWebArchive (@PT_WebArchive), 1:12 PM - 5 Jan 2015
Dennis Ritchie's Homepage has been deleted: cm.bell-labs.com/cm/cs/who/dmr/ - and the site has a robots.txt that blocks it from the Wayback.
Jason Scott (@textfiles), 2:37 PM - 22 Apr 2015
Memento Workflow
Memento Workflow with Profile
Available Profiling Resources
Client request
Archive response
Archive index (CDX files)
A Client Request
Canonical URL
Accept-Datetime (optional)
Accept-Language (optional)
GET /timegate/http://www.cnn.com/ HTTP/1.1
Host: mementoweb.org
Accept: text/html,application/xhtml+xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Datetime: Sat, 16 Jun 2012 00:00:00 GMT
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
If-Modified-Since: Thu, 23 Apr 2015 16:51:50 GMT
If-None-Match: "7ff8-5146718929580"
Connection: keep-alive
Cookie: __unam=34c3c7d-14ce917ce62-43c38e5e-7...
User-Agent: Mozilla/5.0 Linux x86_64 Chrome/42.0.2311.90...
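As a rough illustration of the request above, the following Python sketch builds (but does not send) a TimeGate request carrying the optional Accept-Datetime and Accept-Language headers. The function name `build_timegate_request` and its parameters are hypothetical; the host and `/timegate/` path are taken from the example request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(uri_r, when=None, language=None,
                           timegate="http://mementoweb.org/timegate/"):
    """Build (without sending) a TimeGate request for the given URI-R.

    The helper name and signature are illustrative assumptions.
    """
    req = Request(timegate + uri_r)
    if when is not None:
        # Accept-Datetime carries an HTTP date in GMT; `when` must be UTC-aware.
        req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    if language is not None:
        req.add_header("Accept-Language", language)
    return req


req = build_timegate_request(
    "http://www.cnn.com/",
    when=datetime(2012, 6, 16, tzinfo=timezone.utc),
    language="en-US,en;q=0.8",
)
print(req.full_url)
print(req.header_items())
```

Passing the request to `urllib.request.urlopen` would perform the actual negotiation; that step is left out here so the sketch runs offline.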
An Archive Response
Canonical URL (known)
Memento-Datetime
Original Content-Language (optional)
HTTP/1.1 200 OK
Server: Tengine/2.0.3
Date: Sun, 26 Apr 2015 00:25:57 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 85945
Connection: keep-alive
set-cookie: wayback_server=37; Domain=archive.org; Path=/; Expires=Tue, 26-Ma
Memento-Datetime: Sat, 25 Apr 2015 13:38:16 GMT
Link: ; rel="original", ; rel="timemap"; type="application/link-format",
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Archive-Orig-content-language: en
X-Archive-Orig-x-content-type-options: nosniff
X-Archive-Orig-vary: Accept-Encoding,Cookie
X-Archive-Orig-content-type: text/html; charset=UTF-8
X-Archive-Orig-cache-control: private, s-maxage=0, max-age=0, must-revalidate
X-Archive-Orig-server: Apache
A CDX Snippet
Canonical URL
Memento Datetime
cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6
cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
i.cdn.travel.cnn.com/ 20130102083554 http://i.cdn.travel.cnn.com/ text/html
i.cdn.travel.cnn.com/ 20130404172913 http://i.cdn.travel.cnn.com/ text/html
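A minimal sketch of how such CDX lines could be split into the two attributes of interest (canonical URL and Memento datetime). The field order is assumed from the snippet above (URL key, 14-digit timestamp, original URL, MIME type, status code); real CDX files declare their own field layout, and the names `CdxRecord` and `parse_cdx_line` are hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Field names are illustrative; the order is assumed from the snippet above.
CdxRecord = namedtuple("CdxRecord", "urlkey timestamp original mimetype status")


def parse_cdx_line(line):
    """Parse the first five space-separated fields of a CDX line."""
    urlkey, ts, original, mimetype, status = line.split()[:5]
    return CdxRecord(
        urlkey=urlkey,
        timestamp=datetime.strptime(ts, "%Y%m%d%H%M%S"),
        original=original,
        mimetype=mimetype,
        status=int(status),
    )


rec = parse_cdx_line(
    "cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6"
)
print(rec.urlkey, rec.timestamp.isoformat(), rec.status)
```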
Complete URI-R Profiling
Sanderson et al. created a URI-R profile for various archives
Extracted every URI-R from all the CDX files
Gained complete knowledge of the holdings of the participating archives
Profiles were huge
Difficult to keep up-to-date
Misses URI-Rs added later in the archive
TLD-only Profiling
AlSum et al. created a TLD profile for various archives
Collected statistics about various archives on various TLDs
Lightweight profiles
Lots of false positives
All the ".com" queries will be routed to an archive that has only a few URI-Rs with ".com" TLD
Middle Ground
Partial URI-Rs, such as:
Registered domain name
Complete domain name (along with any sub-domains)
Complete domain name and first few path segments
Registered domain name and counts of other segments such as sub-domain, path, and query parameter
Combining above with other attributes such as Content-Language and Memento-Datetime
Archive Profile
High-level digest of an archive
Predicts presence of mementos of a URI-R in an archive
Provides various statistics about the holdings
Small in size
Publicly available
Easy to update and partially patch
Useful for Memento query routing and other things
Structure
Archive metadata
Statistics:
  Profile types:
    Keys: Frequency measurements
Profile types
URI-R based
Complete URI-R
TLD only
URI-R hashes, such as:
Only first few segments of the URI-R (Sub-URI)
Registered domain name along with counts of other segments (Segment-Digest)
Language
Datetime
Many more...
Keys
Depend on the profile type
Control the balance between profile size and details
URI-R: "bbc.co.uk/images/Logo.png?height=80&width=200"
TLD: ".uk"
Sub-URI: "uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"
Seg-Digest: "0/bbc.co.uk/4"
Language: "en-GB"
Datetime: "201403" # YYYYMM
Frequency Measurements
Can have the same structure for all profile types
Flexible to choose the attribute set to be included
Affects the profile complexity
Predicts the presence of the mementos of a URI-R
"uk,co,bbc)/":
  urim:
    max: 2
    min: 1
    total: 128
  urir: 115
Horizontal and Vertical Holdings
"uk,co,bbc)/":
  urim:
    max: 100
    min: 100
    total: 100
  urir: 1
"uk,co,bbc)/":
  urim:
    max: 1
    min: 1
    total: 100
  urir: 100
"uk,co,bbc)/":
  urim:
    max: 20
    min: 5
    total: 100
  urir: 10
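The three records above can be reproduced from per-URI-R memento counts. The sketch below is an assumption about how such measurements might be derived, not the authors' code; the function name `frequency_measurements` and the input shape (one memento count per URI-R under a key) are hypothetical.

```python
def frequency_measurements(mementos_per_urir):
    """Summarize the holdings under one key.

    `mementos_per_urir` holds one memento (URI-M) count per distinct URI-R.
    """
    counts = list(mementos_per_urir)
    return {
        "urim": {"max": max(counts), "min": min(counts), "total": sum(counts)},
        "urir": len(counts),
    }


# One URI-R captured 100 times (deep holdings of a single resource).
deep = frequency_measurements([100])
# 100 URI-Rs captured once each (broad holdings, one capture apiece).
broad = frequency_measurements([1] * 100)
# 10 URI-Rs with between 5 and 20 captures each.
mixed = frequency_measurements([20, 5, 10, 10, 10, 10, 10, 10, 10, 5])

print(deep, broad, mixed, sep="\n")
```

All three cases share the same total of 100 mementos, which is why the slide pairs `urim.total` with `urir`, `max`, and `min`: the total alone cannot distinguish deep holdings from broad ones.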
Sample Profile
---
"@context": "https://oduwsdl.github.io/context/archprofile.jsonld"
"@id": "http://www.webarchive.org.uk/ukwa/"
about:
  accesspoint: "http://www.webarchive.org.uk/wayback/"
  mechanism: "http://oduwsdl.github.io/terms/mechanism#cdx"
  name: "UKWA 1996 Collection"
  profile_updated: "2015-01-20T17:25:30Z"
  suburi_class: "http://oduwsdl.github.io/terms/suburi#H3P1"
  more_meta_data: "..."
stats:
  language:
    "en-US":
      urim: {max: 13, min: 1, total: 47529}
      urir: 25621
    "more_languages": "..."
  suburi:
    "uk)/":
      urim: {max: 8, min: 1, total: 932432}
      urir: 867817
    "uk,co)/":
      urim: {max: 8, min: 1, total: 410979}
      urir: 378686
URI-R Based Profiles
URI-R preprocessing
Canonicalize
Apply SURT
Split segments
Extract registered domain
Count segments (sub-domain, path, query params)
Generate all Sub-URIs
Incrementally add segments from left-to-right
Only up to max host and path segments config
Create Segment-Digest with registered domain
Prefix sub-domain count
Suffix path and query params count
Key Generation
https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f
Intermediate Values
{canonical_url: "bbc.co.uk/images/Logo.png?height=80&width=200",
 surt_url: "uk,co,bbc)/images/Logo.png?height=80&width=200",
 reg_domain: "bbc.co.uk", path_initial: "i",
 subdomain_count: 1, path_count: 2, query_params_count: 2}
Sub-URI (H3P1)
["uk)/",
 "uk,co)/",
 "uk,co,bbc)/",
 "uk,co,bbc)/images"]
SegDigest (include_path_initial)
"1/bbc.co.uk/i4"
Implementation
A Python module to generate Sub-URIs from SURT
GitHub: /oduwsdl/suburi_generator
Various profile generation scripts
GitHub: /oduwsdl/archive_profiler
Canonicalization
Remove "http(s)", "www", and fragment of a URI
Downcase hostname
Remove some known query params, e.g., "jsessionid"
Sort query params by keys and values (secondary)
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Canonicalize(URL)
# => "bbc.co.uk/images/Logo.png?height=80&width=200"
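The four canonicalization steps listed above can be sketched with Python's standard `urllib.parse`. This is a simplified illustration, not the code in the referenced repositories; the function name `canonicalize` and the `DROPPED_PARAMS` set are assumptions, and details such as ports and percent-encoding are handled only as far as the standard library does by default.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed list of session-like parameters to drop; only "jsessionid"
# is named in the slides.
DROPPED_PARAMS = {"jsessionid"}


def canonicalize(url):
    """Simplified canonicalization following the steps listed above."""
    parts = urlsplit(url)              # separates scheme, host, path, query, fragment
    host = parts.hostname or ""        # urlsplit already lowercases the hostname
    if host.startswith("www."):
        host = host[len("www."):]
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROPPED_PARAMS
    ]
    params.sort()                      # by key, then by value
    canonical = host + parts.path      # scheme and fragment are simply not re-added
    if params:
        canonical += "?" + urlencode(params)
    return canonical


print(canonicalize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"))
```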
Sort-friendly URI Reordering
Transform (SURT)
Take canonical URL as input
Join hostname segments by commas in reverse order
Separate hostname and path by closing parenthesis
Can_URL = "bbc.co.uk/images/Logo.png?height=80&width=200"
SURT(Can_URL)
# => "uk,co,bbc)/images/Logo.png?height=80&width=200"
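A minimal sketch of the transform described above, operating on an already canonicalized URL (no scheme, no "www"). The function name `surt` is hypothetical, and full SURT implementations handle more cases (ports, userinfo, schemes) than this one does.

```python
def surt(canonical_url):
    """Reverse the hostname segments and close the host part with ')'."""
    host, _, rest = canonical_url.partition("/")
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")/" + rest


print(surt("bbc.co.uk/images/Logo.png?height=80&width=200"))
```

Reversing the hostname places the most general segment first, so that keys from the same domain sort next to each other in an index.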
Sub-URI
Take SURT URL as input
Incrementally add segments from left-to-right one-by-one
Stop when the hostname or path segment limit set by the policy is reached
Return the list of all Sub-URIs
SURT_URL = "uk,co,bbc)/images/Logo.png?height=80&width=200"
SubURI(SURT_URL, policy="H3P1")
# => ["uk)/",
#     "uk,co)/",
#     "uk,co,bbc)/",
#     "uk,co,bbc)/images"]
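The Sub-URI generation can be sketched as follows. This is an illustrative reading of the `HnPm` policy names (n host segments, m path segments, `x` meaning no limit), not the code from the suburi_generator module; the function name `sub_uris` is hypothetical, and the choice to add path segments only once the complete hostname is included is an assumption.

```python
import re


def sub_uris(surt_url, policy="H3P1"):
    """Generate Sub-URI keys from a SURT URL under an HnPm policy."""
    match = re.fullmatch(r"H(\d+|x)P(\d+|x)", policy)
    if match is None:
        raise ValueError(f"unrecognized policy: {policy}")
    max_host, max_path = (None if g == "x" else int(g) for g in match.groups())

    host, _, rest = surt_url.partition(")/")
    host_segments = host.split(",")
    # The query string is ignored; only path segments extend a key.
    path_segments = [s for s in rest.split("?", 1)[0].split("/") if s]

    kept = host_segments if max_host is None else host_segments[:max_host]
    keys = [",".join(kept[:i]) + ")/" for i in range(1, len(kept) + 1)]

    # Assumption: path segments are appended only to the complete hostname.
    if len(kept) == len(host_segments):
        limit = len(path_segments) if max_path is None else min(max_path, len(path_segments))
        for j in range(1, limit + 1):
            keys.append(host + ")/" + "/".join(path_segments[:j]))
    return keys


print(sub_uris("uk,co,bbc)/images/Logo.png?height=80&width=200", policy="H3P1"))
```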
URL to Sub-URI
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Can_URL = Canonicalize(URL)
# => "bbc.co.uk/images/Logo.png?height=80&width=200"
SURT_URL = SURT(Can_URL)
# => "uk,co,bbc)/images/Logo.png?height=80&width=200"
Sub_URIs = SubURI(SURT_URL, policy="H3P1")
# => ["uk)/",
#     "uk,co)/",
#     "uk,co,bbc)/",
#     "uk,co,bbc)/images"]
Segment Count Digest
Extract registered domain name and initial letter of path
Count sub-domain and trailing (path + query) segments
Serialize as follows:
{subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}
URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
Segments = Segmentize(URL)
# => {reg_domain: "bbc.co.uk",
#     path_initial: "i",
#     subdomain_count: 1,
#     path_count: 2,
#     query_params_count: 2,
#     trailing_count: 4}
SegDigest(Segments, policy="exclude_path_initial")
# => "1/bbc.co.uk/4"
SegDigest(Segments, policy="include_path_initial")
# => "1/bbc.co.uk/i4"
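A rough Python counterpart of the two steps above. Extracting the registered domain properly requires a public-suffix lookup, which is out of scope here, so this sketch takes `reg_domain` as an argument; that parameter, the function names `segmentize` and `seg_digest`, and the boolean flag replacing the policy string are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def segmentize(url, reg_domain):
    """Count sub-domain, path, and query segments of a URL.

    `reg_domain` is supplied by the caller (assumption of this sketch).
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # already lowercased by urlsplit
    if host != reg_domain and not host.endswith("." + reg_domain):
        raise ValueError(f"{host!r} is not under {reg_domain!r}")
    sub = host[: len(host) - len(reg_domain)].rstrip(".")
    path_segments = [s for s in parts.path.split("/") if s]
    path_count = len(path_segments)
    query_params_count = len(parse_qsl(parts.query, keep_blank_values=True))
    return {
        "reg_domain": reg_domain,
        "path_initial": path_segments[0][0] if path_segments else "",
        "subdomain_count": len(sub.split(".")) if sub else 0,
        "path_count": path_count,
        "query_params_count": query_params_count,
        "trailing_count": path_count + query_params_count,
    }


def seg_digest(segments, include_path_initial=False):
    """Serialize as {subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}."""
    initial = segments["path_initial"] if include_path_initial else ""
    return f"{segments['subdomain_count']}/{segments['reg_domain']}/{initial}{segments['trailing_count']}"


segments = segmentize(
    "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f", "bbc.co.uk"
)
print(seg_digest(segments))
print(seg_digest(segments, include_path_initial=True))
```

As in the slide's example, the counts here are taken from the raw URL, so "www" is counted as one sub-domain.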
JSON Serialization
Can have complex nested data structure
JSON-LD for linked data
No partial key lookup
Unsuitable for text processing tools
Allows processing only when fully loaded
A single malformed character makes it unparsable
Difficult to patch
{
  "suburi": {
    "uk)/": {
      "urim": {
        "max": 8,
        "min": 1,
        "total": 932432
      },
      "urir": 867817
    },
    "uk,co)/": {
      "urim": {
        "max": 8,
        "min": 1,
        "total": 410979
      },
      "urir": 378686
    },
    "uk,co,bbc)/": {
      "urim": {
        "max": 2,
        "min": 1,
        "total": 128
CDX-JSON Serialization
Fusion of CDX and JSON file formats
A key followed by strict single line JSON value
Unlike CDX, values can have arbitrary attributes
Text processing tool friendly
No single root node or single document restrictions
Enables binary search
Enables partial key lookup
Error resilient
@context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
@id "http://www.webarchive.org.uk/ukwa/"
@about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
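To illustrate why a sorted key-plus-single-line-JSON layout enables binary search and partial key lookup, here is a small sketch over three of the records shown above (without the trailing commas and truncation of the slide). It assumes each line is a key, one space, and a JSON value; the function names are hypothetical.

```python
import json
from bisect import bisect_left

# Three data lines from the example above, kept sorted by key.
lines = sorted([
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
])


def parse_line(line):
    """Split a line into its key and parsed JSON value."""
    key, _, value = line.partition(" ")
    return key, json.loads(value)


def lookup(sorted_lines, key):
    """Exact key lookup by binary search; returns the value or None."""
    probe = key + " "
    i = bisect_left(sorted_lines, probe)
    if i < len(sorted_lines) and sorted_lines[i].startswith(probe):
        return parse_line(sorted_lines[i])[1]
    return None


def prefix_lookup(sorted_lines, prefix):
    """Partial key lookup: all keys that start with `prefix`."""
    i = bisect_left(sorted_lines, prefix)
    keys = []
    while i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        keys.append(parse_line(sorted_lines[i])[0])
        i += 1
    return keys


print(lookup(lines, "uk,co,bbc)/"))
print(prefix_lookup(lines, "uk,co"))
```

Because every record is parsed on its own, a malformed line affects only that record, which is one way to read the "error resilient" point above.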
Merging
Only process new data to periodically update for freshness
Parallel processing
Difficult to keep detailed measures with absolute values
Derived simple heuristic measures to predict presence of mementos
Merging Example
Base Profile
com,cnn)/ {"urir_sum": 30, "sources": 1},
uk,co,bbc)/ {"urir_sum": 20, "sources": 1}
New Profile
com,cnn)/ {"urir_sum": 10, "sources": 1},
com,usatoday)/ {"urir_sum": 5, "sources": 1}
Merged Profile
com,cnn)/ {"urir_sum": 40, "sources": 2},
uk,co,bbc)/ {"urir_sum": 20, "sources": 1},
com,usatoday)/ {"urir_sum": 5, "sources": 1}
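The merge shown above amounts to adding `urir_sum` and `sources` for keys present in both profiles and copying the rest. A minimal sketch, with the function name `merge_profiles` and the dict-of-dicts representation as assumptions:

```python
def merge_profiles(base, new):
    """Merge two profiles keyed by Sub-URI, summing the heuristic measures."""
    merged = {key: dict(value) for key, value in base.items()}
    for key, value in new.items():
        if key in merged:
            merged[key]["urir_sum"] += value["urir_sum"]
            merged[key]["sources"] += value["sources"]
        else:
            merged[key] = dict(value)
    return merged


base = {
    "com,cnn)/": {"urir_sum": 30, "sources": 1},
    "uk,co,bbc)/": {"urir_sum": 20, "sources": 1},
}
new = {
    "com,cnn)/": {"urir_sum": 10, "sources": 1},
    "com,usatoday)/": {"urir_sum": 5, "sources": 1},
}
merged = merge_profiles(base, new)
print(merged)
```

Since the same URI-R may appear in both inputs, `urir_sum` is a sum of per-source counts rather than an exact count of distinct URI-Rs, which matches the slide's point that absolute values are hard to maintain across merges.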
Dataset
Two archives
Three sample query sets
Various profiles for each archive and sample set
Archives
Archive      URI-Rs   URI-Ms   Size
Archive-It   1.9B     5.3B     1.8TB
UKWA         0.7B     1.7B     0.5TB
Sample Query Sets
Sample         Size      In Archive-It   In UKWA
DMOZ           100,000   4,042           1,896
MementoProxy   100,000   4,222           193
IAWayback      100,000   3,999           275
Evaluation
Relate CDX Size, URI-M, URI-R, and Sub-URI
Analyze profile growth
Estimate Relative Cost
Evaluate Routing Precision vs. Relative Cost
$\text{Relative Cost} = \frac{|\text{Keys in the Profile}|}{|\text{URI-R in the Archive}|}$
$\text{Routing Precision} = \frac{|\text{URI-R Present in the Archive}|}{|\text{URI-R Predicted by the Profile in Archive}|}$
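The two ratios can be computed on a toy example as follows. The data, the function names, and the registered-domain-style predictor are all made up for illustration; they are not the evaluation code or data used in the talk.

```python
def relative_cost(num_profile_keys, num_archive_urirs):
    """|Keys in the Profile| / |URI-R in the Archive|."""
    return num_profile_keys / num_archive_urirs


def routing_precision(queries, archive_urirs, profile_predicts):
    """Share of queries predicted present that are actually in the archive."""
    predicted = [q for q in queries if profile_predicts(q)]
    present = [q for q in predicted if q in archive_urirs]
    return len(present) / len(predicted) if predicted else 0.0


# Hypothetical archive holding two URI-Rs, profiled by a single domain key.
archive_urirs = {"bbc.co.uk/", "bbc.co.uk/news"}
profile_keys = {"bbc.co.uk"}


def predicts(query):
    """Toy predictor: a query matches if its host is a profile key."""
    return query.split("/")[0] in profile_keys


queries = ["bbc.co.uk/", "bbc.co.uk/sport", "cnn.com/", "bbc.co.uk/news", "bbc.co.uk/weather"]

cost = relative_cost(len(profile_keys), len(archive_urirs))
precision = routing_precision(queries, archive_urirs, predicts)
print(cost, precision)
```

In this toy case the profile routes four of the five queries to the archive and two of those are actually held, so half of the routed requests are useful.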
UKWA Dataset
Yearly data as separate collections
Average CDX line size: 275 bytes
URI-M/URI-R ratio: 2.46
Accumulated URI-R Growth (UKWA)
Successive yearly data was merged
Follows Heaps' Law
K = 3.897
β = 0.892
$C_r = K C_m^{\beta}$
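As a quick sanity check of the fitted parameters, the sketch below evaluates the Heaps' law form C_r = K * C_m^β, reading C_m as the number of URI-Ms and C_r as the number of unique URI-Rs (this reading is an assumption; the slide does not spell out the symbols). The function name is hypothetical.

```python
K = 3.897
BETA = 0.892


def estimate_urir_count(urim_count, k=K, beta=BETA):
    """Heaps' law estimate of unique URI-Rs for a given number of URI-Ms."""
    return k * urim_count ** beta


# UKWA holds about 1.7B URI-Ms according to the dataset table.
estimate = estimate_urir_count(1.7e9)
print(f"{estimate:.3g}")
```

With 1.7B URI-Ms as input, the estimate lands in the neighborhood of the roughly 0.7B URI-Rs listed for UKWA, which supports the reading of the symbols assumed here.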
Sub-URI Key Growth (UKWA)
Slope of the fit line is the Relative Cost for the profile policy
Complete URI-R profile has Relative Cost 1
Cost Analysis
Search Precision of Various Profiles
Search Precision wrt TLD-only profile
Double for H3P0
Five fold for HxP1
Segment-Digest is as good as H3P0
Relative Cost vs. Search Precision
Up to 22% routing precision with <5% Relative Cost
<0.3% sample URIs from MementoProxy and IAWayback logs present in UKWA
Shallow crawling of UKWA results in higher cost
Relative Profile Cost (UKWA)
Profile   Cost      Profile   Cost      Profile   Cost
H1P0      3.2e-06   H3P2      0.26823   HxP2      0.38313
H2P0      0.00027   H3P3      0.37343   HxP3      0.53928
H2P1      0.00059   H4P0      0.01348   HxP4      0.63889
H2P2      0.00099   H5P0      0.01388   HxP5      0.71568
H3P0      0.00862   HxP0      0.01401   HxPx      0.83107
H3P1      0.11864   HxP1      0.16349   URIR      1.00000
Future Work
Generating sample URI sets
Profiling via sampling
Language profiles
Evaluation of combination profiles such as Sub-URI along with Datetime
Profiles for usage other than Memento routing, such as:
Media-type profiles (e.g., images, pdf, audio etc.)
Site classification based profiles (e.g., news, wiki, social media, blog etc.)
Conclusions
Generated profiles with different policies for two archives
Examined cost-accuracy trade-offs of various profiles
Related CDX Size, URI-M, URI-R, and Sub-URI
Gained up to 22% routing precision with <5% relative cost without any false negatives
<5% of the queried URIs are present in each of the individual archives
Implementation code is available at:
GitHub: /oduwsdl/suburi_generator
GitHub: /oduwsdl/archive_profiler
Sawood Alam
@ibnesayeed

More Related Content

What's hot

Ceh v8 labs module 06 trojans and backdoors
Ceh v8 labs module 06 trojans and backdoorsCeh v8 labs module 06 trojans and backdoors
Ceh v8 labs module 06 trojans and backdoors
Mehrdad Jingoism
 
Modeling avengers – open source technology mix for saving the world econ fr
Modeling avengers – open source technology mix for saving the world econ frModeling avengers – open source technology mix for saving the world econ fr
Modeling avengers – open source technology mix for saving the world econ fr
Cédric Brun
 
Ceh v8 labs module 19 cryptography
Ceh v8 labs module 19 cryptographyCeh v8 labs module 19 cryptography
Ceh v8 labs module 19 cryptography
Mehrdad Jingoism
 
Maurizio_Taffone_Emerging_Security_Threats
Maurizio_Taffone_Emerging_Security_ThreatsMaurizio_Taffone_Emerging_Security_Threats
Maurizio_Taffone_Emerging_Security_Threats
Maurizio Taffone
 
TopicModelingNLPHandsOnML
TopicModelingNLPHandsOnMLTopicModelingNLPHandsOnML
TopicModelingNLPHandsOnML
Samir Aryamane
 
Hoja de vida jogc
Hoja de vida jogcHoja de vida jogc
Hoja de vida jogc
jogc62
 
From Linked Data to Tightly Integrated Data
From Linked Data to Tightly Integrated DataFrom Linked Data to Tightly Integrated Data
From Linked Data to Tightly Integrated Data
Gerard de Melo
 

What's hot (20)

Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Ceh v8 labs module 06 trojans and backdoors
Ceh v8 labs module 06 trojans and backdoorsCeh v8 labs module 06 trojans and backdoors
Ceh v8 labs module 06 trojans and backdoors
 
Modeling avengers – open source technology mix for saving the world econ fr
Modeling avengers – open source technology mix for saving the world econ frModeling avengers – open source technology mix for saving the world econ fr
Modeling avengers – open source technology mix for saving the world econ fr
 
PARASITIC COMPUTING: PROBLEMS AND ETHICAL CONSIDERATION
PARASITIC COMPUTING: PROBLEMS AND ETHICAL CONSIDERATIONPARASITIC COMPUTING: PROBLEMS AND ETHICAL CONSIDERATION
PARASITIC COMPUTING: PROBLEMS AND ETHICAL CONSIDERATION
 
Ceh v8 labs module 19 cryptography
Ceh v8 labs module 19 cryptographyCeh v8 labs module 19 cryptography
Ceh v8 labs module 19 cryptography
 
Maurizio_Taffone_Emerging_Security_Threats
Maurizio_Taffone_Emerging_Security_ThreatsMaurizio_Taffone_Emerging_Security_Threats
Maurizio_Taffone_Emerging_Security_Threats
 
Winload.efi.mui
Winload.efi.muiWinload.efi.mui
Winload.efi.mui
 
TopicModelingNLPHandsOnML
TopicModelingNLPHandsOnMLTopicModelingNLPHandsOnML
TopicModelingNLPHandsOnML
 
Hoja de vida jogc
Hoja de vida jogcHoja de vida jogc
Hoja de vida jogc
 
Newdoc maxis
Newdoc maxisNewdoc maxis
Newdoc maxis
 
Using Phing for Fun and Profit
Using Phing for Fun and ProfitUsing Phing for Fun and Profit
Using Phing for Fun and Profit
 
Using Elixir to fight Covid-19
Using Elixir to fight Covid-19Using Elixir to fight Covid-19
Using Elixir to fight Covid-19
 
Text fabric
Text fabricText fabric
Text fabric
 
209 Free Classical Mathematics Textbooks - All The Math Books You'll Ever Need
209 Free Classical Mathematics Textbooks - All The Math Books You'll Ever Need209 Free Classical Mathematics Textbooks - All The Math Books You'll Ever Need
209 Free Classical Mathematics Textbooks - All The Math Books You'll Ever Need
 
Why doesn't linux need defragmenting, Linux Defragmentation, Defragment, Linu...
Why doesn't linux need defragmenting, Linux Defragmentation, Defragment, Linu...Why doesn't linux need defragmenting, Linux Defragmentation, Defragment, Linu...
Why doesn't linux need defragmenting, Linux Defragmentation, Defragment, Linu...
 
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHPphp[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
php[world] 2016 - You Don’t Need Node.js - Async Programming in PHP
 
From Linked Data to Tightly Integrated Data
From Linked Data to Tightly Integrated DataFrom Linked Data to Tightly Integrated Data
From Linked Data to Tightly Integrated Data
 
Actor‐Network Theory VS Network Analysis VS Digital Networks Are We Talking A...
Actor‐Network Theory VS Network Analysis VS Digital Networks Are We Talking A...Actor‐Network Theory VS Network Analysis VS Digital Networks Are We Talking A...
Actor‐Network Theory VS Network Analysis VS Digital Networks Are We Talking A...
 
Spacebrew MADess: Running Your Own Server
Spacebrew MADess: Running Your Own ServerSpacebrew MADess: Running Your Own Server
Spacebrew MADess: Running Your Own Server
 
Network gateway
Network gatewayNetwork gateway
Network gateway
 

Similar to Profiling Web Archives IIPC GA 2015

Testing Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax ExamTesting Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax Exam
Henryk Konsek
 
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
Y o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docxY o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docx
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
herminaprocter
 
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
Y o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docxY o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docx
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
odiliagilby
 
Modeling avengers – open source technology mix for saving the world
Modeling avengers – open source technology mix for saving the worldModeling avengers – open source technology mix for saving the world
Modeling avengers – open source technology mix for saving the world
Cédric Brun
 
Web services
Web servicesWeb services
Web services
lopjuan
 

Similar to Profiling Web Archives IIPC GA 2015 (20)

Testing Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax ExamTesting Fuse Fabric with Pax Exam
Testing Fuse Fabric with Pax Exam
 
Awesome Traefik - Ingress Controller for Kubernetes - Swapnasagar Pradhan
Awesome Traefik - Ingress Controller for Kubernetes - Swapnasagar PradhanAwesome Traefik - Ingress Controller for Kubernetes - Swapnasagar Pradhan
Awesome Traefik - Ingress Controller for Kubernetes - Swapnasagar Pradhan
 
What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...What every C++ programmer should know about modern compilers (w/o comments, A...
What every C++ programmer should know about modern compilers (w/o comments, A...
 
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
Y o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docxY o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docx
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
 
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
Y o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docxY o u r  N a m e   L S P  2 0 0 - 3 2 0  ( c o u r s e  I .docx
Y o u r N a m e L S P 2 0 0 - 3 2 0 ( c o u r s e I .docx
 
Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016 Spring Roo 2.0 Preview at Spring I/O 2016
Spring Roo 2.0 Preview at Spring I/O 2016
 
Breathe life into your designer!
Breathe life into your designer!Breathe life into your designer!
Breathe life into your designer!
 
Obstruction lights - Lentoestevalot
Obstruction lights - LentoestevalotObstruction lights - Lentoestevalot
Obstruction lights - Lentoestevalot
 
InterCon 2016 - Internet of “Thinking” – IoT sem BS com ESP8266
InterCon 2016 - Internet of “Thinking” – IoT sem BS com ESP8266InterCon 2016 - Internet of “Thinking” – IoT sem BS com ESP8266
InterCon 2016 - Internet of “Thinking” – IoT sem BS com ESP8266
 
InterCon 2016 - Blockchain e smart-contracts em Ethereu
InterCon 2016 - Blockchain e smart-contracts em EthereuInterCon 2016 - Blockchain e smart-contracts em Ethereu
InterCon 2016 - Blockchain e smart-contracts em Ethereu
 
Modeling avengers – open source technology mix for saving the world
Modeling avengers – open source technology mix for saving the worldModeling avengers – open source technology mix for saving the world
Modeling avengers – open source technology mix for saving the world
 
Semantic SEO in the post Hummingbird Era and WordLift
Semantic SEO in the post Hummingbird Era and WordLiftSemantic SEO in the post Hummingbird Era and WordLift
Semantic SEO in the post Hummingbird Era and WordLift
 
Web services
Web servicesWeb services
Web services
 
An Introduction to CSS Preprocessors
An Introduction to CSS PreprocessorsAn Introduction to CSS Preprocessors
An Introduction to CSS Preprocessors
 
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
 
Zend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHPZend con 2016 - Asynchronous Prorgamming in PHP
Zend con 2016 - Asynchronous Prorgamming in PHP
 
An Introduction to PHP Dependency Management With Composer
An Introduction to PHP Dependency Management With ComposerAn Introduction to PHP Dependency Management With Composer
An Introduction to PHP Dependency Management With Composer
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 
Geb for Testing Your Grails Application GR8Conf India 2016
Geb for Testing Your Grails Application  GR8Conf India 2016Geb for Testing Your Grails Application  GR8Conf India 2016
Geb for Testing Your Grails Application GR8Conf India 2016
 
Reflective Teaching: Improving Library Instruction Through Self-Reflection
Reflective Teaching: Improving Library Instruction Through Self-ReflectionReflective Teaching: Improving Library Instruction Through Self-Reflection
Reflective Teaching: Improving Library Instruction Through Self-Reflection
 

More from Sawood Alam

Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
Sawood Alam
 

More from Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
Profiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento RoutingProfiling Web Archival Voids for Memento Routing
Profiling Web Archival Voids for Memento Routing
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
MementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination FrameworkMementoMap: An Archive Profile Dissemination Framework
MementoMap: An Archive Profile Dissemination Framework
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework

Profiling Web Archives IIPC GA 2015

  • 1. Profiling Web Archives. Sawood Alam and Michael L. Nelson, Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529; Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM; David S. H. Rosenthal, Stanford University Libraries, Stanford, CA
  • 2. Memento Aggregator: aggregates ~20 archives and counting; only a few archives return good results for any query; time, network, and resource wastage; query routing can be helpful
  • 3. Long Tail Matters: 400B+ web pages at IA do not cover everything; top three archives after IA produce full TimeMap 52% of the time; targeted crawls; special focus archives; restricted resources; private archives
      Tweet, PortugueseWebArchive (@PT_WebArchive), 1:12 PM - 5 Jan 2015: "The Portuguese Web Archive and Memento unveil the first homepage of the Smithsonian Institution from May 1995... fb.me/3VAo6gEba"
      Tweet, Jason Scott (@textfiles), 2:37 PM - 22 Apr 2015: "Dennis Ritchie's Homepage has been deleted: cm.bell-labs.com/cm/cs/who/dmr/ - and the site has a robots.txt that blocks it from the Wayback."
  • 6. Available Profiling Resources: client request; archive response; archive index (CDX files)
  • 7. A Client Request: Canonical URL; Accept-Datetime (optional); Accept-Language (optional)
      GET /timegate/http://www.cnn.com/ HTTP/1.1
      Host: mementoweb.org
      Accept: text/html,application/xhtml+xml;q=0.9,image/webp,*/*;q=0.8
      Accept-Encoding: gzip, deflate, sdch
      Accept-Datetime: Sat, 16 Jun 2012 00:00:00 GMT
      Accept-Language: en-US,en;q=0.8
      Cache-Control: max-age=0
      If-Modified-Since: Thu, 23 Apr 2015 16:51:50 GMT
      If-None-Match: "7ff8-5146718929580"
      Connection: keep-alive
      Cookie: __unam=34c3c7d-14ce917ce62-43c38e5e-7...
      User-Agent: Mozilla/5.0 Linux x86_64 Chrome/42.0.2311.90...
  • 8. An Archive Response: Canonical URL (known); Memento-Datetime; Original Content-Language (optional)
      HTTP/1.1 200 OK
      Server: Tengine/2.0.3
      Date: Sun, 26 Apr 2015 00:25:57 GMT
      Content-Type: text/html;charset=utf-8
      Content-Length: 85945
      Connection: keep-alive
      set-cookie: wayback_server=37; Domain=archive.org; Path=/; Expires=Tue, 26-Ma
      Memento-Datetime: Sat, 25 Apr 2015 13:38:16 GMT
      Link: ; rel="original", ; rel="timemap"; type="application/link-format",
      X-Archive-Guessed-Charset: UTF-8
      X-Archive-Orig-via: 1.1 varnish, 1.1 varnish, 1.1 varnish
      X-Archive-Orig-content-language: en
      X-Archive-Orig-x-content-type-options: nosniff
      X-Archive-Orig-vary: Accept-Encoding,Cookie
      X-Archive-Orig-content-type: text/html; charset=UTF-8
      X-Archive-Orig-cache-control: private, s-maxage=0, max-age=0, must-revalidate
      X-Archive-Orig-server: Apache
  • 9. A CDX Snippet: Canonical URL; Memento Datetime
      cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6
      cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
      cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJ
      i.cdn.travel.cnn.com/ 20130102083554 http://i.cdn.travel.cnn.com/ text/html
      i.cdn.travel.cnn.com/ 20130404172913 http://i.cdn.travel.cnn.com/ text/html
  • 10. Complete URI-R Profiling: Sanderson et al. created a URI-R profile for various archives; extracted every URI-R from all the CDX files; gained complete knowledge of the holdings of the participating archives; profiles were huge; difficult to keep up-to-date; misses URI-Rs added later in the archive
  • 11. TLD-only Profiling: AlSum et al. created a TLD profile for various archives; collected statistics about various archives on various TLDs; lightweight profiles; lots of false positives (all the ".com" queries will be routed to an archive that has only a few URI-Rs with the ".com" TLD)
  • 12. Middle Ground: partial URI-Rs, such as: registered domain name; complete domain name (along with any sub-domains); complete domain name and first few path segments; registered domain name and counts of other segments such as sub-domain, path, and query parameter. Combining the above with other attributes such as Content-Language and Memento-Datetime
  • 13. Archive Profile: high-level digest of an archive; predicts presence of mementos of a URI-R in an archive; provides various statistics about the holdings; small in size; publicly available; easy to update and partially patch; useful for Memento query routing and other things
  • 14. Structure
      Archive metadata
      Statistics:
        Profile types:
          Keys:
            Frequency measurements
  • 15. Profile types: URI-R based (complete URI-R; TLD only; URI-R hashes, such as only the first few segments of the URI-R (Sub-URI), or the registered domain name along with counts of other segments (Segment-Digest)); Language; Datetime; many more...
  • 16. Keys: depend on the profile type; control the balance between profile size and details
      URI-R: "bbc.co.uk/images/Logo.png?height=80&width=200"
      TLD: ".uk"
      Sub-URI: "uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"
      Seg-Digest: "0/bbc.co.uk/4"
      Language: "en-GB"
      Datetime: "201403" # YYYYMM
  • 17. Frequency Measurements: can have the same structure for all profile types; flexible to choose the attribute set to be included; affects the profile complexity; predicts the presence of the mementos of a URI-R
      "uk,co,bbc)/":
        urim:
          max: 2
          min: 1
          total: 128
        urir: 115
  • 18. Horizontal and Vertical Holdings
      "uk,co,bbc)/":
        urim:
          max: 100
          min: 100
          total: 100
        urir: 1
      "uk,co,bbc)/":
        urim:
          max: 1
          min: 1
          total: 100
        urir: 100
      "uk,co,bbc)/":
        urim:
          max: 20
          min: 5
          total: 100
        urir: 10
  • 19. Sample Profile
      ---
      "@context": "https://oduwsdl.github.io/context/archprofile.jsonld"
      "@id": "http://www.webarchive.org.uk/ukwa/"
      about:
        accesspoint: "http://www.webarchive.org.uk/wayback/"
        mechanism: "http://oduwsdl.github.io/terms/mechanism#cdx"
        name: "UKWA 1996 Collection"
        profile_updated: "2015-01-20T17:25:30Z"
        suburi_class: "http://oduwsdl.github.io/terms/suburi#H3P1"
        more_meta_data: "..."
      stats:
        language:
          "en-US":
            urim: {max: 13, min: 1, total: 47529}
            urir: 25621
          "more_languages": "..."
        suburi:
          "uk)/":
            urim: {max: 8, min: 1, total: 932432}
            urir: 867817
          "uk,co)/":
            urim: {max: 8, min: 1, total: 410979}
            urir: 378686
  • 20. URI-R Based Profiles
      URI-R preprocessing: canonicalize; apply SURT; split segments; extract registered domain; count segments (sub-domain, path, query params)
      Generate all Sub-URIs: incrementally add segments from left-to-right; only up to max host and path segments config
      Create Segment-Digest with registered domain: prefix sub-domain count; suffix path and query params count
  • 21. Key Generation
      https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f
      Intermediate Values
      {
        canonical_url: "bbc.co.uk/images/Logo.png?height=80&width=200",
        surt_url: "uk,co,bbc)/images/Logo.png?height=80&width=200",
        reg_domain: "bbc.co.uk",
        path_initial: "i",
        subdomain_count: 1,
        path_count: 2,
        query_params_count: 2
      }
      Sub-URI (H3P1)
      ["uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"]
      SegDigest (include_path_initial)
      "1/bbc.co.uk/i4"
  • 22. Implementation: GitHub /oduwsdl/suburi_generator: a Python module to generate Sub-URIs from SURT; GitHub /oduwsdl/archive_profiler: various profile generation scripts
  • 23. Canonicalization: remove "http(s)", "www", and the fragment of a URI; downcase hostname; remove some known query params, e.g., "jsessionid"; sort query params by keys and values (secondary)
      URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
      Canonicalize(URL)
      #=> "bbc.co.uk/images/Logo.png?height=80&width=200"
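The canonicalization steps listed on this slide can be sketched in Python as follows. This is a minimal illustration, not the code from the deck's repositories: the function name `canonicalize` and the `DROPPED_PARAMS` set are assumptions, and the slide names only "jsessionid" as a parameter to remove.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed list of parameters to drop; the slide gives only "jsessionid" as an example.
DROPPED_PARAMS = {"jsessionid"}

def canonicalize(url):
    """Drop scheme, leading "www.", and fragment; lowercase host; sort query params."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lowercased and excludes any port.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # parse_qsl percent-decodes names and values; this sketch does not re-encode them.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() not in DROPPED_PARAMS]
    params.sort()  # sort by key first, then by value
    result = host + parts.path
    if params:
        result += "?" + "&".join(f"{k}={v}" for k, v in params)
    return result

print(canonicalize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"))
```

For the slide's example URL this prints `bbc.co.uk/images/Logo.png?height=80&width=200`, matching the value shown above.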
  • 24. Sort-friendly URI Reordering Transform (SURT): take canonical URL as input; join hostname segments by commas in reverse order; separate hostname and path by closing parenthesis
      Can_URL = "bbc.co.uk/images/Logo.png?height=80&width=200"
      SURT(Can_URL)
      #=> "uk,co,bbc)/images/Logo.png?height=80&width=200"
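A small Python sketch of the transform described on this slide, assuming the input is already canonical (hostname first, no scheme) and that the path starts at the first "/". The function name `surt` is chosen here for illustration only.

```python
def surt(canonical_url):
    """Reverse the hostname labels, join them with commas, and close with ")/"."""
    # Everything before the first "/" is treated as the hostname (an assumption
    # that holds for canonical URLs of the form "host/path?query").
    host, _, rest = canonical_url.partition("/")
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")/" + rest

print(surt("bbc.co.uk/images/Logo.png?height=80&width=200"))
```

Reversing the hostname puts the TLD first, so sorted keys group URLs of the same domain hierarchy together.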
  • 25. Sub-URI: take SURT URL as input; incrementally add segments from left-to-right one-by-one; stop if the hostname or path segment limit of the policy is reached; return the list of all Sub-URIs
      SURT_URL = "uk,co,bbc)/images/Logo.png?height=80&width=200"
      SubURI(SURT_URL, policy="H3P1")
      #=> ["uk)/",
      #    "uk,co)/",
      #    "uk,co,bbc)/",
      #    "uk,co,bbc)/images"]
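The Sub-URI generation described on this slide might look like the sketch below. It reads a policy such as "H3P1" as "at most 3 host segments and 1 path segment", with "x" meaning no limit. Two points are assumptions of this sketch rather than statements from the slides: path segments are appended only when the whole hostname fits within the host limit (so every key remains a prefix of the SURT), and the query string is ignored. The name `sub_uris` is illustrative.

```python
import re

def sub_uris(surt_url, policy="H3P1"):
    """Return Sub-URI keys for a SURT URL under an HnPm policy ("x" = unlimited)."""
    match = re.fullmatch(r"H(\d+|x)P(\d+|x)", policy)
    if not match:
        raise ValueError(f"unrecognized policy: {policy}")
    max_host = None if match.group(1) == "x" else int(match.group(1))
    max_path = None if match.group(2) == "x" else int(match.group(2))

    host, _, rest = surt_url.partition(")/")
    host_segs = host.split(",")
    path_segs = [s for s in rest.split("?", 1)[0].split("/") if s]

    host_limit = len(host_segs) if max_host is None else min(max_host, len(host_segs))
    keys = [",".join(host_segs[:i]) + ")/" for i in range(1, host_limit + 1)]

    # Assumption: only extend into the path when the hostname was not truncated.
    if host_limit == len(host_segs):
        path_limit = len(path_segs) if max_path is None else min(max_path, len(path_segs))
        for j in range(1, path_limit + 1):
            keys.append(host + ")/" + "/".join(path_segs[:j]))
    return keys

print(sub_uris("uk,co,bbc)/images/Logo.png?height=80&width=200", policy="H3P1"))
```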
  • 26. URL to Sub-URI
      URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
      Can_URL = Canonicalize(URL)
      #=> "bbc.co.uk/images/Logo.png?height=80&width=200"
      SURT_URL = SURT(Can_URL)
      #=> "uk,co,bbc)/images/Logo.png?height=80&width=200"
      Sub_URIs = SubURI(SURT_URL, policy="H3P1")
      #=> ["uk)/",
      #    "uk,co)/",
      #    "uk,co,bbc)/",
      #    "uk,co,bbc)/images"]
  • 27. Segment Count Digest: extract registered domain name and initial letter of path; count sub-domain and trailing (path + query) segments; serialize as follows:
      {subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}
      URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
      Segments = Segmentize(URL)
      #=> {reg_domain: "bbc.co.uk",
      #    path_initial: "i",
      #    subdomain_count: 1,
      #    path_count: 2,
      #    query_params_count: 2,
      #    trailing_count: 4}
      SegDigest(Segments, policy="exclude_path_initial")
      #=> "1/bbc.co.uk/4"
      SegDigest(Segments, policy="include_path_initial")
      #=> "1/bbc.co.uk/i4"
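A rough Python sketch of the segment count digest shown on this slide. Finding the registered domain normally requires the Public Suffix List; the tiny `PUBLIC_SUFFIXES` set below is a toy stand-in and an assumption of this example, as are the function names `segmentize` and `seg_digest`.

```python
from urllib.parse import urlsplit, parse_qsl

# Toy stand-in for the Public Suffix List (assumption, for illustration only).
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk"}

def segmentize(url):
    """Split a URL into the counts and labels used by the digest."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Longest matching public suffix; default to treating the last label as the suffix.
    suffix_len = 1
    for n in range(1, len(labels)):
        if ".".join(labels[-n:]) in PUBLIC_SUFFIXES:
            suffix_len = n
    reg_len = min(suffix_len + 1, len(labels))
    path_segs = [s for s in parts.path.split("/") if s]
    query_count = len(parse_qsl(parts.query, keep_blank_values=True))
    return {
        "reg_domain": ".".join(labels[-reg_len:]),
        "path_initial": path_segs[0][0] if path_segs else "",
        "subdomain_count": len(labels) - reg_len,
        "path_count": len(path_segs),
        "query_params_count": query_count,
        "trailing_count": len(path_segs) + query_count,
    }

def seg_digest(segments, policy="exclude_path_initial"):
    """Serialize as {subdomain_count}/{reg_domain}/{path_initial, if included}{trailing_count}."""
    initial = segments["path_initial"] if policy == "include_path_initial" else ""
    return f'{segments["subdomain_count"]}/{segments["reg_domain"]}/{initial}{segments["trailing_count"]}'

segments = segmentize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f")
print(seg_digest(segments))
print(seg_digest(segments, policy="include_path_initial"))
```

Note that "www" is counted as a sub-domain here, which is what the slide's `subdomain_count: 1` suggests; the same URL without "www" gives a leading 0.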
  • 28. JSON Serialization: can have complex nested data structure; JSON-LD for linked data; no partial key lookup; unsuitable for text processing tools; allows processing only when fully loaded; a single malformed character makes it unparsable; difficult to patch
      {
        "suburi": {
          "uk)/": {
            "urim": {"max": 8, "min": 1, "total": 932432},
            "urir": 867817
          },
          "uk,co)/": {
            "urim": {"max": 8, "min": 1, "total": 410979},
            "urir": 378686
          },
          "uk,co,bbc)/": {
            "urim": {"max": 2, "min": 1, "total": 128
  • 29. CDX-JSON Serialization: fusion of CDX and JSON file formats; a key followed by a strict single-line JSON value; unlike CDX, values can have arbitrary attributes; text processing tool friendly; no single root node or single document restrictions; enables binary search; enables partial key lookup; error resilient
      @context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
      @id "http://www.webarchive.org.uk/ukwa/"
      @about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
      uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
      uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
      uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
      uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
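To illustrate why a sorted key-plus-JSON line format enables binary search, here is a minimal Python sketch. It assumes each line is a key, a single space, and a complete single-line JSON object with nothing after it, and that lines are sorted lexicographically. The `lookup` helper and the in-memory list are illustrative only; a real profile would be searched on disk.

```python
import bisect
import json

# Three entries taken from the slide, written as complete single-line JSON values.
PROFILE_LINES = sorted([
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
])

def lookup(lines, key):
    """Binary-search sorted lines for an exact key; return the parsed value or None."""
    probe = key + " "  # the space marks the end of the key
    i = bisect.bisect_left(lines, probe)
    if i < len(lines) and lines[i].startswith(probe):
        return json.loads(lines[i][len(probe):])
    return None

print(lookup(PROFILE_LINES, "uk,co,bbc)/"))
print(lookup(PROFILE_LINES, "com,cnn)/"))
```

Because every line parses on its own, a malformed line affects only its own key, which is the error resilience the slide refers to.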
  • 30. Merging: only process new data to periodically update for freshness; parallel processing; difficult to keep detailed measures with absolute values; derived simple heuristic measures to predict presence of mementos
  • 31. Merging Example
      Base Profile
      com,cnn)/ {"urir_sum": 30, "sources": 1},
      uk,co,bbc)/ {"urir_sum": 20, "sources": 1}
      New Profile
      com,cnn)/ {"urir_sum": 10, "sources": 1},
      com,usatoday)/ {"urir_sum": 5, "sources": 1}
      Merged Profile
      com,cnn)/ {"urir_sum": 40, "sources": 2},
      uk,co,bbc)/ {"urir_sum": 20, "sources": 1},
      com,usatoday)/ {"urir_sum": 5, "sources": 1}
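The merge in this example amounts to adding up `urir_sum` and `sources` per key. A minimal Python sketch, with the profiles held as plain dictionaries and `merge_profiles` as an illustrative name:

```python
def merge_profiles(base, new):
    """Add the per-key counters of `new` into a copy of `base`."""
    merged = {key: dict(stats) for key, stats in base.items()}
    for key, stats in new.items():
        target = merged.setdefault(key, {"urir_sum": 0, "sources": 0})
        target["urir_sum"] += stats["urir_sum"]
        target["sources"] += stats["sources"]
    return merged

base = {
    "com,cnn)/": {"urir_sum": 30, "sources": 1},
    "uk,co,bbc)/": {"urir_sum": 20, "sources": 1},
}
new = {
    "com,cnn)/": {"urir_sum": 10, "sources": 1},
    "com,usatoday)/": {"urir_sum": 5, "sources": 1},
}
for key, stats in merge_profiles(base, new).items():
    print(key, stats)
```

Since the counters are simple sums, profiles built from separate batches of data can be merged in any order, which fits the slide's point about incremental and parallel processing.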
  • 32. Dataset: two archives; three sample query sets; various profiles for each archive and sample set
  • 33. Archives
      Archive      URI-Rs   URI-Ms   Size
      Archive-It   1.9B     5.3B     1.8TB
      UKWA         0.7B     1.7B     0.5TB
  • 34. Sample Query Sets
      Sample         Size      In Archive-It   In UKWA
      DMOZ           100,000   4,042           1,896
      MementoProxy   100,000   4,222           193
      IAWayback      100,000   3,999           275
  • 35. Evaluation: relate CDX Size, URI-M, URI-R, and Sub-URI; analyze profile growth; estimate Relative Cost; evaluate Routing Precision vs. Relative Cost
      Relative Cost = |Keys in the Profile| / |URI-R in the Archive|
      Routing Precision = |URI-R Present in the Archive| / |URI-R Predicted by the Profile in Archive|
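The two ratios on this slide can be written as small Python helpers. The function names and the toy sets below are hypothetical, made up for illustration. The precision helper counts only present URI-Rs that were also predicted, which equals the slide's ratio when the profile has no false negatives.

```python
def relative_cost(profile_keys, archive_urirs):
    """|Keys in the Profile| / |URI-R in the Archive|."""
    return len(profile_keys) / len(archive_urirs)

def routing_precision(predicted_urirs, present_urirs):
    """Share of URI-Rs routed to the archive that the archive actually holds."""
    if not predicted_urirs:
        return 0.0
    return len(present_urirs & predicted_urirs) / len(predicted_urirs)

# Hypothetical toy data: one profile key summarizing four archived URI-Rs.
archive_urirs = {"bbc.co.uk/", "bbc.co.uk/news", "bbc.co.uk/sport", "bbc.co.uk/images"}
profile_keys = {"uk,co,bbc)/"}

# Five queried URI-Rs match the profile key, but only one is really archived.
predicted = {"bbc.co.uk/news", "bbc.co.uk/a", "bbc.co.uk/b", "bbc.co.uk/c", "bbc.co.uk/d"}
present = predicted & archive_urirs

print(relative_cost(profile_keys, archive_urirs))
print(routing_precision(predicted, present))
```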
  • 36. UKWA Dataset: yearly data as separate collections; average CDX line size: 275 bytes; URI-M/URI-R ratio: 2.46
  • 37. Accumulated URI-R Growth (UKWA): successive yearly data was merged; follows Heaps' Law, C_r = K * C_m^β, with K = 3.897 and β = 0.892
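As a quick check of the Heaps' Law fit on this slide, the sketch below evaluates K * C_m^β in Python. The slide does not define its symbols; this example assumes C_m is the number of URI-Ms and C_r the number of unique URI-Rs, and the function name is illustrative.

```python
K = 3.897      # fitted constant from the slide
BETA = 0.892   # fitted exponent from the slide

def estimated_urir_count(urim_count, k=K, beta=BETA):
    """Estimate unique URI-Rs (C_r) from the number of URI-Ms (C_m) via K * C_m ** beta."""
    return k * urim_count ** beta

# UKWA holds about 1.7B URI-Ms according to the Archives table.
print(f"{estimated_urir_count(1.7e9):.3e}")
```

Under that reading, 1.7B URI-Ms gives an estimate of roughly 0.67 billion URI-Rs, which is close to the 0.7B URI-Rs reported for UKWA.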
  • 38. Sub-URI Key Growth (UKWA): slope of the fit line is the Relative Cost for the profile policy; complete URI-R profile has Relative Cost 1
  • 40. Search Precision of Various Profiles: search precision wrt TLD-only profile is double for H3P0 and five-fold for HxP1; Segment-Digest is as good as H3P0
  • 41. Relative Cost vs. Search Precision: up to 22% routing precision with <5% Relative Cost; <0.3% of sample URIs from MementoProxy and IAWayback logs are present in UKWA; shallow crawling of UKWA results in higher cost
  • 42. Relative Profile Cost (UKWA)
      Profile  Cost      Profile  Cost      Profile  Cost
      H1P0     3.2e-06   H3P2     0.26823   HxP2     0.38313
      H2P0     0.00027   H3P3     0.37343   HxP3     0.53928
      H2P1     0.00059   H4P0     0.01348   HxP4     0.63889
      H2P2     0.00099   H5P0     0.01388   HxP5     0.71568
      H3P0     0.00862   HxP0     0.01401   HxPx     0.83107
      H3P1     0.11864   HxP1     0.16349   URIR     1.00000
  • 43. Future Work: generating sample URI sets; profiling via sampling; language profiles; evaluation of combination profiles such as Sub-URI along with Datetime; profiles for usage other than Memento routing, such as media-type profiles (e.g., images, pdf, audio etc.) and site classification based profiles (e.g., news, wiki, social media, blog etc.)
  • 44. Conclusions: generated profiles with different policies for two archives; examined cost-accuracy trade-offs of various profiles; related CDX Size, URI-M, URI-R, and Sub-URI; gained up to 22% routing precision with <5% relative cost without any false negatives; <5% of the queried URIs are present in each of the individual archives; implementation code is available on GitHub: /oduwsdl/suburi_generator and /oduwsdl/archive_profiler