Presentation at the Digital Libraries conference 2014 (DL 2014), in London, UK. Nominated for Best Paper award. Full paper available via: humanities.uva.nl/~kamps/publications/2014/huur:find14.pdf
Finding Pages on the Unarchived Web
Hugo Huurdeman, Anat Ben-David, Jaap Kamps
Thaer Samar, Arjen de Vries
University of Amsterdam, Centrum Wiskunde & Informatica
Presentation ACM/IEEE Digital Libraries conference 2014
Introduction
• Web archives preserve the fast-changing Web
• However, they cannot capture the entire Web due to various limitations
• Recrawling “lost” webpages is impossible
• Would it be possible to recover parts of the unarchived Web?
0.1 Background: Web archiving
• Web archives, keepers of our future cultural heritage, are inherently incomplete
• e.g. due to limitations in crawling [Masanès06]
• However, crawlers do register additional information, e.g.
• page source, link structure, server metadata, timestamps, ...
• potentially usable for analytical purposes (e.g. [Rauber02])
0.1 Background: Link evidence and anchor text
• Defining property of the Web: graph-based structure
• links: src, destination, anchor text
• Widely used in Web retrieval
• [e.g. Craswell01, Fujii08, Koolen10]
• Our approach
• inspired by previous results on Web-centric document representations
• Our use case: the Web archive
0.2 Data: Dutch Web Archive
• National Library of the Netherlands (KB)
• Selective Web archive
• 2007-now
• 10+ Terabyte
• seedlist: 8000+ websites
• 25,000+ harvests
• Our focus: one year of data (2012)
0.2 Data: extraction and processing
• extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate}
• matching with seedlist, adding KB metadata
• deduplication (per year), to correct for harvesting frequencies
• cleaning and processing, e.g. URL normalization
• MySQL DB (13M rows)
• aggregation and data enrichment, e.g. filetypes, counts, ...
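The cleaning and deduplication steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record fields (`dest`, `anchor`, `src_hash`, `crawldate`), the ISO date format, and the specific normalization rules are hypothetical and are not taken from the actual pipeline.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def source_hash(page_bytes):
    """Hash of the source page, used to identify unique linking pages (MD5 assumed)."""
    return hashlib.md5(page_bytes).hexdigest()


def normalize_url(url):
    """Illustrative normalization: lowercase host, drop 'www.', fragment,
    port and trailing slash. The real rules may differ."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def deduplicate_per_year(links):
    """Keep one record per (destination, anchor, source page, year), so that
    frequently harvested sites do not inflate the link counts."""
    seen = set()
    unique = []
    for link in links:
        key = (
            normalize_url(link["dest"]),
            link["anchor"].strip().lower(),
            link["src_hash"],
            link["crawldate"][:4],  # assumes dates such as "2012-03-01"
        )
        if key not in seen:
            seen.add(key)
            unique.append({**link, "dest": key[0]})
    return unique
```

Deduplicating on the year rather than the full crawl date reflects the stated goal of correcting for harvesting frequencies: a page harvested weekly would otherwise contribute the same link dozens of times.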
Research Questions
1. Can we recover a significant fraction of unarchived pages from references to them in the Web archive?
2. How rich are the representations that can be created for unarchived URLs?
3. Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
1. Expanding the Web archive
Can we recover a significant fraction of unarchived pages from references in the Web archive?
1.2 Unarchived content: the aura
• the aura of the Web archive
• pages not in the archive
• but whose existence can be derived from link evidence in the archive
• distinguishing
• inner aura (parent domain on the seedlist)
• outer aura (parent domain not on the seedlist)
[Figure: diagram of the Dutch Web Archive, with regions labelled 1 and 2]
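The inner/outer distinction can be sketched as a small classification function. This is an illustration under stated assumptions, not the procedure used in the study: `classify_target`, `archived_urls` and `seed_hosts` are hypothetical names, and "parent domain on the seedlist" is approximated here by host suffix matching.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lowercased host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_target(url, archived_urls, seed_hosts):
    """Label a link target as 'archived', 'inner aura' or 'outer aura'."""
    if url in archived_urls:
        return "archived"
    host = host_of(url)
    for seed in seed_hosts:
        # Same host as a seed, or a subdomain of it (assumed interpretation).
        if host == seed or host.endswith("." + seed):
            return "inner aura"
    return "outer aura"


archived = {"http://kb.nl/"}
seeds = {"kb.nl"}
print(classify_target("http://www.kb.nl/missing-page", archived, seeds))  # inner aura
print(classify_target("http://twitter.com/kb", archived, seeds))          # outer aura
```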
1.3 Characterizing the Aura: tld distribution
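[Figure: two pie charts of the TLD distribution]
• Inner aura: mainly .nl content (slices labelled 96% and 2%; legend: nl, com, org, net, other)
• Outer aura: more mixed distribution, incl. .com, .org & .net (slices labelled 35%, 31%, 18%, 10%, 5% and 2%; legend: nl, com, org, jp, net, other)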
1.3 Characterizing the Aura: coverage Alexa top 100
• Inner aura
• includes 7 of 100 most popular Dutch sites
• Outer aura
• includes 90 of 100 most popular Dutch sites (1.2M references)
[Figure: two bar charts (y-axis 0 to 280); one shows twitter.com, facebook.com, linkedin.com, hyves.nl and google.com, the other nu.nl, wikipedia, blogspot.com, kvk.nl and anwb.nl]
1.4 Expanding the Web archive: summary
• Recovered pages and hosts:
• Substantial amount
• as many references to unarchived content as pages in the archive
• Complementing sites in archive
• Indirect evidence of lost Webpages holds the potential to significantly expand the Web archive’s coverage
[Figure: bar chart of unarchived pages (M) versus archived pages (M), y-axis 0.0 to 20.0]
2. Representations
How rich are the representations that can be created for unarchived URLs?
2.1 Representations of unarchived content: indegree
• Characteristics of incoming links (indegree)
• All target representations: link from at least 1 unique source page (uniqueness based on MD5 hash)
• 18% have at least 3 unique incoming links
• 10% have 5 links or more
[Figure: subset coverage (0% to 100%) by indegree (unique source pages, 1 to 10), for inner aura and outer aura]
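Coverage figures of this kind (the share of targets with at least k unique source pages) could be computed along the following lines. This is a sketch under assumptions: the input is taken to be (target URL, source page hash) pairs, and the function names are hypothetical.

```python
from collections import defaultdict


def indegree_by_target(links):
    """Count unique source pages per target from (target_url, src_hash) pairs."""
    sources = defaultdict(set)
    for target, src_hash in links:
        sources[target].add(src_hash)
    return {target: len(hashes) for target, hashes in sources.items()}


def coverage_at_least(indegrees, k):
    """Fraction of targets whose indegree is at least k."""
    if not indegrees:
        return 0.0
    return sum(1 for d in indegrees.values() if d >= k) / len(indegrees)


links = [("a.nl", "h1"), ("a.nl", "h1"), ("a.nl", "h2"), ("a.nl", "h3"), ("b.nl", "h1")]
degrees = indegree_by_target(links)
print(coverage_at_least(degrees, 3))  # 0.5
```

Repeated links from the same source page count once, which matches the slide's notion of unique incoming links.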
2.2 Representations: anchor text distribution
• Further inspecting the richness: number of unique words
• 95% have 1 unique word or more
• christinaconcours.nl: concertagenda (5)
• 30% have 3 unique words or more
• watou2009.be: watou (3) collection (2) stories (2)
• 3% have 10 words or more
• jos.rotterdam.nl: society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1) jos (4) rotterdam (3)
[Figure: subset coverage (0% to 100%) by unique word count in anchor text (0 to 15), for inner aura and outer aura]
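Word counts like those shown for the example hosts can be produced by aggregating all anchor texts that point to one target. The sketch below is illustrative only: the tokenization rule (lowercasing, splitting on non-word characters) is an assumption, and the anchor strings are invented so that they reproduce the counts listed for watou2009.be.

```python
import re
from collections import Counter


def anchor_words(anchor_texts):
    """Aggregate the anchor texts of one target into a bag of words."""
    counts = Counter()
    for text in anchor_texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical anchors; the real anchor texts are not given in the slides.
anchors = ["Watou", "Watou collection stories", "watou collection stories"]
representation = anchor_words(anchors)
print(representation.most_common())  # [('watou', 3), ('collection', 2), ('stories', 2)]
print(len(representation))           # 3 unique words
```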
2.3 Representations: homepages & non-homepages
• Anchors often refer to homepages [e.g. Craswell01]
• In our dataset: homepages for 336K of 481K hosts (69.8%)
• homepage: vakcentrum.nl (6 unique anchors)
• non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
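For non-homepages with few anchors, words taken from the URL itself can supplement the anchor text. A minimal sketch follows, assuming that URL words are obtained by splitting host and path on non-alphanumeric characters; the actual tokenization is not described in the slides, and `url_words` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit


def url_words(url):
    """Split the host and path of a URL into lowercase word tokens.
    The query string is ignored and only ASCII letters and digits are kept."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to recognise the host
    parts = urlsplit(url)
    text = (parts.hostname or "") + " " + parts.path
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


print(url_words("nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/"))
# ['nesomexico', 'org', 'dutch', 'students', 'study', 'in', 'mexico',
#  'study', 'grants', 'and', 'loans']
```

A combined representation could then simply concatenate these tokens with the aggregated anchor words for the same target.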
2.4 Representations of unarchived content: summary
• Richness of representations:
• Results mixed:
• skewed distribution
• majority of pages: relatively sparse descriptions
• minority of pages: relatively rich descriptions
• Are the representations rich enough to characterize the page’s contents?
3. Finding Unarchived Pages
Are the representations of unarchived pages useful in practice?
3.1 Finding unarchived pages: evaluation setup
• Indexed 5.19M representations of unarchived content (outer aura)
• three indexes:
• anchT: aggregated anchor text
• urlW: only URL words
• anchTUrlW: both
• Stratified sample: 500 homepages & 500 non-homepages
• Pages (if available via IA / live Web) consulted by two annotators
• creating known-item topics (150 per category)
• inspect target page
• write down query for refinding (without knowledge of anchor text)
• result: 300 queries (~5-7 words)
3.2 Evaluation: results
• Mean Reciprocal Rank (MRR)
• average over all queries of the score of the first correct result
• score: 1/rank
• Results:
• homepages score better for anchor text representations
• URL words representation better for non-homepages
• combined representation improves MRR score for both
• average close to 0.5: in the average case, the correct result is at rank 2
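For known-item search, where each query has exactly one correct target, MRR can be computed as in the sketch below. The function names and the input format (one ranked list and one target per query) are assumptions chosen for illustration.

```python
def reciprocal_rank(ranking, target):
    """1/rank of the target in the ranked list, or 0.0 if it is not retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranking, target) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranking, target) for ranking, target in runs) / len(runs)


# A single query whose target is found at rank 2 scores 0.5.
print(mean_reciprocal_rank([(["x.nl", "target.nl"], "target.nl")]))  # 0.5
```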
3.2 Evaluation: results
• Success Rate @10: correct target page in top 10
• Similar to MRR results:
• homepages score better for anchor text
• non-homepages score better for URL words representations
• On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
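Success Rate @10 is the fraction of queries whose target page appears among the first ten results. A minimal sketch, with a hypothetical function name and the same assumed input format of one ranked list and one target per query:

```python
def success_at_k(runs, k=10):
    """Fraction of (ranking, target) pairs where the target is in the top k."""
    runs = list(runs)
    if not runs:
        return 0.0
    hits = sum(1 for ranking, target in runs if target in ranking[:k])
    return hits / len(runs)


runs = [
    (["a.nl", "b.nl", "c.nl"], "b.nl"),  # target found at rank 2
    (["d.nl", "e.nl"], "z.nl"),          # target not retrieved
]
print(success_at_k(runs, k=10))  # 0.5
```

Unlike MRR, this measure ignores where in the top 10 the target appears, so the two scores can rank systems differently.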
3.3 Evaluation: Impact of indegree
• Impact of incoming links on richness of representations
• cer.org.uk (5 anchor words)
• actionaid.org/kenya (1 anchor word)
3.3 Evaluation: Impact of indegree (unique hosts)
• Again, skewed nature:
• 251 out of 300 pages (84%) have links from 1 source
• 49 pages (16%) have links from 2 or more sources
• Higher indegree (unique hosts) results in a rise in
• mean word count
• MRR
• degree of homepages
3.3 Finding unarchived pages: summary
• Usefulness in practice
• Critical test: known-item finding
• Generally positive results
• Unavailability of pages strengthens the potential utility of the representations: 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
4.1 Conclusions
• Approach to recover significant parts of the unarchived Web
• by reconstructing descriptions based on link evidence
1. Evidence of a high number of unarchived pages
• potentially increasing archive coverage
2. Skewed distribution of generated descriptions
• popular pages have more terms
• richness tapers off quickly
3. Succinct representations are generally rich enough to identify pages
• in a known-item search setting
4.2 Future work & Discussion
• Representations could be useful in research and institutional contexts, e.g.
• helping to assess the completeness of the archive
• extending seedlists for selection-based archives
• potential representation of popular unarchived sites, excluded from archiving
• Potentially enrich Web archive systems with contextual information
• Aggregation per year: refine and extend to longitudinal case
• Assessing the impact of crawling strategies
• Incorporating additional contextual information
• e.g. text surrounding anchors
• Optimally weigh all sources of evidence, using advanced retrieval models
Acknowledgements
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
References
• [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257.
• [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346.
• [Kamps05] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308.
• [Masanès06] J. Masanès, Web archiving. Springer, 2006.
• [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129.
• [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002.
• [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.