Finding Pages on the Unarchived Web (DL 2014)

Presentation at the Digital Libraries conference 2014 (DL 2014), in London, UK. Nominated for Best Paper award. Full paper available via: humanities.uva.nl/~kamps/publications/2014/huur:find14.pdf


  1. 1. Finding Pages on the Unarchived Web • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen de Vries • University of Amsterdam, Centrum Wiskunde & Informatica • Presentation at the ACM/IEEE Digital Libraries conference 2014
  2. 2. Introduction • Web archives preserve the fast-changing Web • However, they cannot capture the entire Web due to various limitations • Recrawling “lost” webpages is impossible • Would it be possible to recover parts of the unarchived Web?
  3. 3. Background & experimental setup
  4. 4. 0.1 Background: Web archiving • Web archives, keepers of our future cultural heritage, are inherently incomplete • e.g. due to limitations in crawling [Masanès06] • However, crawlers do register additional information, e.g. • page source, link structure, server metadata, timestamps, etc. • potentially usable for analytical purposes (e.g. [Rauber02])
  5. 5. 0.1 Background: Link evidence and anchor text • Defining property of the Web: graph-based structure • links: source, destination, anchor text • Widely used in Web retrieval • e.g. [Craswell01, Fujii08, Koolen10] • Our approach • inspired by previous results on Web-centric document representations • Our use case: the Web archive
  6. 6. 0.2 Data: Dutch Web Archive • National Library of the Netherlands (KB) • Selective Web archive • 2007-now • 10+ terabytes • seedlist: 8000+ websites • 25,000+ harvests • Our focus: one year of data (2012)
  7. 7. 0.2 Data: extraction and processing • extracting links from all pages: {destination URL, anchor text, hashcode src, crawldate} • matching with seedlist, adding KB metadata • deduplication (per year), to correct for harvesting frequencies • cleaning and processing, e.g. URL normalization • MySQL DB (13M rows) • aggregation and data enrichment, e.g. filetypes, counts, etc.
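The cleaning and deduplication steps listed on this slide can be illustrated with a minimal Python sketch. The slides do not specify the normalization rules or the deduplication key, so everything here is an assumption: the rules in `normalize_url`, the record field names (`src_hash`, `dest_url`, `anchor_text`, `crawl_date`), the ISO date format, and the choice to treat links as duplicates when source hash, normalized destination, anchor text and crawl year all match.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lower-case scheme and host, drop the fragment,
    drop a default port, and strip trailing slashes from the path."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is None or parts.port == default_port:
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def deduplicate_links(links):
    """Keep one record per (source hash, destination, anchor text, year).

    Each link is a dict with the hypothetical keys src_hash, dest_url,
    anchor_text and crawl_date (an ISO date string such as "2012-03-01").
    """
    seen = set()
    unique = []
    for link in links:
        dest = normalize_url(link["dest_url"])
        key = (
            link["src_hash"],
            dest,
            link["anchor_text"].strip().lower(),
            link["crawl_date"][:4],  # year part of the ISO date
        )
        if key not in seen:
            seen.add(key)
            unique.append({**link, "dest_url": dest})
    return unique
```

Deduplicating per year in this way keeps a frequently harvested source page from inflating the link counts of its targets, which is the stated purpose of the step.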
  8. 8. Research Questions • (1) Can we recover a significant fraction of unarchived pages from references to them in the Web archive? • (2) How rich are the representations that can be created for unarchived URLs? • (3) Are the resulting derived representations of unarchived pages useful in practice? Do they capture enough of the unique page content to make them retrievable amongst millions of other pages?
  9. 9. 1. Expanding the Web archive • Can we recover a significant fraction of unarchived pages from references in the Web archive?
  10. 10. 1.1 Archived content (2012) • [Figure: Dutch Web Archive, regions 1 and 2] • 1. Contents in seedlist (2012) • 10.2M unique pages • 6,157 unique hosts • 3,413 unique domains • 16 TLDs • 2. Contents not in seedlist (2012) • 0.9M unique pages • 37,166 unique hosts • 30,367 unique domains • 181 TLDs
  11. 11. 1.2 Unarchived content: the aura • the aura of the web archive • pages not in archive • but existence can be derived from link evidence in the archive • distinguishing • inner aura (parent domain on the seedlist) • outer aura (parent domain not on the seedlist) • [Figure: Dutch Web Archive, regions 1 and 2]
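The inner/outer distinction can be sketched as a small classification function. This is only an illustration under stated assumptions: `parent_domain` is a hypothetical helper that naively takes the last two host labels (a real pipeline would likely need a public-suffix list to handle hosts such as `example.co.uk`), and `archived_urls` and `seed_domains` are assumed to be plain Python sets.

```python
from urllib.parse import urlsplit


def parent_domain(url: str) -> str:
    """Hypothetical helper: naive parent domain, i.e. the last two host labels."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def classify_link_target(url: str, archived_urls: set, seed_domains: set) -> str:
    """Label a link target as archived, inner aura, or outer aura."""
    if url in archived_urls:
        return "archived"
    if parent_domain(url) in seed_domains:
        return "inner aura"  # unarchived, but its parent domain is on the seedlist
    return "outer aura"      # unarchived, parent domain not on the seedlist
```

For example, with `seed_domains = {"kb.nl"}`, an unarchived target `http://www.kb.nl/missing` would fall in the inner aura, while `http://twitter.com/kb` would fall in the outer aura.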
  12. 12. 1.2 Unarchived content (2012) • 3. Inner aura • 5.5M unique pages • 9,039 unique hosts • 3,019 unique domains • 17 TLDs • 4. Outer aura • 5.2M unique pages • 481,797 unique hosts • 369,721 unique domains • 100 TLDs • [Figure: Dutch Web Archive, regions 1 to 4]
  13. 13. 1.3 Characterizing the Aura: TLD distribution • Inner aura: mainly .nl content [Pie chart; slice labels: 2%, 96%; legend: nl, com, org, net, other] • Outer aura: more mixed distribution (incl. .com, .org & .net) [Pie chart; slice labels: 10%, 2%, 18%, 5%, 31%, 35%; legend: nl, com, org, jp, net, other]
  14. 14. 1.3 Characterizing the Aura: coverage of Alexa top 100 • Inner aura • includes 7 of 100 most popular Dutch sites • Outer aura • includes 90 of 100 most popular Dutch sites (1.2M references) • [Bar charts, axis 0 to 280: (a) twitter.com, facebook.com, linkedin.com, hyves.nl, google.com; (b) nu.nl, wikipedia, blogspot.com, kvk.nl, anwb.nl]
  15. 15. 1.4 Expanding the Web archive: summary • Recovered pages and hosts: • Substantial amount • as many references to unarchived content as pages in the archive • Complementing sites in archive • Indirect evidence of lost Webpages holds the potential to significantly expand the Web archive’s coverage • [Bar chart, axis 0 to 20: unarchived pages (M) vs. archived pages (M)]
  16. 16. 2. Representations • How rich are the representations that can be created for unarchived URLs?
  17. 17. 2.1 Representations of unarchived content: indegree • Characteristics of incoming links (indegree) • All target representations: link from at least 1 unique page (based on MD5) • 18% have at least 3 unique incoming links • 10% have 5 links or more • [Chart: subset coverage (0% to 100%) against indegree (unique source pages, 1 to 10), for inner aura and outer aura]
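The coverage figures on this slide (share of targets with at least k unique incoming links) could be computed roughly as follows. This is a sketch, assuming links are available as `(src_hash, dest_url)` pairs, where `src_hash` stands for the hash of the source page; the function names are illustrative and not taken from the paper.

```python
from collections import defaultdict


def indegrees(links):
    """Map each destination URL to its number of unique source pages."""
    sources = defaultdict(set)
    for src_hash, dest_url in links:
        sources[dest_url].add(src_hash)  # a set ignores repeated source pages
    return {dest: len(srcs) for dest, srcs in sources.items()}


def coverage_at_least(degrees, k):
    """Fraction of targets whose indegree is at least k."""
    if not degrees:
        return 0.0
    return sum(1 for d in degrees.values() if d >= k) / len(degrees)
```

Calling `coverage_at_least(degrees, k)` for k from 1 to 10 would yield one curve of the kind plotted on the slide; by construction the value for k = 1 is 1.0, since every target has at least one incoming link.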
  18. 18. 2.2 Representations: anchor text distribution • Further inspecting the richness: number of unique words • 95% have 1 unique word or more • christinaconcours.nl: concertagenda (5) • 30% have 3 unique words or more • watou2009.be: watou (3) collection (2) stories (2) • 3% have 10 words or more • jos.rotterdam.nl: society (2) service (2) youth (2) and (2) education (2) wwwjosrotterdamnl (1) municipality (1) governance (1) jos (4) rotterdam (3) • [Chart: subset coverage (0% to 100%) against unique word count of the anchor text (0 to 15), for inner aura and outer aura]
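Word counts like those shown in the examples can be derived by aggregating all anchor texts that point to one target. The sketch below assumes a simple lower-casing tokenizer based on `\w+`; the slides do not state how the authors tokenized anchor text, and the sample anchors are invented for illustration.

```python
import re
from collections import Counter


def anchor_word_counts(anchor_texts):
    """Count word occurrences over all anchor texts pointing to one target."""
    counts = Counter()
    for text in anchor_texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Invented anchors, chosen to mirror the shape of the watou2009.be example.
counts = anchor_word_counts(
    ["Watou collection", "watou stories", "Watou: collection stories"]
)
unique_words = len(counts)
```

The number of unique words, `len(counts)`, is the quantity the slide uses as its measure of richness.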
  19. 19. 2.3 Representations: homepages & non-homepages • Anchors often refer to homepages [e.g. Craswell01] • In our dataset: homepages for 336K of 481K hosts (69.8%) • homepage: vakcentrum.nl (6 unique anchors) • non-homepage: nesomexico.org/dutch-students/study-in-mexico/study-grants-and-loans/ (2 unique anchors; combine with URL words)
  20. 20. 2.4 Representations of unarchived content: summary • Richness of representations: • Results mixed: • skewed distribution • majority of pages: relatively sparse descriptions • minority of pages: relatively rich descriptions • Are the representations rich enough to characterize the page’s contents?
  21. 21. 3. Finding Unarchived Pages • Are the representations of unarchived pages useful in practice?
  22. 22. 3.1 Finding unarchived pages: evaluation setup • Indexed 5.19M representations of unarchived content (outer aura) • three indexes: anchT (aggregated anchor text), urlW (URL words only), anchTUrlW (both) • Stratified sample: 500 homepages & 500 non-homepages • Pages (if available via IA / live Web) consulted by two annotators • creating known-item topics (150 per category) • inspect target page • write down query for refinding (without knowledge of anchor text) • result: 300 queries (~5-7 words)
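The three kinds of representation (anchor text, URL words, and both combined) can be sketched as follows. The tokenization of URLs into words is an assumption made for this example (split on every non-alphanumeric character after dropping the scheme), and the dictionary keys loosely follow the index labels on the slide.

```python
import re


def url_words(url: str) -> list:
    """Assumed tokenization: drop the scheme, split on non-alphanumerics."""
    without_scheme = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    return re.findall(r"[a-z0-9]+", without_scheme)


def build_representations(url: str, anchor_texts: list) -> dict:
    """Build the text of the three representations for one unarchived URL."""
    anch = " ".join(anchor_texts)
    urlw = " ".join(url_words(url))
    return {
        "anchT": anch,                            # aggregated anchor text
        "urlW": urlw,                             # URL words only
        "anchTUrlW": f"{anch} {urlw}".strip(),    # both combined
    }
```

Each of the three strings would then be indexed by a retrieval system as the stand-in document for the unarchived page. For deep URLs such as the nesomexico.org example from the earlier slide, the URL words add terms like "study" and "grants" that sparse anchor text may lack.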
  23. 23. 3.2 Evaluation: results • Mean Reciprocal Rank (MRR) • score per query: 1/rank of the first correct result • MRR: average of these scores over all queries • Results: • homepages score better for anchor text representations • URL words representation better for non-homepages • combined representation improves MRR score for both • average close to 0.5: in the average case, the correct result is at the 2nd rank
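As a reminder of how MRR is computed for known-item queries, here is a minimal sketch. It assumes each query has exactly one correct target, that a run is a `(ranked_urls, target_url)` pair, and that a query whose target is not retrieved scores 0 (a common convention that the slides do not spell out).

```python
def reciprocal_rank(ranked_urls, target_url):
    """Return 1/rank of the target in the ranking, or 0.0 if it is absent."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url == target_url:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_urls, target_url) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, target) for ranked, target in runs) / len(runs)
```

Note that an MRR of 0.5 can arise from a mix of first-rank hits and misses as well as from consistent second-rank hits, so the "2nd rank" reading is an average-case interpretation.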
  24. 24. 3.2 Evaluation: results • Success Rate @10: correct target page in top 10 • Similar to MRR results: • homepages score better for anchor text • non-homepages score better for URL words representations • On average, 59.7% of the correct homepages and non-homepages can be retrieved in the top 10
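Success Rate @10 can be sketched in the same style: the fraction of queries whose target appears among the first ten results. The run format, a list of `(ranked_urls, target_url)` pairs, is an assumption made for this example.

```python
def success_at_k(runs, k=10):
    """Fraction of (ranked_urls, target_url) pairs with the target in the top k."""
    if not runs:
        return 0.0
    hits = sum(1 for ranked, target in runs if target in ranked[:k])
    return hits / len(runs)
```

Unlike MRR, this measure ignores where in the top 10 the target appears, so it reflects whether a user scanning the first result page would find the page at all.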
  25. 25. 3.3 Evaluation: Impact of indegree • Impact of incoming links on richness of representations • cer.org.uk (5 anchor words) • actionaid.org/kenya (1 anchor word)
  26. 26. 3.3 Evaluation: Impact of indegree (unique hosts) • Again, skewed nature: • 251 out of 300 pages (84%) have links from 1 source • 49 pages (16%) have links from 2 or more sources • Higher indegree (unique hosts) results in rise in • mean word count • MRR • degree of homepages
  27. 27. 3.3 Finding unarchived pages: summary • Usefulness in practice • Critical test: known-item finding • Generally positive results • Unavailability of pages strengthens the potential utility of the representations: • 20.1% of homepages and 45.4% of non-homepages are not available via the live Web or the Internet Archive
  28. 28. 4. Conclusion and Discussion
  29. 29. 4.1 Conclusions • Approach to recover significant parts of the unarchived Web • by reconstructing descriptions based on link evidence • 1. Evidence of a high number of unarchived pages • potentially increasing archive coverage • 2. Skewed distribution of generated descriptions • popular pages have more terms • richness tapers off quickly • 3. Succinct representations generally rich enough to identify pages • in a known-item search setting
  30. 30. 4.2 Future work & Discussion • Representations could be useful in research and institutional context, e.g. • helping to assess the completeness of the archive • extending seedlists for selection-based archives • potentially representing popular unarchived sites that are excluded from archiving • Potentially enrich web archive systems with contextual information • Aggregation per year: refine and extend to longitudinal case • Assessing the impact of crawling strategies • Incorporating additional contextual information • e.g. text surrounding anchors • Optimally weigh all sources of evidence, using advanced retrieval models
  31. 31. Acknowledgements • We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. • This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
  32. 32. References • [Craswell01] N. Craswell, D. Hawking, and S. Robertson, “Effective site finding using link anchor information,” in SIGIR. ACM, 2001, pp. 250–257. • [Fujii08] A. Fujii, “Modeling anchor text and classifying queries to enhance web document retrieval,” in WWW, J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, Eds. ACM, 2008, pp. 337–346. • [Kamps06] J. Kamps, “Web-centric language models,” in CIKM, O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, Eds. ACM, 2005, pp. 307–308. • [Masanès06] J. Masanès, Web archiving. Springer, 2006. • [Koolen10] M. Koolen and J. Kamps, “The importance of anchor text for ad hoc search revisited,” in SIGIR, F. Crestani, S. Marchand-Maillet, H.-H. Chen, E. N. Efthimiadis, and J. Savoy, Eds. ACM, 2010, pp. 122–129. • [Rauber02] A. Rauber, R. M. Bruckner, A. Aschenbrenner, O. Witvoet, and M. Kaiser, “Uncovering information hidden in web archives: A glimpse at web analysis building on data warehouses,” D-Lib Magazine, vol. 8, no. 12, 2002. • [Unesco03] UNESCO, “Charter on the preservation of digital heritage (article 3.4),” 2003.