
Web Archiving Profile - WADL 2013


Published in: Education, Technology
  1. 1. Web Archiving Profile Overview. Ahmed AlSum, PhD Candidate, Old Dominion University. Web Archiving and Digital Libraries (WADL 2013), a workshop at JCDL 2013, July 25-26, 2013, Indianapolis, Indiana, USA
  2. 2. What is the problem? • Web archives are black boxes: they are accessible only through a search box (full-text search or URI lookup) • We need to profile (characterize) web archives around the world by properties such as: o Age o Top-level domains o Languages o Growth rate
  3. 3. Why • To optimize query routing for the Memento Aggregator • To determine which parts of the web are missing from the archives
  4. 4. Who (columns: Full text | URI-lookup) • Internet Archive: x • Library of Congress: x • Icelandic Web Archive: x • Library and Archives Canada: x x • British Library: x x • UK National Library: x x • Portuguese Web Archive: x x • Web Archive of Catalonia: x x • Croatian Web Archive: x x • Archive of the Czech Web: x x • National Taiwan University: x x • Archive IT: x x
  5. 5. How • Sample URIs from different sources • Retrieve the TimeMap for each sampled URI from each archive • Analyze the TimeMaps
  6. 6. URI Sample Sources • Web: 1. DMOZ – Random sample 2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs) 3. DMOZ – Languages: 100 URIs for each language (24 languages) • Web Archives: 4. Top 1-grams from Bing 5. Top 1000 query terms from Yahoo in 9 languages • User requests: 6. IA Wayback Machine log files 7. Memento aggregator log files * We used hostnames only
  7. 7. General Coverage
  8. 8. Web Archive Growth Rate
  9. 9. TLD Sample Coverage
  10. 10. TLD per archive (TLD Sample)
  11. 11. TLD per archive (Fulltext search)
  12. 12. TLD across archives
  13. 13. Languages distribution per archive
  14. 14. Query Routing Evaluation
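
The "How" steps on slide 5 (sample URIs, retrieve TimeMaps, analyze them) might be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes TimeMaps arrive in the link format defined by the Memento protocol (RFC 7089), and the function names, the `archive.example` host, and the sample data are all hypothetical. The TLD is taken naively as the last label of the hostname, in line with the note on slide 6 that only hostnames were used.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# One link-value in a link-format TimeMap: <uri>; attr="value"; attr="value"
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def hostname_and_tld(uri):
    """Reduce a URI to its hostname and last label (a naive TLD guess)."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return host, tld


def parse_timemap(text):
    """Return (memento_uri, datetime_string) pairs from a link-format TimeMap."""
    mementos = []
    for uri, attrs in LINK_RE.findall(text):
        params = dict(ATTR_RE.findall(attrs))
        # rel may hold several values, e.g. "first memento"
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((uri, params["datetime"]))
    return mementos


def profile_archive(timemaps):
    """Tally mementos per TLD and per capture year for one archive.

    `timemaps` maps each sampled original URI to the TimeMap text
    that the archive returned for it (hypothetical input shape).
    """
    tld_counts, year_counts = Counter(), Counter()
    for original_uri, text in timemaps.items():
        _, tld = hostname_and_tld(original_uri)
        for _, dt in parse_timemap(text):
            tld_counts[tld] += 1
            year_counts[parsedate_to_datetime(dt).year] += 1
    return tld_counts, year_counts


# Hypothetical TimeMap for one sampled URI; archive.example is a placeholder.
SAMPLE = (
    '<http://example.org/>; rel="original", '
    '<http://archive.example/timemap/http://example.org/>; rel="self"; '
    'type="application/link-format", '
    '<http://archive.example/20010911203610/http://example.org/>; '
    'rel="first memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT", '
    '<http://archive.example/20130725000000/http://example.org/>; '
    'rel="last memento"; datetime="Thu, 25 Jul 2013 00:00:00 GMT"'
)

if __name__ == "__main__":
    tlds, years = profile_archive({"http://example.org/": SAMPLE})
    print(dict(tlds))
    print(dict(years))
```

Per-archive counts of this kind are one possible input for the TLD and growth-rate views listed on slides 8 to 12; the slides do not say how those figures were actually computed.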