Web Archiving Profile - WADL 2013

  1. 1. Web Archiving Profile Overview • Ahmed AlSum, PhD Candidate, Old Dominion University • Web Archiving and Digital Libraries (WADL 2013), a workshop at JCDL 2013 • July 25-26, 2013, Indianapolis, Indiana, USA
  2. 2. What is the problem? • Web archives are black boxes: they are accessible only through a search text box (full-text or URI-lookup) • We need to profile/characterize the web archives around the world by properties such as: o Age o Top-level domains o Languages o Growth rate
  3. 3. Why • To optimize query routing for the Memento Aggregator. • To determine which parts of the web are missing from the archives.
  4. 4. Who (columns: Full text, URI-lookup) • Marked in both columns: Library and Archives Canada, British Library, UK National Library, Portuguese Web Archive, Web Archive of Catalonia, Croatian Web Archive, Archive of the Czech Web, National Taiwan University, Archive-It • Marked in one column only: Internet Archive, Library of Congress, Icelandic Web Archive
  5. 5. How • Sampling from different sources • Retrieve the TimeMap from each archive • Analyze the TimeMaps
  6. 6. URI Sample Sources • Web: 1. DMOZ – Random sample 2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs) 3. DMOZ – Languages: 100 URIs for each language (24 languages) • Web archives: 4. Top 1-grams from Bing 5. Top 1000 query terms from Yahoo in 9 languages • User requests: 6. IA Wayback Machine log files 7. Memento Aggregator log files * We used hostnames only
  7. 7. General Coverage
  8. 8. Web Archive Growth Rate
  9. 9. TLD Sample Coverage
  10. 10. TLD per archive (TLD Sample)
  11. 11. TLD per archive (Fulltext search)
  12. 12. TLD across archives
  13. 13. Languages distribution per archive
  14. 14. Query Routing Evaluation
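
The "How" steps in slide 5 (retrieve a TimeMap per archive, then analyze it) and the query-routing motivation in slide 3 can be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: it assumes TimeMaps arrive as link-format text, that a sampled hostname counts as covered by an archive when its TimeMap lists at least one memento, and that routing simply ranks archives by how many sampled hostnames they hold for the query's TLD. The names parse_timemap, tld_of, build_profile, and route are hypothetical, and the ranking rule is an assumption.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# Matches link-format entries of the form:
#   <memento-uri>; rel="... memento ..."; datetime="HTTP-date"
# Assumption: rel comes directly before datetime, as in common TimeMap output.
ENTRY_RE = re.compile(r'<([^>]+)>;\s*rel="([^"]*)";\s*datetime="([^"]+)"')


def parse_timemap(text):
    """Return (memento_uri, datetime) pairs found in a link-format TimeMap."""
    mementos = []
    for uri, rel, stamp in ENTRY_RE.findall(text):
        if "memento" in rel.split():
            mementos.append((uri, parsedate_to_datetime(stamp)))
    return mementos


def tld_of(uri_or_host):
    """Return the last hostname label, e.g. 'is' for 'example.is'."""
    if "//" not in uri_or_host:
        uri_or_host = "//" + uri_or_host  # let urlsplit treat it as a host
    host = urlsplit(uri_or_host).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def build_profile(samples):
    """samples: iterable of (archive_name, sampled_host, timemap_text).

    Returns {archive_name: Counter({tld: covered_hosts})}.
    """
    profile = {}
    for archive, host, timemap_text in samples:
        if parse_timemap(timemap_text):  # at least one memento found
            profile.setdefault(archive, Counter())[tld_of(host)] += 1
    return profile


def route(profile, uri, top_k=3):
    """Hypothetical routing rule: archives ranked by coverage of the URI's TLD."""
    tld = tld_of(uri)
    ranked = sorted(profile, key=lambda name: profile[name][tld], reverse=True)
    return [name for name in ranked if profile[name][tld] > 0][:top_k]


# Made-up TimeMap for a made-up archive, for illustration only.
SAMPLE_TIMEMAP = (
    '<http://example.is/>; rel="original",\n'
    '<http://archive.example/web/20010911203610/http://example.is/>; '
    'rel="first memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT",\n'
    '<http://archive.example/web/20120101000000/http://example.is/>; '
    'rel="last memento"; datetime="Sun, 01 Jan 2012 00:00:00 GMT"'
)

profile = build_profile([
    ("Archive A", "example.is", SAMPLE_TIMEMAP),
    ("Archive B", "example.is", ""),  # empty TimeMap: not covered
])
candidates = route(profile, "http://another.is/page")
```

With a profile like this, an aggregator could send a request for a given URI only to the archives returned by route instead of polling every archive; the age and growth-rate properties from slide 2 could be derived from the datetimes that parse_timemap returns.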