More Archives, More Better

Memento Aggregator Update at IIPC 2013 Ljubljana Slovenia April 23, 2013 http://netpreserve.org/general-assembly/2013/

  1. 1. More Archives, More Better. Michael L. Nelson, Old Dominion University, ws-dl.blogspot.com. IIPC General Assembly, Ljubljana, Slovenia, April 23, 2013
  2. 2. Three Easy Pieces
     • "An Evaluation of Caching Policies for Memento TimeMaps"
       – 4000 aggregated TimeMaps downloaded daily for 3 months
       – 20% of the time the TimeMaps shrink
     • "How Much of The Web Is Archived?"
       – 4000 URIs, 9 archives, 3 search engines
       – 16% to 79% of the web archived
     • "Profiling Web Archive Coverage for Top-Level Domain and Content Language"
       – 153329 URIs, 12 archives
       – querying only top 3 archives gives a complete TimeMap 84% of the time (52% of the time even if you exclude the IA)
  3. 3. An Evaluation of Caching Policies for Memento TimeMaps. JCDL 2013. Justin Brunelle, Michael L. Nelson
  4. 4. Mean # Mementos per TimeMap per Day: download the same 4000 TimeMaps every day. Chart annotations: ODU OS upgrade, IA API changes, ODU power outage
  5. 5. Frequency of TimeMap changes over 92 days
  6. 6. Optimal TimeMap Cache TTL = 15 days: minimizes queries to archives, minimizes "lost" mementos*days, will only cache new TimeMap if it is "bigger". Question: can we do this adaptively?
  7. 7. How Much of The Web Is Archived? JCDL 2011. Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson
  8. 8. Public Archives, ca. late 2010 / early 2011. Three categories of archives:
     • Internet Archive (classic interface)
     • Search engine
     • Other archives (figure labels: UK, US)
  9. 9. 1000 URIs, ordered by first observation date. See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  10. 10. see also: http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
  11. 11. How Much of the Web is Archived? It depends on which web…
      Including SE cache / Excluding SE cache:
      • 90% / 79%
      • 97% / 68%
      • 35% / 16%
      • 88% / 19%
      Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period
  12. 12. Profiling Web Archive Coverage for Top-Level Domain and Content Language (submitted for publication). Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, Herbert Van de Sompel
  13. 13. 12 (IIPC) Archives. 153329 URIs from DMOZ, archive fulltext search, IA logs, Memento aggregator logs
  14. 14. Temporal Spread
  15. 15. Rate of Acquiring URI-Rs, URI-Ms
  16. 16. TLD / Archive (DMOZ TLD sample; others similar)
  17. 17. Archive / TLD Heatmap
  18. 18. Using Only Top-k Archives for URI Lookup Yields Good Results. Even when there are 100s of archives, we only need to talk to a few.
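
The top-k idea on slide 18 can be illustrated with a small sketch. It assumes we already know, for a sample of URIs, which archives hold mementos of each one; it then ranks archives by how many sampled URIs they cover and measures how often querying only the top k would still return a complete TimeMap. The function names, the `holdings` structure, and the archive names are all hypothetical and the data is a toy example; this is not the authors' code or their dataset.

```python
from collections import Counter
from typing import Dict, List, Set


def rank_archives(holdings: Dict[str, Set[str]]) -> List[str]:
    """Order archives by how many sampled URIs they hold (ties broken by name)."""
    counts: Counter = Counter()
    for archives in holdings.values():
        counts.update(archives)
    return [name for name, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def complete_fraction(holdings: Dict[str, Set[str]], top_k: List[str]) -> float:
    """Fraction of archived URIs whose holding archives all fall inside top_k.

    A URI counts as "complete" when no archive outside top_k holds a copy,
    i.e. asking only the top-k archives misses nothing for that URI.
    """
    archived = [archives for archives in holdings.values() if archives]
    if not archived:
        return 0.0
    selected = set(top_k)
    complete = sum(1 for archives in archived if archives <= selected)
    return complete / len(archived)


if __name__ == "__main__":
    # Toy sample: URI -> set of (made-up) archives holding at least one memento.
    sample = {
        "http://a.example/": {"ia"},
        "http://b.example/": {"ia", "archive-b"},
        "http://c.example/": {"archive-b", "archive-c"},
        "http://d.example/": {"archive-d"},
        "http://e.example/": set(),  # never archived; ignored in the fraction
    }
    ranking = rank_archives(sample)
    for k in range(1, len(ranking) + 1):
        print(k, ranking[:k], complete_fraction(sample, ranking[:k]))
```

Working at the granularity of "which archives hold this URI" keeps the sketch short; a fuller version would compare the actual memento lists returned by each archive.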

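
The caching rule summarized on slide 6 (a 15-day TTL, replacing a cached TimeMap only when the new one is "bigger") might look roughly like the following. This is a minimal sketch, not the aggregator's implementation: the class and method names, the injected `fetch` callable, and the choice to restart the TTL clock when a smaller TimeMap is rejected are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List

DEFAULT_TTL = timedelta(days=15)  # TTL reported as optimal on slide 6


@dataclass
class CacheEntry:
    mementos: List[str]   # memento URIs listed in the cached TimeMap
    fetched_at: datetime  # when the archives were last queried


class TimeMapCache:
    """Hypothetical TimeMap cache keyed by original-resource URI."""

    def __init__(self, fetch: Callable[[str], List[str]], ttl: timedelta = DEFAULT_TTL):
        self._fetch = fetch  # assumed helper that queries the archives
        self._ttl = ttl
        self._entries: Dict[str, CacheEntry] = {}

    def get(self, uri: str, now: datetime) -> List[str]:
        entry = self._entries.get(uri)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.mementos  # still fresh: no query to the archives

        fresh = self._fetch(uri)
        if entry is None or len(fresh) > len(entry.mementos):
            # First fetch, or the new TimeMap is bigger: cache it.
            entry = CacheEntry(fresh, now)
        else:
            # New TimeMap is not bigger (it may have shrunk): keep the old
            # mementos, but restart the TTL clock (an assumption of this sketch).
            entry = CacheEntry(entry.mementos, now)
        self._entries[uri] = entry
        return entry.mementos
```

Passing `now` explicitly rather than calling `datetime.now()` inside `get` keeps the policy easy to exercise with simulated dates, and an adaptive variant (the open question on the slide) could vary `ttl` per URI.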