More Archives, More Better


Memento Aggregator Update at IIPC 2013, Ljubljana, Slovenia, April 23, 2013. http://netpreserve.org/general-assembly/2013/


Published in Technology, Education
  • 1. More Archives, More Better. Michael L. Nelson, Old Dominion University, ws-dl.blogspot.com. IIPC General Assembly, Ljubljana, Slovenia, April 23, 2013
  • 2. Three Easy Pieces
      • "An Evaluation of Caching Policies for Memento TimeMaps"
          – 4000 aggregated TimeMaps downloaded daily for 3 months
          – 20% of the time the TimeMaps shrink
      • "How Much of The Web Is Archived?"
          – 4000 URIs, 9 archives, 3 search engines
          – 16% to 79% of the web archived
      • "Profiling Web Archive Coverage for Top-Level Domain and Content Language"
          – 153329 URIs, 12 archives
          – querying only top 3 archives gives a complete TimeMap 84% of the time (52% of the time even if you exclude the IA)
  • 3. An Evaluation of Caching Policies for Memento TimeMaps. JCDL 2013. Justin Brunelle, Michael L. Nelson
  • 4. Mean # Mementos per TimeMap per Day: download the same 4000 TimeMaps every day. Events annotated on the plot: ODU OS upgrade, IA API changes, ODU power outage
  • 5. Frequency of TimeMap changes over 92 days
  • 6. Optimal TimeMap Cache TTL = 15 days: minimizes queries to archives, minimizes "lost" mementos*days; will only cache new TimeMap if it is "bigger". Question: can we do this adaptively?
  • 7. How Much of The Web Is Archived? JCDL 2011. Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson
  • 8. Public Archives, ca. late 2010 / early 2011. Three categories of archives:
      • Internet Archive (classic interface)
      • Search engine
      • Other archives
    Figure labels: UK, US
  • 9. 1000 URIs, ordered by first observation date. See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  • 10. see also: http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
  • 11. How Much of the Web is Archived? It depends on which web… (one row per URI sample; the sample labels did not survive in the slide text)
        Including SE cache | Excluding SE cache
        90%                | 79%
        97%                | 68%
        35%                | 16%
        88%                | 19%
    Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period
  • 12. Profiling Web Archive Coverage for Top-Level Domain and Content Language (submitted for publication). Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, Herbert Van de Sompel
  • 13. 12 (IIPC) Archives. 153329 URIs from DMOZ, archive fulltext search, IA logs, Memento aggregator logs
  • 14. Temporal Spread
  • 15. Rate of Acquiring URI-Rs, URI-Ms
  • 16. TLD / Archive (DMOZ TLD sample; others similar)
  • 17. Archive / TLD Heatmap
  • 18. Using Only Top-k Archives for URI Lookup Yields Good Results. Even when there are 100s of archives, we only need to talk to a few.
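
Two ideas from the slides lend themselves to a small illustration: the TimeMap caching policy on slide 6 (15-day TTL, replace the cached copy only if the new TimeMap is "bigger") and the top-k archive lookup on slide 18. The Python below is a minimal sketch under stated assumptions, not the aggregator's actual code. The names `TimeMapCache`, `fetch`, and `top_k_lookup` are hypothetical, "bigger" is taken to mean "lists more mementos", and restarting the TTL when a smaller TimeMap is rejected is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

TTL_DAYS = 15  # cache lifetime reported on slide 6


@dataclass
class CachedTimeMap:
    mementos: List[str]  # URI-Ms listed in the cached TimeMap
    fetched_day: int     # day number on which the archives were last queried


@dataclass
class TimeMapCache:
    # Hypothetical callable: takes a URI-R, returns the list of URI-Ms.
    fetch: Callable[[str], List[str]]
    ttl_days: int = TTL_DAYS
    entries: Dict[str, CachedTimeMap] = field(default_factory=dict)

    def get(self, uri_r: str, today: int) -> List[str]:
        cached = self.entries.get(uri_r)
        if cached is not None and today - cached.fetched_day < self.ttl_days:
            return cached.mementos  # still fresh: no query to the archives
        fresh = self.fetch(uri_r)
        if cached is None or len(fresh) > len(cached.mementos):
            # Only a "bigger" TimeMap replaces the cached copy.
            cached = CachedTimeMap(fresh, today)
            self.entries[uri_r] = cached
        else:
            # Assumption: keep the larger old copy and restart its TTL.
            cached.fetched_day = today
        return cached.mementos


def top_k_lookup(
    uri_r: str,
    archives: Sequence[Tuple[str, Callable[[str], List[str]]]],
    k: int = 3,
) -> List[str]:
    """Query only the first k archives and merge their URI-Ms.

    `archives` is assumed to be ordered by expected coverage, best first.
    """
    merged: List[str] = []
    seen = set()
    for _name, lookup in archives[:k]:
        for uri_m in lookup(uri_r):
            if uri_m not in seen:  # drop duplicates, keep first-seen order
                seen.add(uri_m)
                merged.append(uri_m)
    return merged
```

With this shape, a shrinking TimeMap (seen 20% of the time in the study) never overwrites a larger cached copy, and the lookup cost stays fixed at k archive queries however many archives are registered.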