More Archives, More Better

Memento Aggregator Update at IIPC 2013, Ljubljana, Slovenia, April 23, 2013. http://netpreserve.org/general-assembly/2013/

Published in: Technology, Education

Transcript

  • 1. More Archives, More Better. Michael L. Nelson, Old Dominion University, ws-dl.blogspot.com. IIPC General Assembly, Ljubljana, Slovenia, April 23, 2013
  • 2. Three Easy Pieces
      • "An Evaluation of Caching Policies for Memento TimeMaps"
          – 4000 aggregated TimeMaps downloaded daily for 3 months
          – 20% of the time the TimeMaps shrink
      • "How Much of The Web Is Archived?"
          – 4000 URIs, 9 archives, 3 search engines
          – 16% to 79% of the web archived
      • "Profiling Web Archive Coverage for Top-Level Domain and Content Language"
          – 153329 URIs, 12 archives
          – querying only the top 3 archives gives a complete TimeMap 84% of the time (52% of the time even if you exclude the IA)
  • 3. An Evaluation of Caching Policies for Memento TimeMaps. JCDL 2013. Justin Brunelle, Michael L. Nelson
  • 4. Mean # Mementos per TimeMap per Day: download the same 4000 TimeMaps every day. Plot annotations: ODU OS upgrade, IA API changes, ODU power outage
  • 5. Frequency of TimeMap changes over 92 days
  • 6. Optimal TimeMap Cache TTL = 15 days: minimizes queries to archives, minimizes "lost" mementos*days, will only cache new TimeMap if it is "bigger". Question: can we do this adaptively?
  • 7. How Much of The Web Is Archived? JCDL 2011. Scott Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson
  • 8. Public Archives, ca. late 2010 / early 2011. Three categories of archives:
      • Internet Archive (classic interface)
      • Search engine
      • Other archives: UK, US
  • 9. 1000 URIs, ordered by first observation date. See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
  • 10. see also: http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
  • 11. How Much of the Web is Archived? It depends on which web…
      Including SE cache / Excluding SE cache:
          90% / 79%
          97% / 68%
          35% / 16%
          88% / 19%
      Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period
  • 12. Profiling Web Archive Coverage for Top-Level Domain and Content Language (submitted for publication). Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, Herbert Van de Sompel
  • 13. 12 (IIPC) Archives. 153329 URIs from DMOZ, archive fulltext search, IA logs, Memento aggregator logs
  • 14. Temporal Spread
  • 15. Rate of Acquiring URI-Rs, URI-Ms
  • 16. TLD / Archive (DMOZ TLD sample; others similar)
  • 17. Archive / TLD Heatmap
  • 18. Using Only Top-k Archives for URI Lookup Yields Good Results. Even when there are 100s of archives, we only need to talk to a few.
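
Two ideas from the slides combine naturally in an aggregator: cache each TimeMap for 15 days and replace it only when a fresh copy is "bigger" (slide 6), and on a cache miss query only the top-k archives (slides 2 and 18). The following is a minimal sketch of that combination, not the authors' implementation. The names `TimeMapCache`, `aggregate`, `lookup`, and the `fetch` callable are hypothetical, days are plain integers to keep the example deterministic, and refreshing the timestamp when a smaller TimeMap is rejected is an assumption the slides do not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

TTL_DAYS = 15  # cache lifetime reported as optimal on slide 6
TOP_K = 3      # slide 2: the top 3 archives suffice 84% of the time

# Hypothetical fetcher signature: (archive_name, uri) -> list of memento URIs.
Fetch = Callable[[str, str], List[str]]


@dataclass
class CachedTimeMap:
    mementos: List[str]
    fetched_day: int


@dataclass
class TimeMapCache:
    ttl_days: int = TTL_DAYS
    entries: Dict[str, CachedTimeMap] = field(default_factory=dict)

    def is_fresh(self, uri: str, today: int) -> bool:
        entry = self.entries.get(uri)
        return entry is not None and today - entry.fetched_day < self.ttl_days

    def store(self, uri: str, mementos: List[str], today: int) -> bool:
        """Cache the new TimeMap only if it is bigger than the cached one.

        Returns True when the cached mementos were replaced.
        """
        entry = self.entries.get(uri)
        if entry is None or len(mementos) > len(entry.mementos):
            self.entries[uri] = CachedTimeMap(list(mementos), today)
            return True
        # Assumption: a rejected (shrunken) TimeMap still restarts the TTL,
        # so the archives are not re-queried on every request.
        entry.fetched_day = today
        return False


def aggregate(uri: str, ranked_archives: List[str], fetch: Fetch,
              k: int = TOP_K) -> List[str]:
    """Query only the first k archives and merge results without duplicates."""
    merged: List[str] = []
    seen = set()
    for archive in ranked_archives[:k]:
        for memento in fetch(archive, uri):
            if memento not in seen:
                seen.add(memento)
                merged.append(memento)
    return merged


def lookup(uri: str, today: int, cache: TimeMapCache,
           ranked_archives: List[str], fetch: Fetch,
           k: int = TOP_K) -> List[str]:
    """Serve from cache while fresh; otherwise re-aggregate from top-k archives."""
    if not cache.is_fresh(uri, today):
        cache.store(uri, aggregate(uri, ranked_archives, fetch, k), today)
    return cache.entries[uri].mementos
```

With this sketch, a request inside the 15-day window never contacts an archive, and a refetch that returns fewer mementos than the cached copy leaves the cached copy in place. How `ranked_archives` is ordered (for example, by the per-TLD coverage profile on slides 16 and 17) is left open here.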
