An Evaluation of Caching Policies for Memento TimeMaps

JCDL2013 presentation by Justin F. Brunelle

  • Mention here that TimeMaps are cached to avoid overburdening the archives and to improve response time for users.
  • CNN.com: 15878 Google.com: 27540
  • For the two extremes, note the expense. Never caching means we have to request TimeMaps from the archives and rebuild the aggregate TimeMap on every HTTP GET to the aggregator TimeGate (so it will only operate as fast as the slowest responding archive). However, this gives us the freshest results. Never replacing the cached copy means we might have the stalest results, but it saves load on the archives. This method also has the highest potential to cache transient errors forever.
  • |a| is the number of contributing archives; |m| is the number of unique mementos listed in the TimeMap. For simplicity, and for the time being, let’s refer to a unique memento as a single observation of a URI-R at a point in time.
  • Vast majority of timemaps don’t change. When they do, they often change because an archive isn’t reporting its mementos. Rarely does something “strange” happen.
  • Explain transient error vs. not archived
  • MemDays is needed because a TimeMap that misses 2 mementos per day for 10 days is not the same as one that misses 100 per day, and is better than one that misses 1,000 once.
  • The optimal TTL is at the intersection of the MemDays curve and the misses curve (requests to archives).
  • 3 months probably isn’t enough data. The Memento landscape has also changed: there are new archives, the IA publishes mementos with increased frequency, and more archives are Memento-compliant. This makes the study worth repeating.
  • Pages change over time

Presentation Transcript

  • An Evaluation of Caching Policies for Memento TimeMaps Justin F. Brunelle and Michael L. Nelson Old Dominion University {jbrunelle, mln}@cs.odu.edu JCDL 2013 Indianapolis, Indiana 07/2013
  • Discovering Archived nasa.gov Pages • Archived pages => mementos • Mementos identified by URI-M • Live pages => resources • Resources identified by URI-R
  • TimeMaps: Lists of mementos
    <http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
    <http://www.nasa.gov/>;rel="original",
    <http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
    <http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
    <http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
    <http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
    <http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
    <http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
    <http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
    <http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
    <http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
    <http://api.wayback.archive.org/memento/20080903053412/http://www.nasa.gov/>;rel="memento";datetime="Wed, 03 Sep 2008 05:34:12 GMT",
    <http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 00:00:00 GMT",
    <http://api.wayback.archive.org/memento/20080904055742/http://www.nasa.gov/>;rel="memento";datetime="Thu, 04 Sep 2008 05:57:42 GMT",
    <http://webarchive.nationalarchives.gov.uk/20080906134025/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 00:00:00 GMT",
    <http://api.wayback.archive.org/memento/20080906143204/http://www.nasa.gov/>;rel="memento";datetime="Sat, 06 Sep 2008 14:32:04 GMT",
    <http://webarchive.nationalarchives.gov.uk/20080907124040/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 00:00:00 GMT",
    <http://api.wayback.archive.org/memento/20080907160232/http://www.nasa.gov/>;rel="memento";datetime="Sun, 07 Sep 2008 16:02:32 GMT",
    <http://webarchive.nationalarchives.gov.uk/20120809003120/http://www.nasa.gov/>;rel="memento";datetime="Thu, 09 Aug 2012 00:00:00 GMT",
    <http://webarchive.nationalarchives.gov.uk/20120814175606/http://www.nasa.gov/>;rel="memento";datetime="Tue, 14 Aug 2012 00:00:00 GMT",
    <http://webarchive.nationalarchives.gov.uk/20120819212348/http://www.nasa.gov/>;rel="memento";datetime="Sun, 19 Aug 2012 00:00:00 GMT",
    <http://webarchive.nationalarchives.gov.uk/20120826185010/http://www.nasa.gov/>;rel="memento";datetime="Sun, 26 Aug 2012 00:00:00 GMT",
    <http://webarchive.nationalarchives.gov.uk/20120909230516/http://www.nasa.gov/>;rel="last memento";datetime="Sun, 09 Sep 2012 00:00:00 GMT"
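The listing above is in link format, so it can be read mechanically. The following is a minimal sketch of a parser, assuming every entry has the shape `<URI>;rel="...";datetime="..."` with the datetime optional. The names `ENTRY`, `parse_timemap`, `mementos`, and `SAMPLE` are illustrative and do not come from the presentation.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI>;rel="...";datetime="..." (datetime is optional).
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?')

def parse_timemap(text):
    """Return a list of (uri, rel_tokens, datetime_or_None) tuples."""
    entries = []
    for uri, rel, dt in ENTRY.findall(text):
        when = parsedate_to_datetime(dt) if dt else None
        entries.append((uri, rel.split(), when))
    return entries

def mementos(entries):
    """Keep only entries whose rel value contains the token 'memento'."""
    return [entry for entry in entries if "memento" in entry[1]]

# Three entries copied from the slide (hypothetical variable name).
SAMPLE = (
    '<http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate", '
    '<http://www.nasa.gov/>;rel="original", '
    '<http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>'
    ';rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT"'
)

entries = parse_timemap(SAMPLE)
```

Splitting `rel` on whitespace lets values such as "first memento" and "last memento" count as mementos while "timegate" and "original" are skipped.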
  • Aggregating TimeMaps • Multiple archives • Expensive • Caching reduces load on archives • Write-through cache • [Diagram labels: Aggregator, Sort, IA TM, AIT TM, …, HTTP Cache]
  • Aggregator Cache • TimeMaps change • Only want to cache better TimeMaps – Bigger is better • Ideally monotonically increasing • Two extremes: – Never cache (TTL=0) – Never update in cache (TTL=92)
  • Agenda
  • Cache content measures • |a| => # of archives • |m| => # of mementos • Example entry: <http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
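As an illustration of the two measures, here is a minimal sketch that derives |a| and |m| from a list of URI-Ms. It assumes an archive can be identified by the host name of the URI-M, which the slides do not state. The helper names `archive_of` and `measures` are hypothetical.

```python
from urllib.parse import urlsplit

def archive_of(uri_m):
    """Assumption: the host of a URI-M identifies the contributing archive."""
    return urlsplit(uri_m).netloc.lower()

def measures(uri_ms):
    """Return (|a|, |m|): number of archives and number of unique mementos."""
    unique = set(uri_ms)
    archives = {archive_of(uri) for uri in unique}
    return len(archives), len(unique)

# URI-Ms taken from the nasa.gov TimeMap in the slides.
uri_ms = [
    "http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/",
    "http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/",
    "http://webarchive.nationalarchives.gov.uk/20080904014810/http://www.nasa.gov/",
]
a, m = measures(uri_ms)
```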
  • Same TimeMap • |a| == |a'| • |m| == |m'| • All archives have reported the same mementos. • TimeMap T: |a| = 2; |m| = 3 • TimeMap T': |a| = 2; |m| = 3
  • Gained Archives, Gained Mementos • |a| < |a'| • |m| < |m'| • A new archive (WebCite) has just indexed and reported a memento for the first time. • TimeMap T: |a| = 2; |m| = 3 • TimeMap T': |a| = 3; |m| = 4
  • Same Archives, Gained Mementos • |a| == |a'| • |m| < |m'| • The Internet Archive has released a set of new mementos. • TimeMap T: |a| = 2; |m| = 3 • TimeMap T': |a| = 2; |m| = 4
  • Lost Archives, Same Mementos • |a| > |a'| • |m| == |m'| • A redaction of 1 memento took place in the Internet Archive which now does not report mementos for this resource. The UK Web Archive has released 1 new memento for this resource. • TimeMap T: |a| = 3; |m| = 3 • TimeMap T': |a| = 2; |m| = 3
  • Lost Archives, Gained Mementos • |a| > |a'| • |m| < |m'| • A redaction of 2 mementos took place in the Internet Archive which now does not report mementos for this resource. The UK Government Web Archive has released 3 new mementos for this resource. • TimeMap T: |a| = 2; |m| = 3 • TimeMap T': |a| = 1; |m| = 4
  • Lost Archives, Lost Mementos • |a| > |a'| • |m| > |m'| • Archive-It has removed a collection, and no longer reports those mementos. No other archives have new mementos of those resources. • TimeMap T: |a| = 2; |m| = 3 • TimeMap T': |a| = 1; |m| = 1
  • Gained Archives, Lost Mementos • |a| < |a'| • |m| > |m'| • A new archive (WebCite) has just indexed and reported 1 memento for the first time. A server error at the Internet Archive caused an omission of 2 mementos. • TimeMap T: |a| = 2; |m| = 4 • TimeMap T': |a| = 3; |m| = 3
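The change categories on the preceding slides can be expressed as a comparison of the two measures before and after. This is a minimal sketch using only the counts. It treats equal counts as "Same TimeMap", which is an approximation, since two TimeMaps could have the same counts but different mementos. The function names are illustrative.

```python
def _trend(before, after):
    """Describe how a count changed between TimeMap T and TimeMap T'."""
    if after > before:
        return "gained"
    if after < before:
        return "lost"
    return "same"

def classify(a, m, a_new, m_new):
    """Label the change from T (a, m) to T' (a_new, m_new)."""
    if (a, m) == (a_new, m_new):
        # Approximation: counts match, contents are assumed to match too.
        return "Same TimeMap"
    label = f"{_trend(a, a_new)} archives, {_trend(m, m_new)} mementos"
    return label.capitalize()

# Counts from the slide examples above.
examples = {
    (2, 3, 3, 4): classify(2, 3, 3, 4),
    (3, 3, 2, 3): classify(3, 3, 2, 3),
    (2, 4, 3, 3): classify(2, 4, 3, 3),
}
```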
  • Experiment Design • Eliminate caching from local Memento proxies • Daily observations of 4,000 TimeMaps for 92 days in 2013 • TimeMaps analyzed for changes & cardinality • Investigated caching policies • Outages observed from Memento/archives/department
  • Observations
    Occurrence   Description                        Action
    77.4%        Unchanged TimeMap                  Do not update cache
    19.7%        Lost archives, lost mementos       Do not update cache
    2.4%         Gained archives, gained mementos   Update cache
    0.4%         Same archives, gained mementos     Update cache
    0.1%         Gained archives, lost mementos     Do not update cache
    0.01%        Lost archives, same mementos       Update cache
    0.01%        Lost archives, gained mementos     Update cache
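A short arithmetic check on the table, as a sketch: the rows are copied from the slide, while the variable names and the choice of which rows to group are assumptions. Summing the rows in which neither archives nor mementos are lost happens to give the 80.2% quoted in the conclusion. The slides do not say which rows the authors combined.

```python
# (occurrence in percent, description, update cache?) copied from the slide.
OBSERVATIONS = [
    (77.4, "Unchanged TimeMap", False),
    (19.7, "Lost archives, lost mementos", False),
    (2.4, "Gained archives, gained mementos", True),
    (0.4, "Same archives, gained mementos", True),
    (0.1, "Gained archives, lost mementos", False),
    (0.01, "Lost archives, same mementos", True),
    (0.01, "Lost archives, gained mementos", True),
]

# Share of daily observations that would trigger a cache update.
update_share = round(sum(pct for pct, _, update in OBSERVATIONS if update), 2)

# Assumed grouping: rows where nothing is lost.
NO_LOSS = {
    "Unchanged TimeMap",
    "Gained archives, gained mementos",
    "Same archives, gained mementos",
}
no_loss_share = round(sum(pct for pct, desc, _ in OBSERVATIONS if desc in NO_LOSS), 2)
```

Under the table's actions, fewer than 3% of observations lead to a replacement, which is why a conditional policy rarely needs to touch the cache.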
  • Impact of Change in TimeMaps • Caching transient errors – Not returned or not archived?
  • Cardinality of TimeMaps
    <http://mementoproxy.lanl.gov/aggr/timegate/http://www.nasa.gov/>;rel="timegate",
    <http://www.nasa.gov/>;rel="original",
    <http://api.wayback.archive.org/memento/19961231235847/http://www.nasa.gov/>;rel="first memento";datetime="Tue, 31 Dec 1996 23:58:47 GMT",
    <http://api.wayback.archive.org/memento/19970605230559/http://www.nasa.gov/>;rel="memento";datetime="Thu, 05 Jun 1997 23:05:59 GMT",
    <http://api.wayback.archive.org/memento/19970711094601/http://www.nasa.gov/>;rel="memento";datetime="Fri, 11 Jul 1997 09:46:01 GMT",
    <http://api.wayback.archive.org/memento/19981202170636/http://www.nasa.gov/>;rel="memento";datetime="Wed, 02 Dec 1998 17:06:36 GMT",
    <http://api.wayback.archive.org/memento/19981212031235/http://www.nasa.gov/>;rel="memento";datetime="Sat, 12 Dec 1998 03:12:35 GMT",
    <http://api.wayback.archive.org/memento/19990116233500/http://nasa.gov/>;rel="memento";datetime="Sat, 16 Jan 1999 23:35:00 GMT",
    <http://api.wayback.archive.org/memento/19990117063022/http://nasa.gov/>;rel="memento";datetime="Sun, 17 Jan 1999 06:30:22 GMT",
    <http://api.wayback.archive.org/memento/19990125091025/http://nasa.gov/>;rel="memento";datetime="Mon, 25 Jan 1999 09:10:25 GMT",
    <http://api.wayback.archive.org/memento/19990203005545/http://nasa.gov/>;rel="memento";datetime="Wed, 03 Feb 1999 00:55:45 GMT",
    …
    |TM| ?
  • Strict vs. Loose Matching
    • Different archive, URI-M, datetime. Strict: 2, Loose: 2
      <http://api.wayback.archive.org/memento/20080509125659/http://flare.prefuse.org/>;rel="memento"; datetime="Fri, 09 May 2008 12:56:59 GMT",
      <http://webarchive.nationalarchives.gov.uk/20080908074106/http://flare.prefuse.org/>;rel="memento"; datetime="Mon, 08 Sep 2008 00:00:00 GMT",
    • Same archive, datetime, different URI-M. Strict: 3, Loose: 1
      <http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/>;rel="memento"; datetime="Mon, 01 Nov 2010 06:02:04 GMT",
      <http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/>;rel="memento"; datetime="Mon, 01 Nov 2010 06:02:04 GMT",
      <http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/>;rel="memento"; datetime="Mon, 01 Nov 2010 06:02:04 GMT",
    • Same archive, different URI-M, bad datetime. Strict: 2, Loose: 2
      <http://wayback.archive-it.org/2342/20110321192906/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
      <http://wayback.archive-it.org/2354/20110321035356/http://www.apple.com/iphone/find-my-iphone-setup/>...datetime="Mon, 21 Mar 2011 00:00:00 GMT"
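The slide gives the strict and loose counts but not the matching rules behind them. The sketch below is one reading that reproduces all three examples: strict matching counts distinct URI-Ms, and loose matching counts distinct pairs of archive host and the 14-digit timestamp embedded in the URI-M. Both rules are assumptions, as are the names `strict_key`, `loose_key`, and `cardinality`.

```python
import re
from urllib.parse import urlsplit

# Assumption: URI-Ms embed a 14-digit capture timestamp as a path segment.
TIMESTAMP = re.compile(r"/(\d{14})/")

def strict_key(uri_m):
    """Strict (assumed): every distinct URI-M is its own memento."""
    return uri_m

def loose_key(uri_m):
    """Loose (assumed): same archive host and same capture timestamp match."""
    host = urlsplit(uri_m).netloc.lower()
    found = TIMESTAMP.search(uri_m)
    return (host, found.group(1) if found else uri_m)

def cardinality(uri_ms, key):
    """Number of distinct mementos under the given matching rule."""
    return len({key(uri) for uri in uri_ms})

# Second example from the slide: same archive and datetime, different URI-M.
aarp = [
    "http://web.archive.org/web/20101101060204/http://aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/Health/",
    "http://web.archive.org/web/20101101060204/http://www.aarp.org:80/health/",
]
strict_count = cardinality(aarp, strict_key)
loose_count = cardinality(aarp, loose_key)
```

Keying on the embedded timestamp rather than the reported `datetime` value is what keeps the two Archive-It captures in the third example apart, since both report the same midnight datetime.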
  • Strict vs. Loose: translate.google.com
  • Testing • TTLs [0, 92] – 0: Thrashed cache, best freshness – 92: First TimeMap cached, no replacement • Policies – Unconditional • Cardinality ignored – Conditional • Replacements occur when cardinality is better
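The two policies can be sketched as a small cache keyed by URI-R. This is not the aggregator's implementation: the class and method names are hypothetical, TimeMaps are reduced to their cardinality, "better" is taken to mean strictly larger, and restarting the TTL clock when a fetched TimeMap is rejected is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    cardinality: int  # |m| of the cached TimeMap
    fetched_day: int  # day on which the archives were last queried

class TimeMapCache:
    def __init__(self, ttl_days, conditional):
        self.ttl_days = ttl_days
        self.conditional = conditional
        self.entries = {}
        self.queries = 0  # number of times the archives were asked

    def get(self, uri_r, day, fetch):
        """Return the cardinality served for uri_r on the given day."""
        entry = self.entries.get(uri_r)
        if entry is not None and day - entry.fetched_day < self.ttl_days:
            return entry.cardinality  # cached copy is still within its TTL
        self.queries += 1
        fresh = fetch(uri_r, day)
        if entry is None or not self.conditional or fresh > entry.cardinality:
            entry = Entry(fresh, day)  # replace the cached TimeMap
        else:
            entry.fetched_day = day  # keep the larger copy, restart the TTL
        self.entries[uri_r] = entry
        return entry.cardinality

# Hypothetical daily cardinalities; day 1 simulates a transient error.
observed = [10, 4, 12]
fetch = lambda uri_r, day: observed[day]

conditional = TimeMapCache(ttl_days=1, conditional=True)
served = [conditional.get("http://www.nasa.gov/", d, fetch) for d in range(3)]
```

With `ttl_days=0` the freshness check never succeeds, so every request reaches the archives, which corresponds to the thrashed-cache extreme on the slide.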
  • Evaluation • Minimize cost values: – Q: queries to the archives – MemDays: number of missed mementos/day • Calculated MemDays: mementos missed/day • [Plot: MemDays and Q, from TTL: 0 to TTL: ∞]
  • MemDays • [Figure labels: 6; |TM| = 10; MemDay = 8]
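To make the two cost values concrete, the sketch below replays a series of daily cardinalities against a single cached TimeMap and reports Q and MemDays. The slides do not give the exact formula, so this assumes that a day's missed mementos are the difference between the largest cardinality observed so far and the cardinality served that day. The function name and the sample series are hypothetical.

```python
def simulate(daily, ttl, conditional):
    """Return (Q, MemDays) for one TimeMap.

    daily[d] is the cardinality a fresh aggregation would return on day d.
    """
    queries = 0
    memdays = 0
    cached = None      # cardinality of the cached TimeMap
    cached_day = None  # day of the last query to the archives
    best = 0           # largest cardinality observed so far (assumed reference)
    for day, fresh in enumerate(daily):
        best = max(best, fresh)
        if cached is None or day - cached_day >= ttl:
            queries += 1
            if cached is None or not conditional or fresh > cached:
                cached = fresh
            cached_day = day
        memdays += best - cached  # mementos missed on this day
    return queries, memdays

# Hypothetical series: a transient loss on day 2, new mementos from day 3.
daily = [10, 10, 4, 12, 12, 12]
results = {
    (ttl, policy): simulate(daily, ttl, policy)
    for ttl in (0, 2, 3, 6)
    for policy in (False, True)
}
```

In this toy series a short TTL costs more queries, a long TTL accumulates MemDays, and the conditional policy avoids caching the transient loss, which is the trade-off the evaluation slides describe.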
  • Optimal TTL • Unconditional: Optimal TTL = 9 • Conditional: Optimal TTL = 15
  • Conclusion & Future Work • 3-month observation of 4,000 TimeMaps • Change patterns studied – 80.2% of TimeMaps monotonically increase – Others decrease • Optimal TTL = 15 days • Cache Improvements: – Saves requests to the archives • Worth reinvestigating – Changed Memento landscape
  • Backups
  • www.nasa.gov 1996 - 2012
  • Memento: integrates the past and present web • [Figure labels: Now, Always Current, 2008, 2006, 2001, 2008, 2010]
  • Cardinality • Size of a TimeMap – # Archives? – # Date times? • TimeMaps: • Cardinality: • Monotonic Increase: