
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch


RedisConf17 breakout session

Published in: Technology


  1. Preventing cache stampede with Redis & XFetch. Jim Nelson <jnelson@archive.org>, Internet Archive. RedisConf 2017.
  2. Internet Archive: Universal Access to All Knowledge. Founded 1996, based in San Francisco. Archive of digital and physical media; includes Web, books, music, film, software & more. Digital holdings: over 30 petabytes & counting. Key collections & services: Wayback Machine, Grateful Dead live concert collection.
  3. Internet Archive ♡ Redis. Caching & other services are backed by a 10-node sharded Redis cluster. Sharding is performed client-side via consistent hashing (PHP, Predis). Each node is supported by two replicated mirrors (fail-over). Specialized Redis instances are also used throughout IA’s services, including Wayback, search, and more.
  4. Caching: Quick terminology. I assume we all know what caching is. This is the terminology I’ll use today. Recompute: an expensive operation whose result is cached (database query, file system read, HTTP request to remote service). Expiration: when a cache value is considered stale or out-of-date (time-to-live). Evict: removing a value from the cache (to forcibly invalidate a value prior to expiry).
  5. Cache stampede
  6. Cache stampede. “A cache stampede is a type of cascading failure that can occur when massively parallel computing systems with caching mechanisms come under very high load. This behaviour is sometimes also called dog-piling.” (Wikipedia, https://en.wikipedia.org/wiki/Cache_stampede)
  7. Cache stampede: A scenario. Multiple servers, each with multiple workers serving requests, access a common cached value. When the cached value expires or is evicted, all workers experience a simultaneous cache miss. Workers recompute the missing value, causing overload of primary data sources (e.g. database) and/or hung requests.
  8. Congestion collapse. Hung workers due to network congestion or expensive recomputes: that’s bad. Discarded user requests: that’s bad. Overloaded primary data stores (“Sources of Truth”): that’s bad. Harmonics (peaks & valleys), i.e. brief periods of intense activity (mini-outages) followed by lulls: that’s bad. Imagine a cached value with a TTL of 1hr enjoying 10,000 hits/sec: that’s good. Now imagine, at 1hr+1sec, 10,000 cache misses: that’s bad.
  9. Typical cache code:

         function fetch(name)
             var data = redis.get(name)
             if (!data)
                 data = recompute(name)
                 redis.set(name, expires, data)
             return data

     This “looks” fine, but consider tens of thousands of simultaneous workers calling this code at once: there is no mutual exclusion and no upper bound on simultaneous recomputes or writes. That’s a cache stampede.
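The effect described on this slide can be reproduced in a few lines. The following is a minimal Python sketch, not the speaker's code: a plain dictionary stands in for Redis, threads stand in for workers, and the names `run_workers`, `recompute_calls`, and the key `"home"` are hypothetical. It only illustrates that, with no coordination, several workers can miss at the same moment and each run the expensive recompute.

```python
import threading
import time

cache = {}                      # in-memory stand-in for Redis (assumption)
recompute_calls = 0             # how many times the "expensive" source was hit
counter_lock = threading.Lock()


def recompute(name):
    """Simulate an expensive read from the primary data source."""
    global recompute_calls
    with counter_lock:
        recompute_calls += 1
    time.sleep(0.2)             # pretend the source is slow
    return "value-for-" + name


def fetch(name):
    """Naive read-through cache: no mutual exclusion around the recompute."""
    data = cache.get(name)
    if data is None:
        data = recompute(name)
        cache[name] = data
    return data


def run_workers(n):
    """Start n workers on a cold cache; return how many recomputes ran."""
    global recompute_calls
    cache.clear()
    recompute_calls = 0
    threads = [threading.Thread(target=fetch, args=("home",)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return recompute_calls


if __name__ == "__main__":
    print("recomputes for 8 workers:", run_workers(8))
```

On a typical run most or all of the workers recompute, because each one reads the empty cache before any of them has written a value back.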
  10. Typical stampede solutions (a): Locking. One worker acquires the lock, recomputes, and writes the value to the cache. Other workers wait for the lock to be released, then retry the cache read. The primary data source is not overloaded by requests. Redis is often used as a cluster-wide distributed lock: https://redis.io/topics/distlock
  11. Problems with locking. Introduces extra reads and writes into the code path. Starvation: expiration / eviction can lead to blocked workers waiting for a single worker to finish its recompute. Distributed locks may be abandoned.
  12. Typical stampede solutions (b): External recompute. Use a separate process / independent worker to recompute the value. Workers never recompute. (Alternately, workers recompute as a fall-back when the external process fails.)
  13. Problems with external recompute. One more “moving part”: a daemon, a cron job, work stealing. Requires a fall-back scheme if the external recompute fails to run. External recomputation is often not easily deterministic, e.g. caching based on a wide variety of user input, or periodic external recomputation of 1,000,000 user records. External recomputation may be inefficient if cached values are never read by
  14. XFetch (Probabilistic early recomputation)
  15. Probabilistic early recomputation (PER). Recompute cache values before they expire. Before expiration, one worker “volunteers” to recompute the value. Without evicting the old value, the volunteer performs the expensive recompute; other workers continue reading the cache. Before expiration, the volunteer writes the new cache value and extends its time-to-live. Under ideal conditions, there are no cache misses.
  16. XFetch. Full paper title: “Optimal Probabilistic Cache Stampede Prevention”. Authors: Andrea Vattani (Goodreads), Flavio Chierichetti (Sapienza University), Keegan Lowenstein (Bugsnag). Archived at IA: https://archive.org/details/xfetch
  17. The algorithm. XFetch (“exponential fetch”) is elegant: delta * beta * log_e(rand()), where delta is the time to recompute the value; beta is a control (default: 1.0; > 1.0 favors earlier recomputation, < 1.0 favors later); rand is a random number in [ 0.0 … 1.0 ]. Remember: the natural log of a number between 0 and 1 is negative (log(1) is 0, and log(0) is undefined), so XFetch produces a negative value.
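As a rough illustration of the formula on this slide, the Python sketch below computes the negated term, i.e. how far ahead of "now" a worker looks on a given roll. The function name `xfetch_gap` is hypothetical. One detail is an assumption added here, not something the slide states: the random draw is taken from (0, 1] instead of [0, 1) so that `log(0)` can never be evaluated.

```python
import math
import random


def xfetch_gap(delta, beta=1.0, rand=random.random):
    """Return -delta * beta * ln(u): a non-negative look-ahead in seconds.

    delta: time the last recompute took, in seconds.
    beta:  tuning knob; > 1.0 favors earlier recomputation, < 1.0 later.
    rand:  callable returning a float in [0.0, 1.0).
    """
    u = 1.0 - rand()            # shift [0, 1) to (0, 1] to avoid log(0)
    return -delta * beta * math.log(u)


if __name__ == "__main__":
    random.seed(17)
    samples = [xfetch_gap(delta=2.0) for _ in range(100_000)]
    print("smallest gap:", min(samples))
    print("average gap: ", sum(samples) / len(samples))
```

Because -ln(u) for uniform u follows an exponential distribution with mean 1, the average look-ahead is close to delta * beta: most rolls look only slightly ahead, and a few look much further, which is what spreads the early recomputes out over time.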
  18. Updated code:

          function fetch(name)
              var data, delta, ttl = redis.get(name, delta, ttl)
              if (!data or xfetch(delta, time() + ttl))
                  data, recompute_time = recompute(name)
                  redis.set(name, expires, data)
                  redis.set(delta, expires, recompute_time)
              return data

          function xfetch(delta, expiry)
              /* The XFetch term is negative, so subtracting it adds to time() */
              return time() - (delta * BETA * log(rand(0,1))) >= expiry
  19. Can more than one volunteer recompute? Yes. You should know this before using XFetch. It’s possible for more than one worker to “roll” the magic number and start a recompute. The odds of this occurring increase as the expiration deadline approaches. If your data source absolutely cannot be accessed by multiple workers, use a lock or another sentinel; XFetch will minimize lock contention.
  20. How to determine delta? XFetch must be supplied with the time required to recompute. The easiest approach is to store the duration of the last recompute and read it with the cached value.
  21. What’s the deal with the beta value? beta is the one knob you have to tweak XFetch. beta > 1.0 favors earlier recomputation, < 1.0 favors later recomputation. My suggestion: start with the default (1.0), instrument your code, and change only if necessary.
  22. XFetch & Redis. Let’s look at some sample code.
  23. Questions?
  24. Redis & XFetch. Jim Nelson <jnelson@archive.org>, Internet Archive. RedisConf 2017.
