RedisConf18 - Techniques for Synchronizing In-Memory Caches with Redis


Breakout Sessions

Published in: Technology

  1. Techniques for Synchronizing In-Memory Caches with Redis
     Ben Malec, Paylocity
  2. Introduction and a Bit of Background
  3. Simple Web Farm
     Diagram: Web Server #1, Web Server #2, Web Server #3, Web Server #4, SQL database
  4. Cacheability Quadrants
     Axes: data access frequency, data change frequency
     • Rarely changed, rarely accessed
     • Often changing, often accessed
     • Rarely changing, often accessed
     • Often changing, rarely accessed
  5. Web Farm with In-Process Caches
     Diagram: Web Servers #1 through #4, each with its own in-process cache, and a SQL database
  6. Cache Data Lag
     Data stored in the in-process caches lags behind the source of truth.
     • Bad user experience
     • Usual solution is to shorten cache expiration times, but it’s just a trade-off
     • Doesn’t eliminate the problem, only reduces the length of the lag
     • Shorter cache times mean more database hits
     • What would be great is push notification from the source of truth
     • But it’s not straightforward to implement push notifications from a SQL backend
     • Easy to flood the network with sync traffic
  7. Cross-Server Cache Data Inconsistency
     Over time the absolute expiration time of any specific key falls out of sync across servers, meaning the data changes depending on which box serves the request.
     • Even worse user experience
     • Another non-trivial issue to solve
     • Need to implement two-way communication between all nodes
     • Difficult to resolve who wins when multiple nodes update the same value at the same time
     • Easy to flood the network with sync messages
  8. Cache Stampede
     If every server has its own copy of cached data, then every server needs to refresh it, too.
     • Bad at process start-up, or if multiple servers have close expiration times
     • Really bad during pooled deploys when an entire pool comes up
     • Problem continues to grow as the number of servers increases
     • Easy to overload the back end with requests (are you noticing a trend here?)
  9. One Solution: Shared Redis Cache
     Diagram: Web Server #1, Web Server #2, Web Server #3, Web Server #4, SQL database
  10. Shared Redis Cache
      Classic Redis use case with lots of advantages:
      • Solves the data consistency problem completely
      • Reduces cache data lag with a write-through cache implementation (but watch out for DBAs with ad-hoc scripts!)
      • No more cache stampede at the database (only need to update Redis once, regardless of how many clients there are)
      But…
      • Now we have a TCP roundtrip per cache access
      • While Redis is incredibly fast, local RAM access is still many times faster than network I/O plus a deserialization step
  11. Kevin Says, “What if…”
      Diagram: Web Servers #1 through #4, each with its own in-process cache, and a SQL database
  12. Oh, and By the Way…
      1. Solve the data consistency issue
      2. Eliminate any data lag between the in-process caches and Redis
      3. Don’t blow up the network!
      We haven’t been able to solve these problems effectively in the past, but what about now that Redis is part of the infrastructure?
  13. Slowly the Pieces Fall Into Place…
      First piece to fall into place is Redis Pub/Sub for inter-server communication.
      • Trivial to implement
      • Just plain works
      • Can be used both to synchronize nodes (maintain consistency) and to push changes to the clients (minimize lag)
      • But what data to send without saturating the network?
  14. Solving the Data Consistency Problem
      Approach #1: Broadcast all data changes to all nodes
      • Yes, you will blow up the network
      • Low efficiency: all nodes receive changed data even if they’ll never use it
      • Lots of challenges around ensuring all nodes have identical data when multiple nodes update the same key at near-identical times
  15. Solving the Data Consistency Problem
      Approach #2: Broadcast the key that changed to all nodes
      • Less network traffic than sending key/value pairs
      • Solves the consistency issue of broadcasting values because we’re just telling nodes to hit Redis the next time the key is accessed
      • But Redis’ flexibility works against us here: keys can be up to 512MB, opening up the possibility of blowing up the network just broadcasting keys
  16. Solving the Data Consistency Problem
      Approach #3: Instead of broadcasting keys to all nodes, why not partition keys into 16,384 buckets and just broadcast the 16-bit bucket ID?
      • Inspired by the Redis Cluster hash slot implementation
      • Short, fixed-size synchronization messages, regardless of key size
      • No value synchronization issues: just tell each client to hit Redis the next time a key matching the hash slot is requested
      • Now, just need to implement it ;-)
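
The bucket idea on slide 16 can be sketched in a few lines of Python. This is only an illustration: it assumes the same function Redis Cluster uses for key hash slots (CRC16/XMODEM modulo 16,384), which the slides name as the inspiration but do not confirm as the exact function used. The names `hash_slot`, `encode_sync_message`, and `decode_sync_message` are hypothetical, and Redis Cluster hash tags (`{...}`) are deliberately ignored.

```python
import binascii
import struct

SLOT_COUNT = 16384  # same number of slots as Redis Cluster


def hash_slot(key: str) -> int:
    """Map a key of any length to one of 16,384 buckets."""
    # binascii.crc_hqx with an initial value of 0 computes CRC16/XMODEM,
    # the CRC variant Redis Cluster uses for key hash slots.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


def encode_sync_message(slot: int) -> bytes:
    """Pack a slot ID into a fixed 2-byte (16-bit, big-endian) payload."""
    return struct.pack(">H", slot)


def decode_sync_message(payload: bytes) -> int:
    """Recover the slot ID from a 2-byte sync payload."""
    return struct.unpack(">H", payload)[0]


slot = hash_slot("App:Employee:1736")
payload = encode_sync_message(slot)
print(slot, len(payload))  # the payload is always 2 bytes long
```

Whatever the key size, the sync message stays at two bytes, which is the property the slide is after.
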
  17. Implementation Challenges
      Still, there are a lot of things to solve:
      • Since we’ve grouped keys together by their hash slot, when we need to evict a key we actually will evict all keys sharing the same hash slot
      • An obvious solution would be to evict all values from the local cache whose key falls in the same hash slot
      • But that’s not practical: we would have to scan all cache keys and calculate their hash slots
  18. Paylocity’s Solution
      The approach Paylocity arrived at has three main features:
      • A dictionary of hash slots and the timestamp when a key in that hash slot was last updated (the lastUpdated dictionary)
      • Items written to the in-process cache include the key’s hash slot and the timestamp when the object was written to the in-process cache
      • Whenever a value is updated, a sync message containing the updated hash slot is published via Redis pub/sub
  19. Paylocity’s Solution
      In-Process Cache
      Key                  HashSlot   Timestamp    Value
      App:Employee:1736    14587      150938476    <object>
      App:Employee:2367    1228       163827634    <object>
      App:Employee:3123    9036       180985776    <object>
      App:Employee:4273    1231       179872198    <object>

      lastUpdated Dictionary
      HashSlot   Timestamp
      1227       173658476
      1228       163827634
      1229       163928374
      1230       180028372
      ⋮          ⋮
  20. More Implementation Details
      Additionally, a Redis pub/sub message handler listens for synchronization messages.
      • Whenever a sync message is received, the hash slot entry in the lastUpdated dictionary is updated with the current timestamp
      • When retrieving data from the in-process cache, compare the timestamp in the cache entry with the timestamp in the lastUpdated dictionary. If the lastUpdated timestamp is greater than the cache entry’s timestamp, the entry is out of date and should be discarded.
      Here’s the flow in the end:
  21. Reading a value from the cache (flowchart)
      Steps: Calculate key hash slot; Read entry from in-process cache; Read hash slot timestamp from lastUpdated dictionary; Read value from Redis; Update timestamp in lastUpdated dictionary; Write entry to in-process cache; Return value to client
      Decisions (Yes/No): Does key exist in the in-process cache? Is lastUpdated dictionary timestamp greater than the cache entry timestamp?
  22. Adding a value to the cache (flowchart)
      Steps: Calculate key hash slot; Get current timestamp; Add timestamp to lastUpdated dictionary; Write entry (with timestamp and hash slot) to the in-process cache; Write key/value to Redis; Publish update message to all clients
      Decision (Yes/No): Hash slot exists in the lastUpdated dictionary?
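
The read and add flows on slides 20 to 22 can be sketched as follows. This is a minimal, single-process illustration and not Paylocity's implementation: `FakeRedis` is a hypothetical in-memory stand-in for Redis and its pub/sub channel, `MultilevelCache` and its method names are invented for the example, and the flowchart edges that the transcript does not preserve are filled in as assumptions (the timestamp is added to the lastUpdated dictionary only when the slot is missing, on both paths). It also takes the timestamp before any write, subtracts 1 from it, and skips self-published messages, which slides 25 to 28 go on to explain.

```python
import binascii
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


def hash_slot(key: str) -> int:
    # Assumption: CRC16/XMODEM mod 16384, as in Redis Cluster.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % 16384


@dataclass
class Entry:
    slot: int
    timestamp: int
    value: Any


class FakeRedis:
    """Hypothetical in-memory stand-in for Redis plus one pub/sub channel."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.subscribers: List[Callable[[str, int], None]] = []
        self.reads = 0

    def get(self, key: str) -> Optional[Any]:
        self.reads += 1
        return self.data.get(key)

    def publish(self, sender_id: str, slot: int) -> None:
        for handler in list(self.subscribers):
            handler(sender_id, slot)


class MultilevelCache:
    def __init__(self, redis: FakeRedis,
                 clock: Callable[[], int] = time.monotonic_ns) -> None:
        self.redis = redis
        self.clock = clock  # local clock; timestamps are only compared locally
        self.instance_id = uuid.uuid4().hex
        self.local: Dict[str, Entry] = {}
        self.last_updated: Dict[int, int] = {}
        redis.subscribers.append(self.handle_sync_message)

    def handle_sync_message(self, sender_id: str, slot: int) -> None:
        if sender_id == self.instance_id:
            return  # ignore messages this instance published itself
        self.last_updated[slot] = self.clock()

    def add(self, key: str, value: Any) -> None:
        slot = hash_slot(key)
        timestamp = self.clock() - 1  # taken before any write
        self.last_updated.setdefault(slot, timestamp)  # only if slot is missing
        self.local[key] = Entry(slot, timestamp, value)
        self.redis.data[key] = value            # update Redis first...
        self.redis.publish(self.instance_id, slot)  # ...then notify the others

    def get(self, key: str) -> Optional[Any]:
        slot = hash_slot(key)
        entry = self.local.get(key)
        if entry is not None and self.last_updated.get(slot, 0) <= entry.timestamp:
            return entry.value  # local entry is still current
        timestamp = self.clock() - 1
        value = self.redis.get(key)  # missing or stale: go back to Redis
        if value is None:
            self.local.pop(key, None)
            return None
        self.last_updated.setdefault(slot, timestamp)
        self.local[key] = Entry(slot, timestamp, value)
        return value
```

With two `MultilevelCache` instances sharing one `FakeRedis`, a value added through the first instance is re-read from Redis by the second on its next `get`, and then served from the second instance's in-process cache until another sync message for that hash slot arrives.
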
  23. Still Some Timing Issues Remain
      Most scenarios are solved by broadcasting only keys and using Redis as a single source of truth for the cache, but not all:
      • Trickiest situation occurs when a node receives a key invalidation message just after it writes to Redis. Who actually won?
      • Must prevent a state where a local node believes it has the correct data in its in-process cache, but actually doesn’t
      • Could implement a master clock, but that introduces a bottleneck as well as a single point of failure
      • Could use a distributed lock algorithm like Redlock
  24. Timing Challenges
      Diagram labels: Concurrency, Consistency
  25. Order of Operation
      Instead of distributed locks or a master clock, exploit order of operation.
      • Leverage the fact that Redis is the source of truth for this cache
      • Deceptively simple, high concurrency
      • Update Redis, then publish the sync message
      • No possibility of a client being notified before the Redis value was updated
      • Always grab the current timestamp before writing to Redis, the in-process cache, or the lastUpdated dictionary
      • Eliminates the possibility that we store a timestamp that’s more recent than the actual time we wrote the value
      Still, one corner case exists…
  26. One Last Corner Case…
      Sequence diagram. Participants: app, RedisMultilevelCacheClient, Stopwatch
      Events: Timestamp increments; Add(key, value); GetTimestamp(); HandleSyncMessage(); Update Redis, in-process cache, expiration dictionary; Timestamp increments
  27. …With an Ultimately Simple Solution
      Underlying problem is that we’re using an incrementing timestamp to determine the order in which operations occurred.
      • Can’t measure something smaller than the resolution of the measuring device!
      • Easier to visualize if you imagine the timer resolution to be a minute
      • In practice, not an issue because it would require a Redis write, sync message publish, and sync message handling within ~300 nanoseconds
      • But still, don’t want to leave known issues open
      In the end, just need to subtract 1 from the timestamp obtained at the start of the operation (!!!)
      • This effectively forces the client to immediately re-read the value from Redis in cases where the timer resolution prevents us from actually determining which operation occurred first
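
A small worked example of why subtracting 1 closes the corner case, using the comparison rule from slide 20 (an entry is stale when the lastUpdated timestamp is greater than the entry timestamp). The function name `is_stale` and the tick value are made up for illustration.

```python
def is_stale(entry_timestamp: int, slot_timestamp: int) -> bool:
    """An entry is stale when its hash slot was updated after it was written."""
    return slot_timestamp > entry_timestamp


# Suppose a local Add and an incoming sync message for the same hash slot
# both observe the same timer tick, so their true order is unknown.
tick = 1000

# Storing the raw tick: the tie looks "fresh", so a value another node may
# have overwritten would keep being served from the in-process cache.
print(is_stale(entry_timestamp=tick, slot_timestamp=tick))      # False

# Storing tick - 1: the tie resolves as "stale", forcing a re-read from Redis.
print(is_stale(entry_timestamp=tick - 1, slot_timestamp=tick))  # True
```

The cost is an occasional extra read from Redis when the two events really were ordered in the writer's favor, which is the safe direction to err in.
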
  28. And a Few Final Optimizations
      • Early testing revealed many more hits to Redis than predicted
      • Root cause was that the clients were processing the update pub/sub messages they published themselves, causing them to re-read the value they had just written to Redis
      • Solution was to have the pub/sub handler ignore messages that originated from the same cache provider instance
      • Updating Redis consists of executing two Redis commands, one to update the value and a second to notify other clients of the change
      • But we don’t want to incur two TCP roundtrips
      • Lua scripting to the rescue!
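
Both optimizations on slide 28 can be sketched together. The Lua script below issues `SET` and `PUBLISH` in a single server-side call, and the sync payload carries a sender ID so that a handler can skip its own messages. Everything named here is an assumption for illustration: the channel name `cache-sync`, the 18-byte payload layout, and the helper names are hypothetical, and `client` is expected to be any object exposing `eval(script, numkeys, *keys_and_args)`, as redis-py clients do.

```python
import struct
import uuid

# Lua run by Redis as one unit: write the value, then publish the sync message.
SET_AND_PUBLISH = """
redis.call('SET', KEYS[1], ARGV[1])
return redis.call('PUBLISH', ARGV[2], ARGV[3])
"""

SYNC_CHANNEL = "cache-sync"       # hypothetical channel name
INSTANCE_ID = uuid.uuid4().bytes  # 16 bytes identifying this cache instance


def build_sync_message(slot: int, sender: bytes = INSTANCE_ID) -> bytes:
    """Payload layout (assumed): 2-byte hash slot followed by 16-byte sender ID."""
    return struct.pack(">H", slot) + sender


def is_own_message(payload: bytes, me: bytes = INSTANCE_ID) -> bool:
    """True when the payload was published by this instance and can be ignored."""
    return payload[2:] == me


def set_and_publish(client, key: str, value: bytes, slot: int) -> int:
    """Update the value and notify other clients in one TCP roundtrip."""
    return client.eval(
        SET_AND_PUBLISH,
        1,                         # number of KEYS passed to the script
        key,                       # KEYS[1]
        value,                     # ARGV[1]
        SYNC_CHANNEL,              # ARGV[2]
        build_sync_message(slot),  # ARGV[3]
    )
```

Because the script performs the `SET` before the `PUBLISH`, it also preserves the ordering rule from slide 25: no client can be notified before the Redis value has been updated.
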
  29. Unresolved Concerns and Future Plans
      • Potential for cache thrashing since groups of keys are evicted by hash slot
      • So far no issues at Paylocity with this
      • Could expand the number of hash slots to better separate keys
      • Currently doesn’t handle a lost sync message well
      • Essentially works as an independent in-process cache
      • Leverage Redis Pub/Sub to collect and publish client hit/miss metrics
      • Not too difficult to get this data into the ELK stack
      • Implement XFETCH to optimize cache reload
      • Support more Redis data types!
  30. That’s All, Folks!
      • Example implementation:
      • Questions?
  31. Thank you