ElastiCache/Redis
Part 2
Goal
Improve the stability and performance of our ElastiCache/Redis clusters by
introducing:
● Background context not in the AWS documentation
● Numbers for capacity/performance estimation
● Metrics/signals to pay attention to
● Anti-patterns and solutions not covered in part 1
Sharding
● Redis Cluster itself does NOT use consistent hashing; keys map to fixed
hash slots. This means resizing the cluster is expensive - think carefully
before you do it!
● For estimation, resharding 1TB takes about 30 minutes
● Consistent hashing is, however, available through Redis clients and proxies
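The consistent hashing that clients and proxies offer can be sketched in a few lines of Python. This is an illustrative ring, not any particular client's implementation; node names and the virtual-node count are made up. The point it demonstrates: when a node is added, only roughly 1/N of keys move, instead of the wholesale slot migration a naive modulo scheme (or a Redis Cluster reshard) implies.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring entry clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

keys = [f"user:{i}" for i in range(1000)]
ring3 = HashRing(["node-a", "node-b", "node-c"])
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])
# Count keys that change owner when a 4th node joins - close to 1/4,
# not all of them.
moved = sum(1 for k in keys if ring3.node_for(k) != ring4.node_for(k))
```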
Eviction and memory usage
● Redis uses a timed + lazy expiration policy: every 100ms it randomly
samples some keys and deletes the expired ones, and it also deletes an
expired key when it is fetched.
● mem_fragmentation_ratio = used_memory_rss / used_memory. A ratio < 1
means there is not enough physical memory and the OS is swapping. A higher
ratio means more fragmentation. Try to keep it around 1.03.
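A tiny helper that interprets the ratio. The inputs are the used_memory_rss and used_memory fields from Redis INFO memory; the 1.0 floor follows the bullet above, while the 1.5 "too fragmented" ceiling is an assumed rule of thumb, not an AWS-documented threshold.

```python
def fragmentation_status(used_memory_rss: int, used_memory: int):
    """Classify mem_fragmentation_ratio = used_memory_rss / used_memory.

    The 1.5 upper bound is an assumed rule of thumb, not an official limit.
    """
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        return ratio, "swapping"    # RSS below logical usage: OS paged Redis out
    if ratio > 1.5:
        return ratio, "fragmented"  # consider activedefrag or MEMORY PURGE
    return ratio, "healthy"         # target is ~1.03
```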
Rules of thumb
● Keep string values < 10KB and collection sizes < 5000 elements
● 99th-percentile latency >= 5ms most likely indicates a problem, either
with the cluster or with the code
● Each Redis node handles 20k rps and 20k connections easily
● Redis using < 60% of total memory is generally fine
● A cache hit rate < 70% is often a problem
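The hit rate itself is not a single INFO field; it is derived from the keyspace_hits and keyspace_misses counters in INFO stats. A small helper applying the 70% rule above (function names are ours):

```python
def cache_hit_rate(keyspace_hits: int, keyspace_misses: int) -> float:
    """Hit rate from the keyspace_hits / keyspace_misses counters (INFO stats)."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 1.0  # no traffic yet: treat as fine

def hit_rate_ok(hits: int, misses: int) -> bool:
    """Flag anything under the 70% rule of thumb."""
    return cache_hit_rate(hits, misses) >= 0.70
```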
Remember: Redis is single-threaded
● The KEYS command should rarely, if ever, be used - a careless scan blocks
the whole thread and thus all other requests! Use SCAN instead.
● For the same reason, DEL on huge collections should be avoided. In fact,
use of large collections should be rare and carefully reviewed.
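A sketch of draining a huge set in bounded batches so no single command holds the thread for long. A plain dict stands in for Redis so the snippet runs anywhere; with redis-py the real equivalent is looping over r.sscan_iter(key) and calling r.srem(key, *chunk), or simply r.unlink(key) on Redis >= 4, which frees the memory in a background thread.

```python
def delete_large_set(store: dict, key: str, batch: int = 500) -> int:
    """Delete a big set in batches of `batch` members; returns batch count.

    `store` is a dict standing in for Redis; each chunk models one short,
    bounded SREM call instead of one giant blocking DEL.
    """
    members = store[key]
    batches = 0
    while members:
        chunk = [members.pop() for _ in range(min(batch, len(members)))]
        batches += 1  # in real code: r.srem(key, *chunk)
    del store[key]
    return batches

store = {"big:set": {f"m{i}" for i in range(2000)}}
batches = delete_large_set(store, "big:set")
```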
Why we need cache-aside
If we update the DB and then the cache:
● What if two threads update the same entry concurrently? The cache can end
up holding the older value.
● In write-heavy cases we update the cache more often than it is ever read.
If we delete the cache and then update the DB:
● What if another thread reads between the cache deletion and the new DB
write? It repopulates the cache with the stale value.
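For contrast, the cache-aside pattern itself: writes touch only the DB and invalidate the cache; reads repopulate it on a miss. Dicts stand in for Redis and the DB, and the function names are illustrative.

```python
def get_user(user_id, cache: dict, db: dict):
    """Cache-aside read: check cache, on miss read the DB and populate."""
    val = cache.get(user_id)
    if val is None:
        val = db[user_id]     # authoritative read
        cache[user_id] = val  # populate for subsequent reads (set a TTL in real code)
    return val

def update_user(user_id, value, cache: dict, db: dict):
    """Cache-aside write: update the DB, then invalidate (not update) the cache."""
    db[user_id] = value
    cache.pop(user_id, None)  # next read repopulates from the DB

cache, db = {}, {"u1": 1}
first = get_user("u1", cache, db)     # miss: reads DB, fills cache
update_user("u1", 2, cache, db)       # write invalidates the cached entry
second = get_user("u1", cache, db)    # miss again: sees the new value
```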
Updating a hot key
● The overall goal is to avoid a stampede on the main DB
● Use a separate process just to refresh hot keys. The refresh is triggered
periodically or just before the cached value expires.
● Alternatively, service processes must acquire a mutex before refreshing.
Processes that fail to acquire the mutex retry the fetch instead of hitting
the DB themselves.
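A minimal sketch of the mutex variant using an in-process threading.Lock; across multiple hosts you would use a distributed lock instead (e.g. Redis SET key token NX PX ttl). All names here are illustrative, and the dict cache stands in for Redis.

```python
import threading
import time

cache: dict = {}
lock = threading.Lock()
db_fetches = 0  # counts trips to the backing store

def load_from_db(key):
    """Stand-in for a slow DB query."""
    global db_fetches
    db_fetches += 1
    time.sleep(0.05)
    return f"value-of-{key}"

def get_hot_key(key):
    val = cache.get(key)
    if val is not None:
        return val
    with lock:                # only one thread refreshes at a time
        if key not in cache:  # losers re-check the cache instead of re-fetching
            cache[key] = load_from_db(key)
    return cache[key]

# Ten concurrent readers of a cold hot key -> exactly one DB fetch.
threads = [threading.Thread(target=get_hot_key, args=("hot",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```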
