2. Goal
Understand common ElastiCache problems a platform engineer would receive
from application developers, including
1. Unexpected behavior
2. Performance
3. Cluster stability
4. HA/DR
3. Common problems in the application code
1. Not following the cache-aside pattern - cached data becomes stale
2. Redis call embedded in @Transactional - adds 1 ms to 2 ms of latency to the DB
transaction
3. Writing a huge key (> 1 MB) - causes p95 latency spikes during the MIGRATE call,
which is synchronous
4. Forgetting to cache an empty marker when the DB has no value - can cause
cache penetration
5. No reconciliation logic to defend against data inconsistency
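The cache-aside pattern from item 1 can be sketched as follows. This is a minimal illustration: plain dicts stand in for Redis and the database, and all names are hypothetical.

```python
# Minimal cache-aside sketch: dicts stand in for Redis and the DB.
# In production, `cache` would be a Redis client and `db` a real datastore.

cache = {}                      # stand-in for Redis
db = {"user:1": "alice"}        # stand-in for the database

def read(key):
    """Cache-aside read: check the cache first, fall back to the DB, then populate."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value      # populate the cache on a miss
    return value

def write(key, value):
    """Cache-aside write: update the DB, then invalidate (not update) the cache."""
    db[key] = value
    cache.pop(key, None)        # delete so the next read refetches fresh data
```

Invalidating on write rather than updating in place shrinks the window for stale data, which is the failure mode the note warns about.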
4. Cache penetration
Most traffic still hits the DB, because the requested value does not exist in the cache.
Solution:
1. Update cache with empty value if DB has no such value
2. Put a bloom filter in front of the cache to filter out keys that definitely do
not exist.
3. During load testing, make sure the service is still available even if cache
penetration happens.
5. Cache stampede
A large burst of traffic hits the DB, because a cache server just restarted or many values
expired around the same time
Solution:
1. Add randomized jitter to each key's TTL
2. Make sure service is still available even if cache stampede happens
3. Use sharding to limit the impact of a single instance's failure
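Solution 1 (randomized TTL jitter) can be sketched as below, so keys written together do not all expire at the same instant. The base TTL and jitter fraction are illustrative values.

```python
# Sketch: spread expirations over a window instead of a single instant,
# so a batch of keys written together does not expire simultaneously.
import random

BASE_TTL_SECONDS = 600          # nominal TTL (illustrative)
JITTER_FRACTION = 0.1           # spread expirations over +/-10%

def ttl_with_jitter():
    """Return the base TTL plus a uniform random offset within the jitter band."""
    jitter = BASE_TTL_SECONDS * JITTER_FRACTION
    return BASE_TTL_SECONDS + random.uniform(-jitter, jitter)
```

The returned value would then be passed as the expiry when setting the key (e.g., the `ex` argument of a Redis SET).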
6. Distributed lock
A common use case, but we cannot treat most distributed locks as being as safe as
locks within the same process
1. Shifts in a node’s clock (e.g., from NTP adjustments) may cause Redis to expire the
lock key early
2. Record the lock holder (e.g., a unique token) to avoid unlocking another client’s
lock by mistake
3. The naive SET NX implementation is not reliable when a replica is promoted to
master, since the lock key may not have been replicated yet
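Items 2 and 3 motivate storing a unique holder token with the lock and checking it before release. A minimal sketch, with a dict standing in for Redis; note that in real Redis the compare-then-delete in `release` must be made atomic (typically via a Lua script), which the dict version glosses over.

```python
# Sketch: SET NX semantics with a per-holder token, so only the holder
# can release the lock. `store` stands in for Redis.
import uuid

store = {}                      # stand-in for Redis

def acquire(lock_key):
    """Try to take the lock; returns a holder token on success, None otherwise.
    Mirrors: SET lock_key token NX (plus an expiry in real code)."""
    token = str(uuid.uuid4())   # unique per holder
    if lock_key not in store:
        store[lock_key] = token
        return token
    return None

def release(lock_key, token):
    """Release only if we still hold the lock; prevents unlocking by mistake.
    In real Redis this check-and-delete must be atomic (Lua script)."""
    if store.get(lock_key) == token:
        del store[lock_key]
        return True
    return False
```

A second caller's `acquire` fails while the lock is held, and `release` with the wrong token is a no-op.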
7. Sharding
As with any system whose number of shards is fixed upfront (e.g., Kafka), try to avoid
online resharding. Instead, leave headroom in the shard count during capacity planning
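A quick sketch of why online resharding is painful: with naive modulo sharding, changing the shard count remaps most keys. The hash function and key names are illustrative (Redis Cluster itself uses CRC16 over 16384 hash slots, but the remapping effect is the same).

```python
# Sketch: count how many keys change shard when going from 4 to 6 shards
# under naive hash-mod-N placement.
import hashlib

def shard_of(key, num_shards):
    """Map a key to a shard with a stable hash (md5 here, for illustration)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

keys = [f"user:{i}" for i in range(1000)]
moved = sum(1 for k in keys if shard_of(k, 4) != shard_of(k, 6))
# The large majority of keys typically land on a different shard,
# which is why provisioning shard headroom upfront is preferable.
```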
8. Metrics to monitor
1. In general, Redis is more network/memory bound than CPU bound.
2. Try to maintain a cache hit rate above 70%
3. Use Datadog’s ElastiCache dashboard as the starting point.
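The hit rate in item 2 can be derived from the `keyspace_hits` and `keyspace_misses` counters that Redis reports in the stats section of INFO; the sketch below computes it from a plain dict of those fields (the sample numbers are illustrative).

```python
# Sketch: compute cache hit rate from Redis INFO "stats" counters.

def hit_rate(info_stats):
    """Hit rate = hits / (hits + misses); 0.0 when there is no traffic yet."""
    hits = info_stats["keyspace_hits"]
    misses = info_stats["keyspace_misses"]
    total = hits + misses
    return hits / total if total else 0.0

sample = {"keyspace_hits": 900, "keyspace_misses": 100}
# hit_rate(sample) -> 0.9, comfortably above the 70% target
```

Note these counters are cumulative since server start, so for monitoring you would compute the rate over deltas between scrapes rather than over the raw totals.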