This document discusses Napster's use of Redis for caching and the problems they encountered with Redis failover at scale. As Napster's Redis data grew to tens of gigabytes, failover times increased to over 15 minutes, disrupting their API services. Napster addressed this by implementing connection pooling with ioredis to better handle connections during failovers, configuring Redis for availability rather than consistency, and architecting their systems to gracefully handle temporary Redis outages.
4. The cat is back!
● Rhapsody rebranded as Napster last spring
● Provides on-demand and radio streaming for mobile and desktop apps
● Powers on-demand streaming for apps like iHeartRadio
6. Napster API Snapshot
● API gateway layer
● 1K developers using the API
● 70M requests/day
● 7K Redis ops/sec
7. We LOVE Redis (mostly)
● Fast! Response times <10ms to the Redis cluster, network round trip included.
● Simple: built-in data types translate easily into JS. Replication comes free.
● Available: Redis is mission critical for us. When it’s down, we’re down.
12. So What’s The Problem?
● Redis server and sentinel share the same host
● Four sentinels
   a. An even number means there is a chance of ties if the quorum is 2
● Sending all read traffic to slaves means downtime during failover
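The sentinel layout above can be expressed as a sentinel.conf sketch (the addresses and the master name `mymaster` are illustrative, not Napster's actual config):

```
# sentinel.conf sketch (illustrative addresses)
# With four sentinels and quorum 2, two partitions of two can tie;
# an odd sentinel count with a majority quorum avoids this.
sentinel monitor mymaster 10.0.0.1 6379 2

# Default: wait 30 seconds before declaring the master down.
# Tuning this lower shortens failover but gets trigger-happy on
# network hiccups.
sentinel down-after-milliseconds mymaster 30000
```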
13. Steps in Failover
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated
3. A new slave is elected master
4. New master does a full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
18. Steps in Failover (1GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Total time: ~1.5 minutes
19. Steps in Failover (5GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (40 seconds)
5. Master syncs data to existing slaves (122 seconds)
6. Data is loaded into memory (43 seconds)
7. Slave serves traffic
Total time: ~4 minutes
20. Steps in Failover (20GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (181 seconds)
5. Master syncs data to existing slaves (305 seconds)
6. Data is loaded into memory (238 seconds)
7. Slave serves traffic
Total time: ~12.5 minutes
21. Steps in Failover (40GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (243 seconds)
5. Master syncs data to existing slaves (425 seconds)
6. Data is loaded into memory (354 seconds)
7. Slave serves traffic
Total time: ~18 minutes
24. Investigation
1. What is causing the failover?
2. Why is the data growing so quickly?
27. 1. What’s causing the failover?
1. Out of memory
2. Saturated client connections
3. Gremlins
28. 2. Why is the data growing so quickly?
1. Can you control the growth of data?
2. If you can’t control it, at least monitor it!
3. Think about data in terms of volatile vs. non-volatile
29. 3. How can we be better clients of Redis?
1. Connection pooling!
   a. https://github.com/luin/ioredis
2. Fail fast if the connection is not ready
3. Backoff strategy for retries
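A backoff strategy like the one above can be sketched as a pure function. ioredis calls `retryStrategy` with the attempt count and sleeps for the returned number of milliseconds before reconnecting; the base delay and cap below are assumptions for illustration, not Napster's production values.

```javascript
// Exponential backoff with a cap (illustrative values).
// Returning a number tells ioredis to retry after that many ms;
// returning nothing/undefined would stop retrying entirely.
function retryStrategy(attempt) {
  const baseDelayMs = 50;
  const maxDelayMs = 10000; // keep probing during a long failover
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

console.log(retryStrategy(1));  // 100
console.log(retryStrategy(5));  // 1600
console.log(retryStrategy(12)); // 10000 (capped)
```

The cap matters: without it, a multi-minute failover would push delays so high that the client takes a long time to notice the new master is ready.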
32. Tuning ioredis Config
1. keepAlive - 0 (the default) enables TCP keep-alive on the Redis connection immediately
2. connectTimeout - milliseconds before a timeout occurs during the initial connection to the Redis server
3. enableReadyCheck - wait for the server to load the database from disk before sending commands
4. retryStrategy - wait an increasing amount of time with each connection attempt
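The options above might look like this in an ioredis client config. This is a sketch: the specific timeout and retry numbers are assumptions, not the values from the talk.

```javascript
// Options object passed to `new Redis(options)` from the ioredis package.
const options = {
  keepAlive: 0,              // start TCP keep-alive immediately (the default)
  connectTimeout: 1000,      // ms to wait for the initial TCP connection
  enableReadyCheck: true,    // don't send commands until the dataset is loaded
  enableOfflineQueue: false, // fail fast instead of queueing while disconnected
  // Linear backoff, capped at 5s (illustrative values)
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000),
};

console.log(options.retryStrategy(4));  // 2000
console.log(options.retryStrategy(20)); // 5000 (capped)
```

`enableOfflineQueue: false` is the "fail fast" choice discussed later in the notes: if Redis is down, it is better to surface the error immediately than to buffer commands against a server that may be minutes away from recovering.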
33. 4. Build your Redis env around your data
1. Volatile vs. non-volatile
   a. Are you setting a TTL on keys?
2. What data is accessed the most?
Issues my team faced while scaling Redis in production
Address the elephant in the room
How we use Redis
I work on the team that provides the public-facing API for Napster
We use Redis to store information about our developers and to authenticate our users
70 million that fall through the cache
We store data about our developers, some user data, but mostly token sets as part of the OAuth flow
If you lose one, you lose the other.
You are subject to the ~28K ephemeral port limit
A quorum of 3 when you only have 4 sentinels can delay the time it takes to elect a new master.
Once the new master is elected, it can immediately handle writes
The default of 30 seconds allows for network hiccups and any other event that might trigger an unnecessary failover.
We’ve tried tuning this down to decrease overall failover time, but if it’s too short the system becomes too sensitive
When developing with small data sets it’s almost unnoticeable
Authenticated calls are failing
Some health checks are failing
By the time you have been alerted and look at the problem it’s fixed itself
Unacceptable amount of downtime
A restart won’t do anything for you. You are at the mercy of the time it takes to sync.
- Can anyone else who has been on call relate?
There is a linear correlation between data growth and the time it takes a slave to recover and become readable.
BGSAVE can double memory usage: it forks the process, and copy-on-write pages accumulate while the save runs
Perfect storm of connections piling up, bgsave memory issue and tokens not expiring fast enough
The dust has settled and now it’s time to investigate the issue
Set a maxmemory and key expire policy
Key expiry policy only works for ephemeral data or if you are willing to lose persisted data
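The maxmemory-plus-expiry policy described above can be sketched in redis.conf (the size is illustrative; `volatile-lru` is one way to satisfy the "only evict ephemeral data" constraint):

```
# redis.conf sketch: cap memory so growth can't take the instance down
maxmemory 4gb

# Evict only keys that carry a TTL; persisted keys without a TTL
# (e.g. developer data) are never evicted. This is exactly why the
# policy "only works for ephemeral data".
maxmemory-policy volatile-lru
```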
Make sure your app/client is not making a bad problem worse for Redis by re-establishing connections as soon as they fail
Systems will fail, so building redundancy into critical systems is essential
We are at the mercy of our clients’ implementations of OAuth
Monitoring usage allows us to proactively reach out to developers so they understand how the API should be used and we don’t have to store extra data
We found a client was requesting a new Auth Token before each authenticated call
We have to allow all new token sets in and don’t have a way of eagerly expiring old refresh tokens
Developer data has to stay, ephemeral data like refresh tokens can go
Switched NPM packages to ioredis and have never looked back
There was a bug in our old package where it wouldn’t kill the old connection after a failed redis lookup
Hit 28K port limit during redis outage
Finally, some code!
Create a global client referenced in the function to create a JS singleton
Ensures any place we require redis throughout the app is using the same connection
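The singleton pattern described in these notes can be sketched like this. The factory argument is a stand-in so the sketch runs without a live Redis server; in the real module it would be `() => new Redis(options)`:

```javascript
// redis-client.js sketch: a module-level variable holds the one client.
// Node caches modules, so every require() of this file shares `client`.
let client = null;

function getClient(createFn) {
  if (!client) {
    client = createFn(); // only the first caller actually creates a client
  }
  return client;
}

module.exports = { getClient };

// Usage: both lookups get the exact same connection object,
// so the whole app funnels through one pooled connection.
const a = getClient(() => ({ name: 'shared-redis-connection' }));
const b = getClient(() => ({ name: 'never-created' }));
console.log(a === b); // true
```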
Key Configuration: `role: master`
These configs are helpful during problem or outage situations
`enableOfflineQueue` is dangerous for us - the only time we are offline is during an outage, so queueing up requests is not doing us any favors
Retry Strategy: Good for network outages or failovers
Redis is so fast and flexible, you may not consider volatility vs space issues
We were storing critical data alongside ephemeral data
The speed is not too shabby either: we can still auth a user in <50ms with the backend round trip included. We traded some performance, but not too much
No redis downtime since split
Easy upgrades (30 second failover)
You can go to order.napster.com/developer and get a free 6 month trial of Napster.
Build an app with our APIs and then tweet at us, we would love to see what you come up with!