“TOO BIG TO FAILOVER”
A cautionary tale of scaling Redis
Aaron Pollack - May 2017
Presentation Summary
● How Redis is used at Napster
● Problems with failover at scale
● Our solution for constant time failovers
Napster is still around?
The cat is back!
● Rhapsody rebranded as Napster last spring
● Provides on-demand and radio streaming for mobile and desktop apps
● Powers on-demand streaming for apps like iHeartRadio
API.NAPSTER.COM
NAPSTER API SNAPSHOT
● API Gateway Layer
● 1k developers using the API
● 70M requests/day
● 7k Redis ops/sec
We LOVE Redis (mostly)
● Fast! - Response times <10ms to the Redis cluster, network round trip included.
● Simple - Built-in data types translate easily into JS. Replication comes free.
● Available - Redis is mission critical for us. When it’s down, we’re down.
Architected for Speed
So What’s The Problem?
● Redis server and sentinel share the same host
● Four sentinels
a. An even number means that there is a chance for ties if the quorum is 2 (see the Sentinel config sketch below)
● Sending all read traffic to slaves means that you have downtime during failover
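For context, here is a minimal Sentinel configuration sketch; the master name, address, and timings are illustrative assumptions, not Napster’s actual settings. The quorum is the number of sentinels that must agree the master is unreachable before a failover starts, which is why an odd sentinel count avoids ties:

  # sentinel.conf (sketch; assumed names and addresses)
  sentinel monitor mymaster 10.0.0.1 6379 2          # quorum of 2
  sentinel down-after-milliseconds mymaster 30000    # wait 30s before declaring the master down
  sentinel failover-timeout mymaster 180000

The 30-second down-after-milliseconds default is the first delay in the failover timelines that follow.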
Steps in Failover
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated
3. A new slave is elected master
4. New master does full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
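Not part of the original deck, but a useful way to see these steps in practice: Sentinel publishes its state changes (+sdown, +odown, +switch-master, and so on) over pub/sub, so a small ioredis script can timestamp each stage of a failover. The host and port here are assumptions:

  const Redis = require('ioredis');

  // Connect directly to one of the sentinels (they speak the Redis protocol).
  const sentinel = new Redis({ host: '10.0.0.1', port: 26379 });

  // Subscribe to every Sentinel event channel.
  sentinel.psubscribe('*');
  sentinel.on('pmessage', (pattern, channel, message) => {
    // e.g. "+switch-master mymaster 10.0.0.1 6379 10.0.0.2 6379"
    console.log(new Date().toISOString(), channel, message);
  });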
Steps in Failover (1GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Total Time: ~1.5 minutes
Steps in Failover (5GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (40 seconds)
5. Master syncs data to existing slaves (122 seconds)
6. Data is loaded into memory (43 seconds)
7. Slave serves traffic
Total Time: ~4 minutes
Steps in Failover (20GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (181 seconds)
5. Master syncs data to existing slaves (305 seconds)
6. Data is loaded into memory (238 seconds)
7. Slave serves traffic
Total Time: ~12.5 minutes
Steps in Failover (40GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does full BGSAVE (243 seconds)
5. Master syncs data to existing slaves (425 seconds)
6. Data is loaded into memory (354 seconds)
7. Slave serves traffic
Total Time: ~18 minutes
Slaves Become Unreachable During Failover
Investigation
1. What is causing the failover?
2. Why is the data growing so quickly?
1. What’s causing the failover?
1. Out of memory (see the maxmemory sketch below)
2. Saturated client connections
3. Gremlins
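The speaker notes suggest the fix for the first one: set a maxmemory limit and a key-expiry policy so Redis evicts keys before the host runs out of memory. A minimal redis.conf sketch; the limit and policy are illustrative assumptions, not Napster’s values:

  # redis.conf (sketch)
  maxmemory 4gb                   # leave headroom: BGSAVE can briefly double memory via copy-on-write
  maxmemory-policy volatile-lru   # evict least-recently-used keys that have a TTL; keys without a TTL stay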
2. Why is the data growing so quickly?
1. Can you control the growth of data?
2. If you can’t control it, at least monitor it! (see the monitoring sketch below)
3. Think about data in terms of volatile vs non-volatile
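A minimal monitoring sketch using ioredis: poll INFO MEMORY and report used_memory so growth is visible long before it triggers a failover. The one-minute interval and console logging are placeholders for whatever metrics pipeline you use:

  const Redis = require('ioredis');
  const client = new Redis();

  // Poll INFO MEMORY once a minute and record used_memory.
  setInterval(async () => {
    const info = await client.info('memory');          // raw "field:value" text
    const used = /used_memory:(\d+)/.exec(info)[1];    // bytes currently allocated by Redis
    console.log(`redis used_memory: ${used} bytes`);   // ship this to your metrics system
  }, 60 * 1000);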
3. How can we be better clients of Redis?
1. Connection Pooling!
a. https://github.com/luin/ioredis
2. Fast fail if the connection is not ready (see the sketch below)
3. Backoff strategy for retries
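A sketch of the fast-fail idea with ioredis: disable the offline queue so commands error immediately while the connection is down instead of piling up, and check the connection state before doing work. The getToken helper is hypothetical:

  const Redis = require('ioredis');

  const client = new Redis({
    enableOfflineQueue: false, // commands fail fast instead of queueing while disconnected
  });

  // Hypothetical helper: refuse work rather than stall when Redis is unavailable.
  async function getToken(key) {
    if (client.status !== 'ready') {
      throw new Error('redis not ready'); // let the caller degrade gracefully
    }
    return client.get(key);
  }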
ioredis
https://github.com/luin/ioredis
https://www.npmjs.com/package/ioredis
Client Singleton
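The code from this slide isn’t preserved in the transcript. Per the speaker notes, the idea is a global client behind a getter, so Node’s module cache turns it into a singleton and every require() shares one connection. A minimal sketch:

  // redis-client.js
  const Redis = require('ioredis');

  let client;

  function getClient() {
    if (!client) {
      client = new Redis(); // created once, then reused by every caller
    }
    return client;
  }

  module.exports = getClient;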
Tuning ioredis Config
1. keepAlive - 0 (the default) enables TCP keep-alive on the connection to Redis
2. connectTimeout - milliseconds before a timeout occurs during the initial connection to the Redis server
3. enableReadyCheck - wait for the server to load the database from disk before sending commands
4. retryStrategy - wait an increasing amount of time with each connection attempt
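Put together as an ioredis options object; the specific timeout and backoff numbers are illustrative assumptions, not the deck’s values:

  const Redis = require('ioredis');

  const client = new Redis({
    keepAlive: 0,           // enable TCP keep-alive with no initial delay
    connectTimeout: 10000,  // give up on the initial connection after 10 seconds
    enableReadyCheck: true, // hold commands until the server has loaded its dataset
    retryStrategy(times) {
      return Math.min(times * 50, 2000); // back off 50ms, 100ms, ... capped at 2s
    },
  });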
4. Build your Redis env around your data
1. Volatile vs non-volatile
a. Are you setting a TTL on keys? (see the sketch below)
2. What data is accessed the most?
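A sketch of the volatile vs non-volatile split, assuming OAuth-style token data as described in the speaker notes; the key names and the one-hour TTL are illustrative:

  const ONE_HOUR = 60 * 60; // seconds

  // Volatile: token sets expire on their own, so they can never accumulate forever.
  async function saveToken(client, tokenId, tokenJson) {
    await client.set(`token:${tokenId}`, tokenJson, 'EX', ONE_HOUR);
  }

  // Non-volatile: developer records get no TTL; with a volatile-* eviction
  // policy they are never eviction candidates.
  async function saveDeveloper(client, devId, devJson) {
    await client.set(`developer:${devId}`, devJson);
  }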
Client Initializer
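This slide’s code is also not preserved. Based on the speaker notes (role: master, offline queue disabled, a retry strategy), here is a sketch of a Sentinel-aware initializer; the sentinel addresses and master name are assumptions:

  const Redis = require('ioredis');

  function createClient() {
    return new Redis({
      sentinels: [ // ask the sentinels where the current master is
        { host: '10.0.0.1', port: 26379 },
        { host: '10.0.0.2', port: 26379 },
        { host: '10.0.0.3', port: 26379 },
      ],
      name: 'mymaster',          // master name from sentinel.conf
      role: 'master',            // reconnect to the new master after a failover
      enableOfflineQueue: false, // queued requests do us no favors during an outage
      retryStrategy: (times) => Math.min(times * 50, 2000),
    });
  }

  module.exports = createClient;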
Architected for Availability
THANK YOU
Me:
apollack@napster.com
github.com/lolpack
lolpack.me
Napster API Team:
@napsterAPI
Links:
White Paper: lolpack.me/rediswhitepaper.pdf
Try out Napster: order.napster.com/developer
API Docs: developer.napster.com


Editor's Notes

  • #2 Issues my team faced while scaling Redis in production
  • #4 Address the elephant in the room
  • #6 How we use Redis: I work on the team that provides the public-facing API for Napster. We use Redis to store information about our developers and to authenticate our users.
  • #7 70 million requests/day that fall through the cache. We store data about our developers, some user data, but mostly token sets as part of the OAuth flow.
  • #11 If you lose one, you lose the other. You are subject to the 28K port limit
  • #12 A quorum of 3 when you only have 4 sentinels can delay the time it takes to elect a new master.
  • #14 Once the new master is elected, it can immediately handle writes
  • #15 The default of 30 seconds allows for network hiccups and any other event that might trigger an unnecessary failover. We’ve tried tuning this down to decrease overall failover time, but if it’s too short it becomes too sensitive.
  • #19 When developing with small data sets it’s almost unnoticeable
  • #20 Authenticated calls are failing. Some health checks are failing. By the time you have been alerted and look at the problem, it’s fixed itself.
  • #22 Unacceptable amount of downtime. A restart won’t do anything for you. You are at the mercy of the time it takes to sync.
  • #23 - Can anyone else who has been on call relate?
  • #24 There is a linear correlation between data growth and the time it takes a slave to recover and become readable. BGSAVE doubles memory. It was a perfect storm of connections piling up, the BGSAVE memory issue, and tokens not expiring fast enough.
  • #25 The dust has settled and now it’s time to investigate the issue
  • #26 Set a maxmemory and a key-expire policy. A key-expiry policy only works for ephemeral data or if you are willing to lose persisted data.
  • #27 Make sure your app/client is not making a bad problem worse for Redis by re-establishing connections as soon as they fail
  • #28 Systems will fail, so building redundancy into critical systems is essential
  • #29 We are at the mercy of our clients’ implementation of OAuth. Monitoring usage allows us to proactively reach out to developers so they understand how the API should be used and we don’t have to store extra data. We found a client was requesting a new auth token before each authenticated call. We have to allow all new token sets in and don’t have a way of eagerly expiring old refresh tokens. Developer data has to stay; ephemeral data like refresh tokens can go.
  • #30 Switched NPM packages to ioredis and have never looked back. There was a bug in our old package where it wouldn’t kill the old connection after a failed Redis lookup. We hit the 28K port limit during a Redis outage.
  • #31 Finally, some code! Create a global client referenced in the function to create a JS singleton
  • #32 Ensures any place we require redis throughout the app is using the same connection
  • #33 Key configuration: `role: master`. These configs are helpful during problem or outage situations. enableOfflineQueue is dangerous for us - the only time we are offline is during an outage, so queueing up requests is not doing us any favors. retryStrategy: good for network outages or failovers.
  • #34 Redis is so fast and flexible, you may not consider volatility vs space issues. We were storing critical data with ephemeral data.
  • #36 The speed is not too shabby either: we can still auth a user in <50ms with backend round trip included. We traded some performance, but not too much. No Redis downtime since the split. Easy upgrades (30-second failover).
  • #38 You can go to order.napster.com/developer and get a free 6 month trial of Napster. Build an app with our APIs and then tweet at us, we would love to see what you come up with!