This document discusses Napster's use of Redis for caching and the problems they encountered with Redis failover at scale. As Napster's Redis data grew to tens of gigabytes, failover times increased to over 15 minutes, disrupting their API services. Napster addressed this by implementing connection pooling with ioredis to better handle connections during failovers, configuring Redis for availability rather than consistency, and architecting their systems to gracefully handle temporary Redis outages.
4. The cat is back!
● Rhapsody rebranded as Napster last spring
● Provides on-demand and radio streaming for mobile and desktop apps
● Powers on-demand streaming for apps like iHeartRadio
6. Napster API Snapshot
● API gateway layer
● 1K developers using the API
● 70M requests/day
● 7K Redis ops/sec
7. We LOVE Redis (mostly)
● Fast! Response times <10ms to the Redis cluster, network round trip included.
● Simple: built-in data types translate easily into JS. Replication comes free.
● Available: Redis is mission critical for us. When it’s down, we’re down.
12. So What’s The Problem?
● Redis server and sentinel share the same host
● Four sentinels
   a. An even number means there is a chance of ties if the quorum is 2
● Sending all read traffic to slaves means downtime during failover
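The sentinel layout above can be expressed as a sentinel.conf sketch (the addresses and the master name `mymaster` are illustrative, not Napster's actual config):

```
# sentinel.conf sketch (illustrative addresses)
# With four sentinels and quorum 2, two partitions of two can tie;
# an odd sentinel count with a majority quorum avoids this.
sentinel monitor mymaster 10.0.0.1 6379 2

# Default: wait 30 seconds before declaring the master down.
# Tuning this lower shortens failover but gets trigger-happy on
# network hiccups.
sentinel down-after-milliseconds mymaster 30000
```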
13. Steps in Failover
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated
3. A new slave is elected master
4. New master does a full BGSAVE
5. Master syncs data to existing slaves
6. Data is loaded into memory
7. Slave serves traffic
18. Steps in Failover (1GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (9 seconds)
5. Master syncs data to existing slaves (39 seconds)
6. Data is loaded into memory (8 seconds)
7. Slave serves traffic
Total time: ~1.5 minutes
19. Steps in Failover (5GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (40 seconds)
5. Master syncs data to existing slaves (122 seconds)
6. Data is loaded into memory (43 seconds)
7. Slave serves traffic
Total time: ~4 minutes
20. Steps in Failover (20GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (181 seconds)
5. Master syncs data to existing slaves (305 seconds)
6. Data is loaded into memory (238 seconds)
7. Slave serves traffic
Total time: ~12.5 minutes
21. Steps in Failover (40GB in Memory)
1. Master is unreachable
2. Sentinels reach quorum and failover is initiated (30 seconds)
3. A new slave is elected master
4. New master does a full BGSAVE (243 seconds)
5. Master syncs data to existing slaves (425 seconds)
6. Data is loaded into memory (354 seconds)
7. Slave serves traffic
Total time: ~18 minutes
24. Investigation
1. What is causing the failover?
2. Why is the data growing so quickly?
27. 1. What’s causing the failover?
1. Out of memory
2. Saturated client connections
3. Gremlins
28. 2. Why is the data growing so quickly?
1. Can you control the growth of data?
2. If you can’t control it, at least monitor it!
3. Think about data in terms of volatile vs. non-volatile
29. 3. How can we be better clients of Redis?
1. Connection pooling!
   a. https://github.com/luin/ioredis
2. Fail fast if the connection is not ready
3. Backoff strategy for retries
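A backoff strategy like the one above can be sketched as a pure function. ioredis calls `retryStrategy` with the attempt count and sleeps for the returned number of milliseconds before reconnecting; the base delay and cap below are assumptions for illustration, not Napster's production values.

```javascript
// Exponential backoff with a cap (illustrative values).
// Returning a number tells ioredis to retry after that many ms;
// returning nothing/undefined would stop retrying entirely.
function retryStrategy(attempt) {
  const baseDelayMs = 50;
  const maxDelayMs = 10000; // keep probing during a long failover
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

console.log(retryStrategy(1));  // 100
console.log(retryStrategy(5));  // 1600
console.log(retryStrategy(12)); // 10000 (capped)
```

The cap matters: without it, a multi-minute failover would push delays so high that the client takes a long time to notice the new master is ready.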
32. Tuning ioredis Config
1. keepAlive - 0 (the default) enables TCP keep-alive on the Redis connection immediately
2. connectTimeout - milliseconds before a timeout occurs during the initial connection to the Redis server
3. enableReadyCheck - wait for the server to load the database from disk before sending commands
4. retryStrategy - wait an increasing amount of time with each connection attempt
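The options above might look like this in an ioredis client config. This is a sketch: the specific timeout and retry numbers are assumptions, not the values from the talk.

```javascript
// Options object passed to `new Redis(options)` from the ioredis package.
const options = {
  keepAlive: 0,              // start TCP keep-alive immediately (the default)
  connectTimeout: 1000,      // ms to wait for the initial TCP connection
  enableReadyCheck: true,    // don't send commands until the dataset is loaded
  enableOfflineQueue: false, // fail fast instead of queueing while disconnected
  // Linear backoff, capped at 5s (illustrative values)
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000),
};

console.log(options.retryStrategy(4));  // 2000
console.log(options.retryStrategy(20)); // 5000 (capped)
```

`enableOfflineQueue: false` is the "fail fast" choice discussed later in the notes: if Redis is down, it is better to surface the error immediately than to buffer commands against a server that may be minutes away from recovering.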
33. 4. Build your Redis env around your data
1. Volatile vs. non-volatile
   a. Are you setting a TTL on keys?
2. What data is accessed the most?
Issues my team faced while scaling Redis in production
Address the elephant in the room
How we use Redis
I work on the team that provides the public-facing API for Napster
We use Redis to store information about our developers and to authenticate our users
70 million that fall through the cache
We store data about our developers, some user data, but mostly token sets as part of the OAuth flow
If you lose one, you lose the other.
You are subject to the ~28K ephemeral port limit
A quorum of 3 when you only have 4 sentinels can delay the time it takes to elect a new master.
Once the new master is elected, it can immediately handle writes
The default of 30 seconds allows for network hiccups and any other event that might trigger an unnecessary failover.
We’ve tried tuning this down to decrease overall failover time, but if it’s too short the system becomes too sensitive
When developing with small data sets it’s almost unnoticeable
Authenticated calls are failing
Some health checks are failing
By the time you have been alerted and look at the problem it’s fixed itself
Unacceptable amount of downtime
A restart won’t do anything for you. You are at the mercy of the time it takes to sync.
- Can anyone else who has been on call relate?
There is a linear correlation between data growth and the time it takes a slave to recover and become readable.
BGSAVE can double memory usage: it forks the process, and copy-on-write pages accumulate while the save runs
Perfect storm of connections piling up, bgsave memory issue and tokens not expiring fast enough
The dust has settled and now it’s time to investigate the issue
Set a maxmemory and key expire policy
Key expiry policy only works for ephemeral data or if you are willing to lose persisted data
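The maxmemory-plus-expiry policy described above can be sketched in redis.conf (the size is illustrative; `volatile-lru` is one way to satisfy the "only evict ephemeral data" constraint):

```
# redis.conf sketch: cap memory so growth can't take the instance down
maxmemory 4gb

# Evict only keys that carry a TTL; persisted keys without a TTL
# (e.g. developer data) are never evicted. This is exactly why the
# policy "only works for ephemeral data".
maxmemory-policy volatile-lru
```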
Make sure your app/client is not making a bad problem worse for Redis by re-establishing connections as soon as they fail
Systems will fail, so building redundancy into critical systems is essential
We are at the mercy of our clients’ implementations of OAuth
Monitoring usage allows us to proactively reach out to developers so they understand how the API should be used and we don’t have to store extra data
We found a client was requesting a new Auth Token before each authenticated call
We have to allow all new token sets in and don’t have a way of eagerly expiring old refresh tokens
Developer data has to stay, ephemeral data like refresh tokens can go
Switched NPM packages to ioredis and have never looked back
There was a bug in our old package where it wouldn’t kill the old connection after a failed redis lookup
Hit 28K port limit during redis outage
Finally, some code!
Create a global client referenced in the function to create a JS singleton
Ensures any place we require redis throughout the app is using the same connection
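The singleton pattern described in these notes can be sketched like this. The factory argument is a stand-in so the sketch runs without a live Redis server; in the real module it would be `() => new Redis(options)`:

```javascript
// redis-client.js sketch: a module-level variable holds the one client.
// Node caches modules, so every require() of this file shares `client`.
let client = null;

function getClient(createFn) {
  if (!client) {
    client = createFn(); // only the first caller actually creates a client
  }
  return client;
}

module.exports = { getClient };

// Usage: both lookups get the exact same connection object,
// so the whole app funnels through one pooled connection.
const a = getClient(() => ({ name: 'shared-redis-connection' }));
const b = getClient(() => ({ name: 'never-created' }));
console.log(a === b); // true
```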
Key Configuration: `role: master`
These configs are helpful during problem or outage situations
`enableOfflineQueue` is dangerous for us - the only time we are offline is during an outage, so queueing up requests is not doing us any favors
Retry Strategy: Good for network outages or failovers
Redis is so fast and flexible, you may not consider volatility vs space issues
We were storing critical data alongside ephemeral data
The speed is not too shabby either: we can still auth a user in <50ms with the backend round trip included. We traded some performance, but not too much
No redis downtime since split
Easy upgrades (30 second failover)
You can go to order.napster.com/developer and get a free 6 month trial of Napster.
Build an app with our APIs and then tweet at us, we would love to see what you come up with!