SlideShare a Scribd company logo
1 of 26
Nighthawk
Distributed caching with Redis @
Twitter
Rashmi Ramesh
@rashmi_ur
Agenda
What is Nighthawk?
How does it work?
Scaling out
High availability
Current challenges
Nighthawk - cache-as-a-service
Runs redis at it’s core
> 10M QPS,
Largest cluster runs ~3K redis nodes
> 10TB of data
Who uses Nighthawk?
Some of our biggest customers:
Analytics services - Ads, Video
Ad serving
Ad Exchange
Direct Messaging
Mobile app conversion tracking
Design Goals
Scalable: scale vertically and horizontally
Elastic: add / remove instances without violating SLA
High throughput and low latencies
High availability in the event of machine failures
Topology agnostic client
Nighthawk Architecture
Client
Proxy/Routing layer
Backend N
..……...
Redis 0 Redis N
Backend 0
..……...
Redis 0 Redis N
Topology
Cluster
manager
Cache backend
Mesos Container
Redis nodes
Topology
watcher and
announcer
1 2 3
NM
Proxy/Router
Replica 1 -> Redis1
Replica 2 -> Redis2
Replica 3 -> Redis3
Redis1(dc,host,port1,capacity)
Redis2(dc,host,port2, capacity)
Redis3(dc,host,port3,, capacity)
Topology
Cluster manager
Manages topology membership and changes
- (Re)Balances replicas
- Reacts to topology changes, eg: dead node
- Replicated cache - ensures 2 replicas of same partition are on separate
failure domains
Redis databases for partitions
Partition -> Redis DB
Granular key remapping
Logical data isolation
Enumerating - redis db scan
Deletion - flushdb
Enables replica rehydration
K1 K4K2 K3
Partition X Partition Y
1 2
Scaling
Scaling out with Client/Proxy managed
partitioningKey count: 1.5 M keys
Client
500K 500K500K
Scaling out with Client/Proxy managed
partitioningKey count: 1.5M keys
Remapped keys: 600K
Client
300K 300K300K 300K
300K
Persistent storage
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
Client
Persistent storage
Proxy
Topology and
cluster manager
500K 500K500K
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
Client
Persistent storage
Proxy
Topology and
cluster manager
500K 485K500K 15K
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/Partition: 15K
Client
485K 485K500K 15K 15K
Persistent storage
Proxy
Topology and
cluster manager
Scaling out with Cluster manager - Post
balancingKey count: 1.5M keys
Partition count: 100
Post balancing...
Client
Persistent storage
Proxy
Topology and
cluster manager
250K 250K250K 250K 500K
Advantages over Client managed partitioning
- Thin client - simple and oblivious to topology
- Clients, proxy layer and backends scale independently
- Pluggable custom load balancing logic through cluster manager
- No cluster downtime during scaling out/up/back
High Availability
High Availability with Replication
Synchronous, best effort
RF = 2, Intra DC
Supports idempotent operations only - get, put, remove, count, scan
Copies of a partition never on the same host and rack
Passive warming for failed/restarted replicas
High Availability with Replication
Client
Proxy/Routing layer
Backend 0
Partition 2,5,9
Topology
Cluster
manager
GetKey in
Partition 5
GetKey in
Partition 5
SERVING
Backend N
Partition
12,5,10
SERVINGFAILED
Backend N*
Partition 12,5,10
WARMING
SetKey in
partition 5
Pool A Pool B
Current challenges
Remember this?
The most retweeted
Tweet of 2014!
Hot key symptom
Significantly high QPS to a single cache server
Hot Key Mitigation
Server side diagnostics:
Sampling a small % of requests and logging
Post processing the logs to identify high frequency keys
Client side solution:
Client side hot key detection and caching
Better to have:
Redis tracks the hot keys
Protocol support to send feedback to client if a key is hot
Active warming of replicas
Client
Proxy/Routing layer
Topology
Cluster
manager
Backend A
Partition 2,5,9
SERVING
Backend B*
Partition 12,5,10
WARMING
writes
Bootstrapper
Pool A
Pool B
Questions?

More Related Content

What's hot

Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Databricks
 

What's hot (20)

Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Deep Dive on PostgreSQL Databases on Amazon RDS (DAT324) - AWS re:Invent 2018
Deep Dive on PostgreSQL Databases on Amazon RDS (DAT324) - AWS re:Invent 2018Deep Dive on PostgreSQL Databases on Amazon RDS (DAT324) - AWS re:Invent 2018
Deep Dive on PostgreSQL Databases on Amazon RDS (DAT324) - AWS re:Invent 2018
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuo...
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Redis persistence in practice
Redis persistence in practiceRedis persistence in practice
Redis persistence in practice
 
Unique ID generation in distributed systems
Unique ID generation in distributed systemsUnique ID generation in distributed systems
Unique ID generation in distributed systems
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 

Similar to RedisConf17- Using Redis at scale @ Twitter

HP Storage: Delivering Storage without Boundaries
HP Storage: Delivering Storage without BoundariesHP Storage: Delivering Storage without Boundaries
HP Storage: Delivering Storage without Boundaries
jameshub12
 
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
Redis Labs
 

Similar to RedisConf17- Using Redis at scale @ Twitter (20)

HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18HandsOn ProxySQL Tutorial - PLSC18
HandsOn ProxySQL Tutorial - PLSC18
 
Nutanix - The Next Level in Web Scale IT Architectures is Here
Nutanix - The Next Level in Web Scale IT Architectures is HereNutanix - The Next Level in Web Scale IT Architectures is Here
Nutanix - The Next Level in Web Scale IT Architectures is Here
 
WETEC HP Integrity Servers
WETEC HP Integrity ServersWETEC HP Integrity Servers
WETEC HP Integrity Servers
 
Hp Integrity Servers
Hp Integrity ServersHp Integrity Servers
Hp Integrity Servers
 
Large scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear passLarge scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear pass
 
HP Storage: Delivering Storage without Boundaries
HP Storage: Delivering Storage without BoundariesHP Storage: Delivering Storage without Boundaries
HP Storage: Delivering Storage without Boundaries
 
TechTalkThai-CiscoHyperFlex
TechTalkThai-CiscoHyperFlexTechTalkThai-CiscoHyperFlex
TechTalkThai-CiscoHyperFlex
 
High Performance Object Storage in 30 Minutes with Supermicro and MinIO
High Performance Object Storage in 30 Minutes with Supermicro and MinIOHigh Performance Object Storage in 30 Minutes with Supermicro and MinIO
High Performance Object Storage in 30 Minutes with Supermicro and MinIO
 
Perforce Server: The Next Generation
Perforce Server: The Next GenerationPerforce Server: The Next Generation
Perforce Server: The Next Generation
 
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
 
TechTarget Event - Storage Architectures for the Modern Data Centre – Martin ...
TechTarget Event - Storage Architectures for the Modern Data Centre – Martin ...TechTarget Event - Storage Architectures for the Modern Data Centre – Martin ...
TechTarget Event - Storage Architectures for the Modern Data Centre – Martin ...
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
What's new in confluent platform 5.4 online talk
What's new in confluent platform 5.4 online talkWhat's new in confluent platform 5.4 online talk
What's new in confluent platform 5.4 online talk
 
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
RedisConf17 - Redis Enterprise: Continuous Availability, Unlimited Scaling, S...
 
Techmeeting-17feb2016
Techmeeting-17feb2016Techmeeting-17feb2016
Techmeeting-17feb2016
 
MYSQL
MYSQLMYSQL
MYSQL
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
MySQL Database Architectures - InnoDB ReplicaSet & ClusterMySQL Database Architectures - InnoDB ReplicaSet & Cluster
MySQL Database Architectures - InnoDB ReplicaSet & Cluster
 
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform BenchmarkSunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
Sunx4450 Intel7460 GigaSpaces XAP Platform Benchmark
 

More from Redis Labs

SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
Redis Labs
 
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Redis Labs
 
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
Redis Labs
 
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
Redis Labs
 

More from Redis Labs (20)

Redis Day Bangalore 2020 - Session state caching with redis
Redis Day Bangalore 2020 - Session state caching with redisRedis Day Bangalore 2020 - Session state caching with redis
Redis Day Bangalore 2020 - Session state caching with redis
 
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
 
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
 
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
 
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
 
Redis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
Redis for Data Science and Engineering by Dmitry Polyakovsky of OracleRedis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
Redis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
 
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
 
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
 
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
 
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
 
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
 
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
 
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
 
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
 
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
 
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
 
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

RedisConf17- Using Redis at scale @ Twitter

  • 1. Nighthawk Distributed caching with Redis @ Twitter Rashmi Ramesh @rashmi_ur
  • 2. Agenda What is Nighthawk? How does it work? Scaling out High availability Current challenges
  • 3. Nighthawk - cache-as-a-service Runs redis at it’s core > 10M QPS, Largest cluster runs ~3K redis nodes > 10TB of data
  • 4. Who uses Nighthawk? Some of our biggest customers: Analytics services - Ads, Video Ad serving Ad Exchange Direct Messaging Mobile app conversion tracking
  • 5. Design Goals Scalable: scale vertically and horizontally Elastic: add / remove instances without violating SLA High throughput and low latencies High availability in the event of machine failures Topology agnostic client
  • 6. Nighthawk Architecture Client Proxy/Routing layer Backend N ..……... Redis 0 Redis N Backend 0 ..……... Redis 0 Redis N Topology Cluster manager
  • 7. Cache backend Mesos Container Redis nodes Topology watcher and announcer 1 2 3 NM Proxy/Router Replica 1 -> Redis1 Replica 2 -> Redis2 Replica 3 -> Redis3 Redis1(dc,host,port1,capacity) Redis2(dc,host,port2, capacity) Redis3(dc,host,port3,, capacity) Topology
  • 8. Cluster manager Manages topology membership and changes - (Re)Balances replicas - Reacts to topology changes, eg: dead node - Replicated cache - ensures 2 replicas of same partition are on separate failure domains
  • 9. Redis databases for partitions Partition -> Redis DB Granular key remapping Logical data isolation Enumerating - redis db scan Deletion - flushdb Enables replica rehydration K1 K4K2 K3 Partition X Partition Y 1 2
  • 11. Scaling out with Client/Proxy managed partitioningKey count: 1.5 M keys Client 500K 500K500K
  • 12. Scaling out with Client/Proxy managed partitioningKey count: 1.5M keys Remapped keys: 600K Client 300K 300K300K 300K 300K Persistent storage
  • 13. Scaling out with Cluster manager Key count: 1.5M keys Partition count: 100 Keys/Partition: 15K Client Persistent storage Proxy Topology and cluster manager 500K 500K500K
  • 14. Scaling out with Cluster manager Key count: 1.5M keys Partition count: 100 Keys/Partition: 15K Client Persistent storage Proxy Topology and cluster manager 500K 485K500K 15K
  • 15. Scaling out with Cluster manager Key count: 1.5M keys Partition count: 100 Keys/Partition: 15K Client 485K 485K500K 15K 15K Persistent storage Proxy Topology and cluster manager
  • 16. Scaling out with Cluster manager - Post balancingKey count: 1.5M keys Partition count: 100 Post balancing... Client Persistent storage Proxy Topology and cluster manager 250K 250K250K 250K 500K
  • 17. Advantages over Client managed partitioning - Thin client - simple and oblivious to topology - Clients, proxy layer and backends scale independently - Pluggable custom load balancing logic through cluster manager - No cluster downtime during scaling out/up/back
  • 19. High Availability with Replication Synchronous, best effort RF = 2, Intra DC Supports idempotent operations only - get, put, remove, count, scan Copies of a partition never on the same host and rack Passive warming for failed/restarted replicas
  • 20. High Availability with Replication Client Proxy/Routing layer Backend 0 Partition 2,5,9 Topology Cluster manager GetKey in Partition 5 GetKey in Partition 5 SERVING Backend N Partition 12,5,10 SERVINGFAILED Backend N* Partition 12,5,10 WARMING SetKey in partition 5 Pool A Pool B
  • 22. Remember this? The most retweeted Tweet of 2014!
  • 23. Hot key symptom Significantly high QPS to a single cache server
  • 24. Hot Key Mitigation Server side diagnostics: Sampling a small % of requests and logging Post processing the logs to identify high frequency keys Client side solution: Client side hot key detection and caching Better to have: Redis tracks the hot keys Protocol support to send feedback to client if a key is hot
  • 25. Active warming of replicas Client Proxy/Routing layer Topology Cluster manager Backend A Partition 2,5,9 SERVING Backend B* Partition 12,5,10 WARMING writes Bootstrapper Pool A Pool B

Editor's Notes

  1. Each major service gets it’s own cache cluster. 2 modes of operation - replicated and non replicated.
  2. Analytics services - Ads, Video - Ad engagement analytics, video ad engagement analytics Mobile app conversion tracking - tracks conversions like promoted app installs, in-app purchases and signups Ad serving - performs ad matching, scoring, and serving Ad Exchange - real time bidding for ads DM - direct messaging Interaction metrics service - provides different types of engagement metrics by tweet or by user
  3. Routing layer subscribes to topology changes and updates it’s current mapping of partition to redis node. For every request, it hashes the key and finds out which partition the key belongs to. It then figures which redis node it is mapped to and forwards the request to the appropriate redis. Each backend can have 1 or more redises. Since redis is single threaded, to increase throughput per container and fully utilize the resources allocated to the container- like bandwidth, CPU, RAM, the backend can have more than 1 redis. The backends also have a topology component that announces the currently running redis nodes. The cluster manager is in charge of creating partitions and managing topology. It is responsible for balancing replicas of partitions evenly across nodes, ensuring no replicas of the same partition are not down at the same time during managed data movement, ensuring dead nodes are removed from the topology after the partitions assigned to them have been successfully assigned to currenty available nodes. It also takes care of rate limited data movement from current nodes to newly joined nodes ensuring clients don’t see a huge number of cache misses as soon as the cluster is expanded. Trade off: Additional hop in proxy layer - for a topology agnostic client
  4. Runs in mesos containers Can have 1 or more redis instances running in each container Number of redis nodes per container - bound by server resources, amount of data to be store and data density per node. Announces information about the redis instances running to the topology Information: DC, host, port, device type, capacity … Capacity of a node - also can be referred to as weight - refers to how much data can be stored Watches and reacts to topology changes like new replica assigned to a local redis, or replica moving to a remote redis.
  5. Manages all the participants in the topology and maintains the sanity of the cluster Ensures every partition has a replica residing on an available node Balances replicas/partitions across nodes of the cluster. If nodes have different capacity, the number of replicas assigned to the nodes are proportional to their capacity
  6. Unit of data movement is much smaller - Moving 1/N keys in a redis vs a db in redis Moving a replica/partition is dropping all keys in a db in one redis and remapping the keys to another db in another redis
  7. Adding new nodes right away, causes Count(Keys)/Count(Nodes) to get remapped and will see a cache miss for those requests, hitting hard on the persistent storage. If proper checks and balances exist, persistent storage will rate limit the requests, or just serve with higher latencies and degraded throughput. In either case, clients will see errors and hit timeouts, thus undergoes Success rate degradation. There is no intelligent balancing if there is a higher config redis node, unless your have some sort of balancing logic inside the client. What an overload!
  8. If proxy layer is the bottleneck, you can add more proxy instances. If backends are the bottleneck, you can add more backends.
  9. Your persistent storage and the storage team will thank you for rate limiting how much traffic you send to it.
  10. State of the partitioning at the end of balancing.
  11. Topology schemes - you could use ZK in combination with consistent hashing, or maintain a changelog to store topology, or move to a totally different method for representing and storing topology. Clients don’t need to know about it. CLients don’t have to worry about replication factor, or how replication happens. New Administrative workflows can be added - automating rolling restart, node maintenance, migration with the help of CM.
  12. Why use replication? Data analytics pipeline Need to store real time data that have a relatively shorter lifetime (until batch jobs catch up) Computations are expensive to recompute on cache-miss User session data for current day Data lifetime of a day Expensive to store in a persistent key value store for the desired latency/throughput requirements Serves business goals for half the cost with better latencies.
  13. Trade offs RF > 2, adds to latency and cost Non idempotent operations not supported - incr/ decr
  14. Show writes when both are serving.
  15. Hot keys: Ellen’s tweet is a classic example of how a popular key snowballs into a hotkey. Key that gets a disproportionately high number of QPS. Manifests as a very busy cache server, slowing it down further, can result in b/w saturation if the value is large, and can result in packet drops, and client side timeouts.
  16. Quickly re-populating a warming replica using a serving copy Easy solution: Do nothing, rely on organic population of data on writes A better solution: Read data from a serving replica and write to the warming replica Rate limit copy to not impact production traffic latency and throughput