SABA KHALILNAJI saba@doordash.com
ASHWIN KACHHARA ashwin@doordash.com
12/15/2020
Using Kafka to Replace RabbitMQ
and Eliminate Task Processing
Outages at DoorDash
Contents
Introduction
Problems we faced with Celery / RabbitMQ
Potential solutions to problems with Celery / RabbitMQ
Kafka Onboarding Strategy
No solution is perfect
Key Wins
Other use-cases of Kafka at DoorDash
Conclusion
Acknowledgements
Introduction
Tasks for different use cases go to different topics,
each with its own dedicated worker pool, based on volume.
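As a rough illustration of the layout described in the introduction, the sketch below maps use cases to topics and worker-pool sizes. Every topic name, pool size, and the fallback topic is hypothetical; the deck does not give the actual configuration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TopicConfig:
    topic: str    # Kafka topic that carries this use case's tasks
    workers: int  # size of the worker pool dedicated to that topic


# Hypothetical routing table: higher-volume use cases get larger pools.
ROUTES: Dict[str, TopicConfig] = {
    "notifications": TopicConfig("tasks.notifications", workers=32),
    "reports": TopicConfig("tasks.reports", workers=4),
}

# Hypothetical fallback for use cases without a dedicated topic.
DEFAULT_ROUTE = TopicConfig("tasks.default", workers=8)


def route(use_case: str) -> TopicConfig:
    """Return the topic and pool size for a use case."""
    return ROUTES.get(use_case, DEFAULT_ROUTE)
```

Keeping the mapping in one table means a noisy use case can be moved to its own topic, or given a bigger pool, without touching the tasks themselves.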
Problems we faced with
RabbitMQ & Celery
Issues with availability
● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA
● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput
● Our uWSGI workers’ harakiri setting caused connection churn to RabbitMQ and cascading failures
● Celery task processing would stop with no evidence of resource constraints, requiring a restart
Other problems with Celery and RabbitMQ
SCALABILITY
Reached the maximum vertical
scale available to us. The provider
HA mode limited our capacity.
OBSERVABILITY
Limited to a small set of RabbitMQ
metrics available to us. Limited
visibility into the Celery workers.
OPERATIONAL EFFICIENCY
Unsustainable time spent operating
and maintaining RabbitMQ. Not enough
in-house RabbitMQ expertise.
Potential Solutions to the problems
with RabbitMQ and Celery
Potential solutions we considered
CELERY BROKER CHANGE
Continue using Celery with a potentially more
reliable backing data store.
MULTI-BROKER SYSTEM
Shard task processing across multiple
brokers to reduce average load.
RMQ / CELERY VERSION UPGRADE
Leverage potential reliability fixes in newer
versions, buying us some time.
CUSTOM KAFKA SOLUTION
More effort than any other solution, but potential
to solve all our problems (by design).
Option #1: Change the Celery Broker to Redis
PROS
● Improved availability & observability w/ ECC & multi-AZ
● Improved operational efficiency
● In-house operational experience & expertise w/ Redis
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Redis performance
CONS
● Incompatible w/ Redis clustered mode
● Single node Redis does not scale horizontally
● No Celery observability improvements
● Does not address stopped worker problem
Does not solve scalability, only partially solves observability, and does not address the stopped worker problem
Option #2: Change the Celery Broker to Kafka
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● The team has lots of Kafka expertise
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Kafka performance
CONS
● Kafka is not supported by Celery yet
● No Celery observability improvements
● Does not address stopped worker problem
● Insufficient experience operating Kafka at scale
Only partially solves observability, does not address the stopped worker problem, and is not supported out of the box
Option #3: Multi-Broker Solution
PROS
● Improved availability
● Horizontal scalability
● Comparatively less effort required
CONS
● No observability or operational efficiency boosts
● Does not address stopped worker problem
● Does not address connection churn issue
Does not solve observability, connection churn, or the stopped worker problem
Option #4: Upgrade both Celery & RabbitMQ versions
PROS
● Might prevent RabbitMQ getting stuck
● Might prevent Celery workers getting stuck
● Buys us time to work on a longer-term strategy
CONS
● Will not fix any issues immediately
● Requires newer versions of Python
● Does not address connection churn issue
Might prevent stuck Celery workers, but does not definitively solve anything else
Option #5: Building a custom Kafka solution
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● Team has a lot of in-house Kafka expertise
● Broker change is a straightforward option
● Connection churn doesn’t degrade Kafka performance
● Addresses stopped worker problem
CONS
● More work to implement compared to other options
● Minimal team experience operating Kafka at scale
Solves all our problems, but requires the most effort, and we had limited experience operating Kafka at scale
And the winner is…
Building a custom Kafka Solution!
It addressed all the problems we were facing, while also being an industry standard
that can scale. Kafka would give us full control over observability and availability.
Kafka Onboarding
Strategy
Kafka Onboarding Strategy
HITTING THE GROUND RUNNING
Leverage the basic solution as we’re
iterating on other parts of it. “Racing a
car while swapping in a new fuel pump”
NO-OP ADOPTION
Maintain the same task interface for
seamless, no-hassle adoption and
minimize effort on the part of developers
INCREMENTAL ROLLOUT, ZERO DOWNTIME
Instead of a big flashy release, ship
smaller independent features that can
be individually tested
ONBOARDING STRATEGY
Hitting the ground running
We built a minimum viable product (MVP) to
bring us interim stability and buy us time to
iterate on a more comprehensive solution.
We launched our MVP after 2 weeks of
development. We achieved an 80% reduction
in RabbitMQ task load a week after that.
ONBOARDING STRATEGY
Seamless adoption, incremental rollout
● We implemented a wrapper for Celery’s @task annotation
● This allowed us to route task submissions to either system dynamically
● As soon as a subfeature of Celery had been ported, tasks using it could be migrated (in seconds)
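A minimal sketch of how such a wrapper could work, assuming a per-task runtime flag decides where a submission goes. The `task` decorator, the `KAFKA_ENABLED` flag store, and the two outbox lists are hypothetical stand-ins; a real implementation would call Celery's submission API and a Kafka producer instead of appending to lists.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for the two backends. A real system would submit
# to Celery or produce to Kafka here; the lists only record what was sent.
CELERY_OUTBOX: List[Tuple[str, tuple, dict]] = []
KAFKA_OUTBOX: List[Tuple[str, tuple, dict]] = []

# Hypothetical runtime flag store: task name -> True once migrated to Kafka.
KAFKA_ENABLED: Dict[str, bool] = {}


def task(func: Callable[..., Any]) -> Callable[..., Any]:
    """Drop-in decorator that keeps a Celery-like `.delay()` interface."""

    def delay(*args: Any, **kwargs: Any) -> str:
        # The flag is read on every submission, so flipping it takes
        # effect immediately and call sites never change.
        if KAFKA_ENABLED.get(func.__name__, False):
            KAFKA_OUTBOX.append((func.__name__, args, kwargs))
            return "kafka"
        CELERY_OUTBOX.append((func.__name__, args, kwargs))
        return "celery"

    func.delay = delay  # attach the submission method to the task function
    return func


@task
def send_receipt(order_id: int) -> None:
    print(f"receipt for order {order_id}")
```

Calling `send_receipt.delay(1)` goes to the Celery stand-in until `KAFKA_ENABLED["send_receipt"] = True` is set, after which the same call is routed to the Kafka stand-in. Because the decorator preserves the task interface, developers do not need to change their code to migrate.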
ITERATE AS NEEDED
No solution is perfect
NO SOLUTION IS PERFECT
Head-of-the-line blocking
A “slow” message in a partition can
block all messages behind it from
getting processed.
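To make the cost concrete: when a single worker processes a partition strictly in order, each message finishes only after everything ahead of it. The helper below is an illustrative calculation, and the durations are made up.

```python
from itertools import accumulate
from typing import List


def completion_times(durations: List[float]) -> List[float]:
    """Finish time of each message when one worker handles a partition in order."""
    return list(accumulate(durations))


# One 5-second task followed by two half-second tasks (made-up numbers).
print(completion_times([5.0, 0.5, 0.5]))  # [5.0, 5.5, 6.0]
```

The two fast tasks need only half a second of work each, yet neither finishes before the 5-second mark because they sit behind the slow one.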
NO SOLUTION IS PERFECT
Non-blocking task consumer
Consists of
● 1 x Local message queue
● 1 x Kafka-consumer process
● N x Task-executor processes
A “slow” message only blocks a single
task-executor process until it completes.
Other messages in the partition can
continue to flow.
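The structure above can be sketched in a few lines. This is a simplified illustration, not the actual consumer: it uses threads in place of the separate processes the slide describes, a plain list in place of a Kafka partition, and `time.sleep` in place of real task execution. It also ignores offset commits and failure handling. The name `run_consumer` is hypothetical.

```python
import queue
import threading
import time
from typing import List, Tuple


def run_consumer(messages: List[Tuple[str, float]], num_executors: int) -> List[str]:
    """Return message ids in the order their tasks finished."""
    local_queue: queue.Queue = queue.Queue()  # the local message queue
    finished: List[str] = []
    lock = threading.Lock()

    def executor() -> None:
        while True:
            item = local_queue.get()
            if item is None:  # sentinel: no more work for this executor
                return
            msg_id, seconds = item
            time.sleep(seconds)  # stand-in for running the task
            with lock:
                finished.append(msg_id)

    executors = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in executors:
        worker.start()

    # The "consumer" role: read the partition in order and hand each
    # message to the local queue without waiting for it to complete.
    for message in messages:
        local_queue.put(message)
    for _ in executors:
        local_queue.put(None)
    for worker in executors:
        worker.join()
    return finished
```

With messages `[("slow", 0.3), ("fast-1", 0.01), ("fast-2", 0.01)]` and two executors, the slow message occupies one executor while the other drains both fast messages, so the fast ones finish first. With a single executor the same input reproduces head-of-the-line blocking.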
Scheduled tasks (and more) via Cadence
● Kafka is not a hard dependency for Cadence
● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
● Distributed, scalable, durable, and highly available
● Orchestrates asynchronous business logic scalably and with resilience
Conclusions
Conclusion & Key Wins
NO MORE REPEATED
OUTAGES
Dealt with outage problem within 3 weeks
of development, giving us more time after
that to focus on esoteric features.
PROCESSING NO LONGER A BOTTLENECK
Task processing was no longer a bottleneck,
allowing DoorDash to continue growing
and serving customers
10x INCREASED OBSERVABILITY
Granular observability in prod and dev
environments, improving confidence as well
as developer productivity.
OPERATIONAL DECENTRALIZATION
Enable developers to debug their
operational issues, and perform
cluster-management ops if needed.
Other notable use-cases
of Kafka at DoorDash
OTHER USE-CASES
Real-Time Streaming Platform
Receive real-time production
and analytics events
Kafka REST Proxy
Apache Flink
Current Scale
● 800B events / day
● Peak > 200k / sec
OTHER USE-CASES
Our Iguazu Pipeline
Standardized events with schema
definitions as Protobuf or Avro
● Low latency
● Lower costs
● Better Data Quality
OTHER USE-CASES
Search Indexing
Huge boost in
● Indexing speed
● Accuracy
It takes a village!
Engineering Branding:
Ezra Berger
Wayne Cunningham
Engineering:
Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger,
Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
SABA KHALILNAJI
ASHWIN KACHHARA
12/15/2020
Thank you
Further Reading
● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/

More Related Content

More from confluent

AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
confluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
confluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
confluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
confluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
confluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
confluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
confluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
confluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
confluent
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
confluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
confluent
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Reply
confluent
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
confluent
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloud
confluent
 

More from confluent (20)

AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 
The Journey to Data Mesh with Confluent
The Journey to Data Mesh with ConfluentThe Journey to Data Mesh with Confluent
The Journey to Data Mesh with Confluent
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Confluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with ReplyConfluent Partner Tech Talk with Reply
Confluent Partner Tech Talk with Reply
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Citi Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid CloudCiti Tech Talk: Hybrid Cloud
Citi Tech Talk: Hybrid Cloud
 

Recently uploaded

"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
ScyllaDB
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 

Recently uploaded (20)

"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Discover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched ContentDiscover the Unseen: Tailored Recommendation of Unwatched Content
Discover the Unseen: Tailored Recommendation of Unwatched Content
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 

Doordash: Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages

  • 1. 1 SABA KHALILNAJI saba@doordash.com ASHWIN KACHHARA ashwin@doordash.com 12/15/2020 Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash
  • 2. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 2 Contents Introduction Problems we faced with Celery / RabbitMQ Potential solutions to problems with Celery / RabbitMQ Kafka Onboarding Strategy No solution is perfect Key Wins Other use-cases of Kafka at DoorDash Conclusion Acknowledgements
  • 3. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 3 Tasks related to different use-cases leverage different topics with their dedicated worker pools, based on volume. Introduction
  • 4. 4 Problems we faced with RabbitMQ & Celery
  • 5. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 5 Issues with availability ● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA ● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput ● Our uWSGI worker’s harakiri setting caused a connection churn to RabbitMQ AND cascading failure ● Celery task processing would stop with no evidence of resource constraints, requiring a restart
  • 6. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 6 Other problems with Celery and RabbitMQ SCALABILITY Reached the maximum vertical scale available to us. The provider HA mode limited our capacity. OBSERVABILITY Limited to a small set of RabbitMQ metrics available to us. Limited visibility into the Celery workers. OPERATIONAL EFFICIENCY Unsustainable time spent operating and maintaining RabbitMQ. Not enough in-house RabbitMQ expertise.
  • 7. 7 Potential Solutions to the problems with RabbitMQ and Celery
  • 8. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash 8 CELERY BROKER CHANGE Continue using Celery with a potentially more reliable backing data store. MULTI-BROKER SYSTEM Shard task processing across multiple brokers to reduce average load. RMQ / CELERY VERSION UPGRADE Leverage potential reliability fixes in newer versions, buying us some time. CUSTOM KAFKA SOLUTION More effort than any other solution, but potential to solve all our problems (by design). Potential solutions we considered
  • 9. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 9 Change the Celery Broker to Redis ● Improved availability & observability w/ ECC & multi-AZ ● Improved operational efficiency ● In-house operational experience & expertise w/ Redis ● Broker swap is a simple supported option in Celery ● Connection churn doesn’t degrade Redis performance ● Incompatible w/ Redis clustered mode ● Single node Redis does not scale horizontally ● No Celery observability improvements ● Does not address stopped worker problem CONS Option #1 Does not solve scalability, only partially solves observability, and does not address worker stopped problem
  • 10. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 10 Change the Celery Broker to Kafka ● Kafka can be highly available and horizontally scalable ● Improved observability and operational efficiency ● The team has lots of Kafka expertise ● Broker swap is a simple supported option in Celery ● Connection churn doesn’t degrade Kafka performance ● Kafka is not supported by Celery yet ● No Celery observability improvements ● Does not address stopped worker problem ● Insufficient experience operating Kafka at scale CONS Option #2 Only partially solves observability, does not address worker stopped problem AND not supported out of the box
  • 11. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 11 Multi-Broker Solution ● Improved availability ● Horizontal scalability ● Comparatively less effort required ● No observability or operational efficiency boosts ● Does not address stopped worker problem ● Does not address connection churn issue CONS Option #3 Does not solve observability, connection churn, nor worker stopped problem
  • 12. Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages at DoorDash PROS 12 Upgrade both Celery & RabbitMQ versions ● Might prevent RabbitMQ getting stuck ● Might prevent Celery workers getting stuck ● Buys us time to work on a longer-term strategy ● Will not fix any issues immediately ● Requires newer versions of Python ● Does not address connection churn issue CONS Option #4 Might prevent stuck Celery workers, but doesn’t definitely solve anything else
  • 13. Option #5: Building a custom Kafka solution
    PROS
    ● Kafka can be highly available and horizontally scalable
    ● Improved observability and operational efficiency
    ● Team has a lot of in-house Kafka expertise
    ● Broker change is a straightforward option
    ● Connection churn doesn’t degrade Kafka performance
    ● Addresses stopped worker problem
    CONS
    ● More work to implement compared to other options
    ● Minimal team experience operating Kafka at scale
    Summary: Solves all our problems, but requires the most effort, and we have limited experience operating Kafka at scale.
  • 15. Building a custom Kafka Solution!
    It addressed all the problems we were facing, while also being an industry standard that can scale. Kafka would give us full control over observability and availability.
  • 17. Kafka Onboarding Strategy
    HITTING THE GROUND RUNNING: Leverage the basic solution as we’re iterating on other parts of it. “Racing a car while swapping in a new fuel pump.”
    NO-OP ADOPTION: Maintain the same task interface for seamless, no-hassle adoption, and minimize effort on the part of developers.
    INCREMENTAL ROLLOUT, ZERO DOWNTIME: Instead of a big flashy release, ship smaller independent features that can be individually tested.
  • 18. ONBOARDING STRATEGY: Hitting the ground running
    We built a minimum viable product (MVP) to bring us interim stability and buy us time to iterate on a more comprehensive solution.
  • 19. ONBOARDING STRATEGY: Hitting the ground running
    We launched our MVP after 2 weeks of development. We achieved an 80% reduction in RabbitMQ task load a week after that.
  • 20. ONBOARDING STRATEGY: Seamless adoption, incremental rollout
    ● We implemented a wrapper for Celery’s @task annotation
    ● The wrapper allowed us to route task submissions to either system dynamically
    ● As soon as a subfeature of Celery had been ported, tasks using it could be migrated (in seconds)
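The wrapper idea on this slide can be sketched roughly as follows. This is a minimal illustration, not DoorDash's implementation: the `task` decorator, the `ROUTES` table, and the two in-memory outboxes are hypothetical stand-ins (no Celery or Kafka client is imported), chosen only to show how a per-task routing decision can sit behind an unchanged `.delay()` interface.

```python
from functools import wraps

# Hypothetical stand-ins for the two backends; a real system would publish
# to RabbitMQ (via Celery) or to a Kafka topic instead of appending to lists.
celery_outbox = []
kafka_outbox = []

# Hypothetical routing table (task name -> backend). Assumed to be editable
# at runtime, e.g. backed by a feature-flag store.
ROUTES = {}


def task(func):
    """Hypothetical drop-in for Celery's @task that keeps the .delay() interface."""
    name = func.__name__

    @wraps(func)
    def wrapper(*args, **kwargs):
        # A direct call still runs the function synchronously.
        return func(*args, **kwargs)

    def delay(*args, **kwargs):
        # The backend is looked up on every submission, so flipping the
        # route takes effect for the next task without a deploy.
        backend = ROUTES.get(name, "celery")
        outbox = kafka_outbox if backend == "kafka" else celery_outbox
        outbox.append((name, args, kwargs))
        return backend

    wrapper.delay = delay
    return wrapper


@task
def send_receipt(order_id):
    return f"receipt for {order_id}"
```

Callers keep writing `send_receipt.delay(order_id)`; setting `ROUTES["send_receipt"] = "kafka"` moves that one task to the new system while every other task stays where it was.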
  • 21. No solution is perfect: ITERATE AS NEEDED
  • 22. NO SOLUTION IS PERFECT: Head-of-the-line blocking
    A “slow” message in a partition can block all messages behind it from getting processed.
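A toy calculation makes the effect concrete. The function below is a hypothetical illustration (not from the talk) that assumes a single consumer handling one partition strictly in order, with made-up processing times:

```python
def sequential_finish_times(durations):
    """Finish time of each message when one consumer processes a partition in order."""
    finish = []
    clock = 0
    for duration in durations:
        clock += duration      # each message must wait for everything ahead of it
        finish.append(clock)
    return finish


# One 10-second message followed by two 1-second messages.
print(sequential_finish_times([10, 1, 1]))  # [10, 11, 12]
```

The two fast messages need only a second of work each, yet they finish at 11 and 12 seconds because they sit behind the slow one.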
  • 23. NO SOLUTION IS PERFECT: Non-blocking task consumer
    Consists of
    ● 1 x Local message queue
    ● 1 x Kafka-consumer process
    ● N x Task-executor processes
    A “slow” message only blocks a single task-executor process until it completes. Other messages in the partition can continue to flow.
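The shape of this design can be sketched as below. It is a simplified illustration under stated assumptions, not the production consumer: threads stand in for the task-executor processes, a plain list stands in for the Kafka partition, and `run_consumer`, `demo`, and the message names are hypothetical. Offset commits and failure handling are deliberately left out.

```python
import queue
import threading

_STOP = object()  # sentinel telling an executor to exit


def run_consumer(messages, handler, num_executors=2):
    """Feed messages from one 'partition' to N executors through a local queue."""
    local_queue = queue.Queue(maxsize=2 * num_executors)  # bounded local buffer

    def executor():
        while True:
            msg = local_queue.get()
            if msg is _STOP:
                return
            handler(msg)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in workers:
        worker.start()
    for msg in messages:          # stand-in for the Kafka poll loop
        local_queue.put(msg)
    for _ in workers:
        local_queue.put(_STOP)
    for worker in workers:
        worker.join()


def demo():
    """Show that a slow message occupies one executor while others keep flowing."""
    completed = []
    lock = threading.Lock()
    release_slow = threading.Event()

    def handler(msg):
        if msg == "slow":
            # Simulate a slow task: it finishes only after the fast ones are done.
            release_slow.wait(timeout=5)
        with lock:
            completed.append(msg)
            if completed == ["fast-1", "fast-2"]:
                release_slow.set()

    run_consumer(["slow", "fast-1", "fast-2"], handler, num_executors=2)
    return completed
```

In `demo()`, the slow message is consumed first but completes last: the second executor drains the two fast messages while the first is still busy. One consequence of this design is that messages can complete out of order, so a real consumer has to decide carefully when an offset is safe to commit.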
  • 24. Scheduled tasks (and more) via Cadence
    ● Kafka is not a hard dependency for Cadence
    ● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
    ● Distributed, scalable, durable, and highly available
    ● Orchestrates asynchronous business logic scalably and resiliently
  • 26. Conclusion & Key Wins
    NO MORE REPEATED OUTAGES: Dealt with the outage problem within 3 weeks of development, giving us more time after that to focus on esoteric features.
    PROCESSING NO LONGER A BOTTLENECK: Task processing was no longer a bottleneck, allowing DoorDash to continue growing and serving customers.
    10x INCREASED OBSERVABILITY: Granular observability in prod and dev environments, improving confidence as well as developer productivity.
    OPERATIONAL DECENTRALIZATION: Enables developers to debug their operational issues and perform cluster-management ops if needed.
  • 27. Other notable use-cases of Kafka at DoorDash
  • 28. OTHER USE-CASES: Real-Time Streaming Platform
    Receive real-time production and analytics events.
    Components shown: Kafka REST Proxy, Apache Flink
    Current Scale
    ● 800B events / day
    ● Peak > 200k / sec
  • 29. OTHER USE-CASES: Our Iguazu Pipeline
    Standardized events with schema definitions as Protobuf or Avro
    ● Low latency
    ● Lower costs
    ● Better data quality
  • 30. OTHER USE-CASES: Search Indexing
    Huge boost in
    ● Indexing speed
    ● Accuracy
  • 31. It takes a village!
    Engineering Branding: Ezra Berger, Wayne Cunningham
    Engineering: Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger, Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir
  • 33. Further Reading
    ● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
    ● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
    ● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/