Doordash: Using Kafka to Replace RabbitMQ and Eliminate Task Processing Outages

SABA KHALILNAJI saba@doordash.com
ASHWIN KACHHARA ashwin@doordash.com
12/15/2020
Using Kafka to Replace RabbitMQ
and Eliminate Task Processing
Outages at DoorDash

Contents
Introduction
Problems we faced with Celery / RabbitMQ
Potential solutions to problems with Celery / RabbitMQ
Kafka Onboarding Strategy
No solution is perfect
Key Wins
Other use-cases of Kafka at DoorDash
Conclusion
Acknowledgements

Introduction
Tasks related to different use-cases leverage different topics with their dedicated worker pools, based on volume.

Problems we faced with
RabbitMQ & Celery

Issues with availability
● Some of our outages were caused by heavy use of Celery scheduled tasks with ETA
● Sudden bursts of traffic left RabbitMQ in a degraded state with low throughput
● Our uWSGI worker’s harakiri setting caused connection churn to RabbitMQ and cascading failures
● Celery task processing would stop with no evidence of resource constraints, requiring a restart

Other problems with Celery and RabbitMQ
SCALABILITY
Reached the maximum vertical scale available to us. The provider HA mode limited our capacity.
OBSERVABILITY
Limited to a small set of RabbitMQ metrics available to us. Limited visibility into the Celery workers.
OPERATIONAL EFFICIENCY
Unsustainable time spent operating and maintaining RabbitMQ. Not enough in-house RabbitMQ expertise.

Potential Solutions to the problems
with RabbitMQ and Celery

Potential solutions we considered
CELERY BROKER CHANGE
Continue using Celery with a potentially more reliable backing data store.
MULTI-BROKER SYSTEM
Shard task processing across multiple brokers to reduce average load.
RMQ / CELERY VERSION UPGRADE
Leverage potential reliability fixes in newer versions, buying us some time.
CUSTOM KAFKA SOLUTION
More effort than any other solution, but potential to solve all our problems (by design).

Option #1: Change the Celery Broker to Redis
PROS
● Improved availability & observability w/ ECC & multi-AZ
● Improved operational efficiency
● In-house operational experience & expertise w/ Redis
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Redis performance
CONS
● Incompatible w/ Redis clustered mode
● Single node Redis does not scale horizontally
● No Celery observability improvements
● Does not address stopped worker problem
Does not solve scalability, only partially solves observability, and does not address the stopped worker problem

Option #2: Change the Celery Broker to Kafka
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● The team has lots of Kafka expertise
● Broker swap is a simple supported option in Celery
● Connection churn doesn’t degrade Kafka performance
CONS
● Kafka is not supported by Celery yet
● No Celery observability improvements
● Insufficient experience operating Kafka at scale
Only partially solves observability, does not address the stopped worker problem, and is not supported out of the box

Option #3: Multi-Broker Solution
PROS
● Improved availability
● Horizontal scalability
● Comparatively less effort required
CONS
● No observability or operational efficiency boosts
● Does not address connection churn issue
Does not solve observability, connection churn, or the stopped worker problem

Option #4: Upgrade both Celery & RabbitMQ versions
PROS
● Might prevent RabbitMQ getting stuck
● Might prevent Celery workers getting stuck
● Buys us time to work on a longer-term strategy
CONS
● Will not fix any issues immediately
● Requires newer versions of Python
● Does not address connection churn issue
Might prevent stuck Celery workers, but doesn’t definitively solve anything else

Option #5: Building a custom Kafka solution
PROS
● Kafka can be highly available and horizontally scalable
● Improved observability and operational efficiency
● Team has a lot of in-house Kafka expertise
● Broker change is a straightforward option
● Connection churn doesn’t degrade Kafka performance
● Addresses stopped worker problem
CONS
● More work to implement compared to other options
● Minimal team experience operating Kafka at scale
Solves all our problems. Requires the most effort, and we have limited experience operating Kafka at scale

Building a custom Kafka Solution!
It addressed all the problems we were facing, while also being an industry standard that can scale. Kafka would give us full control over observability and availability.

Kafka Onboarding Strategy
HITTING THE GROUND RUNNING
Leverage the basic solution as we’re iterating on other parts of it. “Racing a car while swapping in a new fuel pump”
NO-OP ADOPTION
Maintain the same task interface for seamless, no-hassle adoption and minimize effort on the part of developers
INCREMENTAL ROLLOUT, ZERO DOWNTIME
Instead of a big flashy release, ship smaller independent features that can be individually tested

ONBOARDING STRATEGY
Hitting the ground running
We built a minimum viable product (MVP) to bring us interim stability and buy us time to iterate on a more comprehensive solution.

ONBOARDING STRATEGY
Hitting the ground running
We launched our MVP after 2 weeks of development. We achieved an 80% reduction in RabbitMQ task load a week after that.

ONBOARDING STRATEGY
Seamless adoption, incremental rollout
● We implemented a wrapper for Celery’s @task decorator
● The wrapper allowed us to route task submissions to either system dynamically
● As soon as a Celery subfeature had been ported, tasks using it could be migrated (in seconds)
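A minimal sketch of how such a routing wrapper could look. It is not DoorDash's implementation: the names `routed_task`, `KAFKA_ENABLED_TASKS`, `celery_outbox` and `kafka_outbox` are hypothetical, and plain lists stand in for the two submission paths so the example runs without Celery or Kafka installed. The point it illustrates is that callers keep a Celery-style `.delay()` interface while the destination is chosen at submission time.

```python
# Hypothetical runtime switch: task names listed here are routed to Kafka.
KAFKA_ENABLED_TASKS = set()

# Stand-ins for the two submission paths. A real system would hand the task
# to Celery/RabbitMQ or produce a message to a Kafka topic here.
celery_outbox = []
kafka_outbox = []


def routed_task(func):
    """Wrap a task function while keeping a Celery-style `.delay()` call."""

    def delay(*args, **kwargs):
        # The routing decision is made on every submission, so flipping the
        # switch takes effect immediately and callers need no code changes.
        if func.__name__ in KAFKA_ENABLED_TASKS:
            outbox = kafka_outbox
        else:
            outbox = celery_outbox
        outbox.append((func.__name__, args, kwargs))

    func.delay = delay
    return func


@routed_task
def send_receipt(order_id):
    return f"receipt sent for order {order_id}"


send_receipt.delay(1)                    # still routed to the Celery path
KAFKA_ENABLED_TASKS.add("send_receipt")  # migrate this task at runtime
send_receipt.delay(2)                    # now routed to the Kafka path
```

Because the switch is consulted per call, a task can be moved (or moved back) without touching the code that submits it, which is what makes a task-by-task rollout possible.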

ITERATE AS NEEDED
No solution is perfect

NO SOLUTION IS PERFECT
Head-of-the-line blocking
A “slow” message in a partition can block all messages behind it from getting processed.

NO SOLUTION IS PERFECT
Non-blocking task consumer
Consists of
● 1 x Local message queue
● 1 x Kafka-consumer process
● N x Task-executor processes
A “slow” message only blocks a single task-executor process until it completes. Other messages in the partition can continue to flow.
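The consumer layout described on this slide can be sketched as follows. This is an illustrative model only: threads stand in for the task-executor processes, a plain list stands in for the messages polled from one Kafka partition, and `run_consumer`, `handler` and the other names are hypothetical. The "slow" task is simulated by waiting until the three fast tasks have finished.

```python
import queue
import threading

_STOP = object()  # sentinel telling an executor to exit


def run_consumer(messages, handler, num_executors):
    """Hand every message to a pool of executors through a local queue."""
    local_queue = queue.Queue()

    def executor():
        while True:
            msg = local_queue.get()
            if msg is _STOP:
                return
            handler(msg)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for worker in workers:
        worker.start()

    # The consumer loop only enqueues. It never runs a task itself, so a slow
    # task cannot stop later messages from being handed out.
    for msg in messages:  # stands in for polling one Kafka partition
        local_queue.put(msg)
    for _ in workers:
        local_queue.put(_STOP)
    for worker in workers:
        worker.join()


completed = []
lock = threading.Lock()
fast_tasks_done = threading.Event()


def handler(msg):
    if msg == "slow":
        # Simulate a slow task: it finishes only after the three fast ones.
        fast_tasks_done.wait(timeout=5)
    with lock:
        completed.append(msg)
        if len(completed) == 3:
            fast_tasks_done.set()


run_consumer(["slow", "a", "b", "c"], handler, num_executors=2)
print(completed)  # ['a', 'b', 'c', 'slow']
```

Although "slow" is first in the partition, it occupies only one executor while the other drains the remaining messages. With a single executor the same input would stall behind "slow", which is the head-of-the-line blocking case.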

Scheduled tasks (and more) via Cadence
● Kafka is not a hard dependency for Cadence
● Useful to execute & schedule multi-step workflows in a distributed service ecosystem
● Distributed, scalable, durable, and highly available
● Orchestrates asynchronous business logic scalably and resiliently

Conclusion & Key Wins
NO MORE REPEATED OUTAGES
Dealt with the outage problem within 3 weeks of development, giving us more time after that to focus on esoteric features.
PROCESSING NO LONGER A BOTTLENECK
Task processing was no longer a bottleneck, allowing DoorDash to continue growing and serving customers.
10x INCREASED OBSERVABILITY
Granular observability in prod and dev environments, improving confidence as well as developer productivity.
OPERATIONAL DECENTRALIZATION
Enable developers to debug their operational issues, and perform cluster-management ops if needed.

Other notable use-cases
of Kafka at DoorDash

OTHER USE-CASES
Real-Time Streaming Platform
Receive real-time production and analytics events
Kafka REST Proxy
Apache Flink
Current Scale
● 800B events / day
● Peak > 200k / sec

OTHER USE-CASES
Our Iguazu Pipeline
Standardized events with schema definitions as Protobuf or Avro
● Low latency
● Lower costs
● Better Data Quality

OTHER USE-CASES
Search Indexing
Huge boost in
● Indexing speed
● Accuracy

It takes a village!
Engineering Branding:
Ezra Berger
Wayne Cunningham
Engineering:
Clement Fang, Corry Haines, Danial Asif, Jay Weinstein, Luigi Tagliamonte, Matthew Anger,
Shaohua Zhou, Yun-Yu Chen, Allen Wang, Matan Amir

SABA KHALILNAJI
ASHWIN KACHHARA
12/15/2020
Thank you

Further Reading
● https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/
● https://doordash.engineering/2020/08/14/workflows-cadence-event-driven-processing/
● https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/
