
Uber: Kafka Consumer Proxy



Haitao Zhang, Uber, Software Engineer + Yang Yang, Uber, Senior Software Engineer

Kafka Consumer Proxy is a forwarding proxy that consumes messages from Kafka and dispatches them to a user-registered gRPC service endpoint. With Kafka Consumer Proxy, consuming messages from Apache Kafka for pub-sub use cases is as seamless and user-friendly as receiving (g)RPC requests. In this talk, we will share (1) the motivation for building this service, (2) the high-level architecture, (3) the mechanisms we designed to achieve high availability, scalability, and reliability, and (4) the current adoption status.

Published in: Technology


  1. Kafka Consumer Proxy. Haitao Zhang & Yang Yang, Uber Technologies, Inc. 10/21/2020
  2. Agenda
     ● Background
     ● Motivation
     ● High-level Architecture
     ● Implementation
     ● Resource Isolation
     ● User Integration
     ● Summary & Future Work
  3. Background
  4. Kafka at Uber
  5. Different Kafka Consumer Types at Uber
     Consumer Type | # of Owning Teams | Library/Framework
     Real-time Analysis | 1 | Flink
     Batch Analysis | 1 | Spark
     Logging | 2 | Open source Kafka consumer library
     Pub-sub | > 500 | Open source Kafka consumer library / in-house libraries
  6. Motivation: Is the consumer library the right choice for pub-sub use cases?
  7. Kafka Consumer Has a Steep Learning Curve
     Requires users to have a deep understanding of Kafka
     ● Internal concepts
       ○ consumer group
       ○ offset commit
       ○ group rebalance
     ● Configuration parameters: ~40
     An inappropriate setup leads to problems, or even outages
     ● Endless group rebalancing
     ● Stuck partitions
  8. Users Need a Messaging System
     Feature | Pubsub / Messaging | Streaming (Kafka)
     Message Consumption/Processing | Every single message is processed independently of every other message | Messages are processed in order per partition
     Acknowledgement | Single message (unordered) acknowledgements | Commit up to an offset (cumulative)
     Retry | Retry messages without blocking newer inbound messages | Retrying a message may block inbound messages
     Dead Letter Queue | Exists for storing messages with non-transient errors | N/A
  9. Kafka Consumer Does Not Align Well with Microservices
     Feature | Microservice | Kafka
     Communication model | Senders push messages to receivers | Receivers pull messages from senders
     Flow Control | Senders control the RPC rate | Receivers control the message polling rate
     Scalability | It’s easy to scale up the # of receivers | The # of receivers is limited by the # of partitions
     Complexity | Most complexities are on the sender side | Most complexities are on the receiver side
  10. Just deliver my data to me (at least once)! No complexities!
  11. Heavy User Support Workload
      ● 500+ services use Kafka for pub-sub
        ○ Support/maintain different languages/versions of consumer libraries
      ● An average of 4 engineer-hours per day for user support
        ○ Debugging, administration requests, and outage mitigation
        ○ Kafka consumer is difficult to get right
        ○ Kafka consumer does not align well with microservices
      ● Client library upgrade - mission impossible!
  12. Anti-pattern User Behavior
      ● Using randomly generated strings as consumer group names
      ● Using Kafka to broadcast messages
      ● Committing offsets incorrectly
      ● ...
      ● Combinations of the above
  13. More control over consumers, less user support
  14. High-Level Architecture
  15. A Proxy
      ● Runs independently of Kafka servers and (message receiver) services
      ● Pulls messages from Kafka servers using the Kafka protocol
      ● Dispatches messages to services via gRPC and a load balancer
      ● Takes complexity away from receiver services → easy for users to understand and use
      ● Centralizes management → easy for Kafka managers to maintain/upgrade/control
  16. Implementation
  17. Architecture -- A Closer Look
      ● Kafka Consumer Proxy
        ○ Controller and worker
        ○ Data Plane (data flow) and Control Plane (task and worker management)
      ● Task -- a unit of work
        ○ Message pulling and dispatching work for a <cluster, consumer group, topic, partition> tuple
        ○ Managed/placed by the control plane
        ○ Executed by the data plane
  18. Data Plane Modules
      ● Fetcher handles communication with Kafka clusters
      ● Processor controls message fetching, dispatching, and offset commits
      ● Dispatcher handles communication with receivers
  19. Fetcher: Kafka Consumer
      ● Reduces load on Kafka brokers (group coordinator)
        ○ Uses assign instead of subscribe
        ○ Compacts offset commits
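The slide does not say how offset commits are compacted. One plausible reading is that many commit requests for the same partition are coalesced so that only the latest offset reaches the broker. A minimal sketch under that assumption follows; `CommitCompactor`, `request_commit`, and `flush` are hypothetical names, not the proxy's actual API, and no Kafka client is involved.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


class CommitCompactor:
    """Coalesces commit requests: only the highest offset per partition is kept.

    Hypothetical illustration; the real proxy's compaction logic is not
    described in the slides.
    """

    def __init__(self) -> None:
        self._pending: Dict[TopicPartition, int] = {}
        self.requests_seen = 0

    def request_commit(self, tp: TopicPartition, offset: int) -> None:
        # Count every request, but remember only the highest offset per partition.
        self.requests_seen += 1
        if offset > self._pending.get(tp, -1):
            self._pending[tp] = offset

    def flush(self) -> Dict[TopicPartition, int]:
        # Hand back one batched commit and reset the pending state.
        batch, self._pending = self._pending, {}
        return batch
```

A caller would invoke `flush()` on a timer and send the returned map to Kafka as a single commit, so the number of broker requests depends on the flush interval rather than on message volume.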
  20. Processor: Messaging System
      ● Flow control
        ○ Caps the message fetching/dispatching rate
      ● Retry queue / dead letter queue (DLQ)
        ○ Retry and DLQ are separate Kafka topics
        ○ Handles retriable and non-retriable errors
        ○ Non-blocking -- once a message is produced to either queue, its offset can be committed
      ● Acknowledgement management
        ○ Receivers ack or nack (i.e., send errors back for) each individual message, or the request times out
        ○ Accumulates acked/nacked offsets
        ○ Commits offsets to Kafka
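Per-message acks have to be reconciled with Kafka's "commit up to" model. A minimal sketch of one way to do that, assuming the processor commits the lowest offset that has not yet been resolved (acked, or nacked and forwarded to the retry or DLQ topic). `AckTracker`, `route_failure`, and the `max_attempts` limit are hypothetical and not taken from the talk.

```python
from typing import Set


class AckTracker:
    """Tracks out-of-order resolutions for one partition (hypothetical sketch)."""

    def __init__(self, start_offset: int) -> None:
        self._next = start_offset        # lowest offset not yet resolved
        self._resolved: Set[int] = set() # resolved offsets above the watermark

    def resolve(self, offset: int) -> None:
        # An offset is resolved when it is acked, or nacked and handed to
        # the retry/DLQ topic, so it never blocks newer messages.
        if offset >= self._next:
            self._resolved.add(offset)
        # Advance the watermark over every contiguous resolved offset.
        while self._next in self._resolved:
            self._resolved.remove(self._next)
            self._next += 1

    def committable(self) -> int:
        # Kafka convention: the committed offset is the next offset to read.
        return self._next


def route_failure(retriable: bool, attempts: int, max_attempts: int = 3) -> str:
    """Picks the destination topic for a nacked message (assumed policy)."""
    if retriable and attempts < max_attempts:
        return "retry"
    return "dlq"
```

With this shape, a slow or failing message holds back the commit only until it is acked or moved to another topic, which matches the non-blocking behavior described on the slide.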
  21. Dispatcher: Kafka-gRPC Protocol
      ● Maps Kafka messages to gRPC messages
        ○ A Kafka message is mapped to a gRPC request
          ■ Uses the gRPC header field path to find the handler for the given (consumer group, topic)
          ■ Kafka metadata (topic, partition, offset, key, etc.) and payload are in DATA frames
        ○ A Kafka message processing result is mapped to a gRPC response
          ■ Uses the gRPC status code and message to indicate how to further process the message
        ○ A thin client on the receiver service side hides the details from users
      ● Makes use of load balancers
        ○ # of receiver instances is not limited by # of partitions
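The mapping can be pictured with plain data classes, with no gRPC library involved. This is a sketch only: the path layout `/consumerproxy/<group>/<topic>` and the choice of which status codes lead to a retry are assumptions made for illustration, since the slides do not specify them. The numeric values are the standard gRPC status codes.

```python
from dataclasses import dataclass
from typing import Optional

# Numeric values of standard gRPC status codes.
OK = 0
INVALID_ARGUMENT = 3
DEADLINE_EXCEEDED = 4
RESOURCE_EXHAUSTED = 8
UNAVAILABLE = 14

# Assumption: which codes count as retriable is illustrative only.
RETRIABLE = {DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, UNAVAILABLE}


@dataclass(frozen=True)
class KafkaMessage:
    topic: str
    partition: int
    offset: int
    key: Optional[bytes]
    value: bytes


@dataclass(frozen=True)
class DispatchRequest:
    path: str            # header field used to find the handler
    topic: str           # the remaining fields travel in the request body
    partition: int
    offset: int
    key: Optional[bytes]
    payload: bytes


def to_request(group: str, msg: KafkaMessage) -> DispatchRequest:
    # Hypothetical path layout keyed by (consumer group, topic).
    path = f"/consumerproxy/{group}/{msg.topic}"
    return DispatchRequest(path, msg.topic, msg.partition, msg.offset,
                           msg.key, msg.value)


def action_for_status(code: int) -> str:
    # Maps the receiver's gRPC status to the processor's next step.
    if code == OK:
        return "ack"
    if code in RETRIABLE:
        return "retry"
    return "dlq"
```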
  22. Control Plane -- Controller
      ● One controller instance serves as the leader and does the real work
        ○ Receives commands from operator/manager
        ○ Manages tasks
          ■ Calculates task assignment across workers
        ○ Manages workers
          ■ Tracks worker liveness
          ■ Sends task assignments to workers
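The talk does not describe the assignment algorithm. As a stand-in, here is a minimal least-loaded placement over the currently live workers; `assign_tasks` and the tie-breaking rule are hypothetical, and a production controller would likely also weigh traffic volume and try to minimize task movement.

```python
from typing import Dict, List, Tuple

# A task is a (cluster, consumer group, topic, partition) tuple.
Task = Tuple[str, str, str, int]


def assign_tasks(tasks: List[Task], workers: List[str]) -> Dict[str, List[Task]]:
    """Places each task on the live worker that currently has the fewest tasks."""
    if not workers:
        raise ValueError("no live workers to assign tasks to")
    assignment: Dict[str, List[Task]] = {w: [] for w in sorted(workers)}
    for task in sorted(tasks):
        # Ties are broken by worker name so the result is deterministic.
        target = min(assignment, key=lambda w: (len(assignment[w]), w))
        assignment[target].append(task)
    return assignment
```

When liveness tracking marks a worker as dead, the leader can call `assign_tasks` again with the remaining workers and push the new assignment to them.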
  23. Control Plane -- Worker
      ● Heartbeats with the controller leader
        ○ Maintains a state machine
        ○ Shuts down the worker on heartbeat failure
      ● Task management
        ○ Gets task assignment from the controller
        ○ Updates task assignment to the Data Plane
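The worker's state machine is not spelled out in the slides. A minimal sketch, assuming three states and a shutdown after a fixed number of consecutive failed heartbeats; the state names and the `max_failures` threshold are hypothetical.

```python
from enum import Enum


class WorkerState(Enum):
    REGISTERING = "registering"  # no successful heartbeat yet
    WORKING = "working"          # heartbeats succeed, tasks may run
    STOPPED = "stopped"          # terminal: the worker has shut itself down


class HeartbeatStateMachine:
    """Hypothetical worker-side state machine driven by heartbeat results."""

    def __init__(self, max_failures: int = 3) -> None:
        self.state = WorkerState.REGISTERING
        self._max_failures = max_failures
        self._failures = 0

    def on_heartbeat(self, ok: bool) -> WorkerState:
        if self.state is WorkerState.STOPPED:
            return self.state  # terminal state, later results are ignored
        if ok:
            self._failures = 0
            self.state = WorkerState.WORKING
        else:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Stop so the controller can safely reassign this worker's tasks.
                self.state = WorkerState.STOPPED
        return self.state
```

Shutting down on heartbeat failure keeps a partitioned worker from processing tasks that the controller may already have handed to another worker.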
  24. Resource Isolation: To improve availability, scalability, and reliability
  25. How to Make It Work at Uber’s Scale
      ● Uber has 500+ pub-sub use cases
        ○ Different availability requirements
        ○ Different E2E latency requirements
        ○ Different traffic volume
        ○ Different message processing time
        ○ And more ...
      3 levels of resource isolation
  26. Clusters & Global Manager
      ● Multiple clusters
        ○ Different environments
          ■ dev, preprod, production
        ○ Different SLAs
        ○ Each cluster has a reasonable size
      ● A global manager
        ○ Runs independently
        ○ Decides which cluster runs each job
          ■ Handles cluster failover
  27. Virtual Worker Pool
      ● Each destination service is assigned a set of workers
        ○ A pool is solely responsible for one service
        ○ Tasks can be dynamically reassigned from one worker to another within the pool
        ○ The pool is dynamically scaled up/down based on workload
        ○ Workers are allocated from a spare pool
  28. Pipeline
      ● Each worker might create multiple pipelines
        ○ Each pipeline is for a (Kafka cluster, consumer group, topic)
        ○ Each pipeline has its own resources -- threads, memory, network connections
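Since a task is keyed by (cluster, consumer group, topic, partition) and a pipeline by (Kafka cluster, consumer group, topic), a worker can derive its pipelines by grouping the tasks it was assigned. A small sketch of that grouping; `group_into_pipelines` is a hypothetical helper, and the per-pipeline threads, memory, and connections are left out.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Task = Tuple[str, str, str, int]     # (cluster, consumer group, topic, partition)
PipelineKey = Tuple[str, str, str]   # (cluster, consumer group, topic)


def group_into_pipelines(tasks: List[Task]) -> Dict[PipelineKey, List[int]]:
    """Groups a worker's tasks so each pipeline owns the partitions of one topic."""
    pipelines: DefaultDict[PipelineKey, List[int]] = defaultdict(list)
    for cluster, group, topic, partition in tasks:
        pipelines[(cluster, group, topic)].append(partition)
    # Sort partitions so the result does not depend on input order.
    return {key: sorted(parts) for key, parts in pipelines.items()}
```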
  29. User Integration
  30. User Onboarding
      ● Update the Consumer Proxy configuration
        ○ Kafka Consumer Proxy uses scripts to check/validate configurations
      ● Implement the message handler
  31. Monitoring - RPC
  32. Monitoring - Consumer Lag
  33. Summary & Future Work
  34. Wrap Up
      ● Kafka Consumer Proxy is a service that runs independently
        ○ Pulls messages from Kafka servers
        ○ Dispatches messages to receiver services
      ● Kafka Consumer Proxy focuses on resolving pain points of pub-sub use cases
        ○ Simplifies pub-sub user implementation
        ○ Gives the Kafka team more control over Kafka consumers
        ○ Improves overall Kafka availability/scalability/reliability
  35. Future Work
      ● Improve worker auto-scaling mechanisms
        ○ Auto-scale virtual pool
        ○ Auto-scale spare worker pool
      ● Design and implement adaptable load balancing and flow control mechanisms
      ● Integrate user notification/alerting about abnormal behaviour
      ● Design and implement a GUI to simplify user onboarding
      ● Eng Blog & Open Source
  36. Q & A. Uber is hiring