Kafka Cluster Federation at Uber
Yupeng Fu, Xiaoman Dong
Streaming Data Team, Uber
Kafka Summit SF 2019
Apache Kafka @ Uber
[Ecosystem diagram: producers and consumers around Apache Kafka]
● Producers: Rider App, Driver App, API / Services, (Internal) Services, Mobile App, Payment, databases (Cassandra, MySQL), etc.
● Consumers: real-time analytics, alerts and dashboards (Samza / Flink), applications, data science, analytics and reporting (Vertica / Hive), ad-hoc exploration and debugging (ELK), Hadoop, Surge, AWS S3
Kafka Scale at Uber
● Trillions of messages / day
● PBs of data / day (excluding replication)
● Tens of thousands of topics
● Thousands of services
● Dozens of clusters
When disaster strikes...
● 2 AM on a Saturday morning
● Kafka on-call paged, service owners paged
● Emergency failover of services to another region performed (Region 1 → Region 2)
What if ...
● 2 AM on a Saturday morning, Region 1
● Redirect services’ traffic to another cluster instead of failing over the whole region
Cluster Federation: Cluster of Clusters
[Diagram: Kafka users see a single logical cluster; the Kafka team operates the physical clusters behind it]
Cluster Federation: benefits
● Availability
○ Tolerate a single cluster downtime without user impact and region failover
● Scalability
○ Avoid giant Kafka cluster
○ Horizontally scale out Kafka clusters without disrupting users
● Ease of operation and management
○ Easier maintenance of critical clusters, e.g. decommissioning, rebalancing
○ Easier topic migration from one cluster to another
○ Easier topic discovery for users without knowing the actual clusters
High-level Design Concepts
● Users see a single logical cluster
● A topic has a primary cluster and secondary cluster(s)
● Clients fetch the topic-to-cluster mapping and determine which cluster to connect to
● Dynamic traffic redirection of consumers/producers without restart
● Data replication between the physical clusters for redundancy
● Consumer progress sync between the clusters
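As a purely illustrative sketch (not Uber's actual schema), the topic-to-cluster mapping a client fetches could be as small as a route record per topic; topic and cluster names below are made up:

```java
// Hypothetical shape of the topic-to-cluster mapping fetched by federated clients.
import java.util.List;
import java.util.Map;

public class TopicClusterMapping {
    // Each topic routes to one primary cluster plus secondary cluster(s).
    public record TopicRoute(String primaryCluster, List<String> secondaryClusters) {}

    public static void main(String[] args) {
        Map<String, TopicRoute> routes = Map.of(
            "rider-events",   new TopicRoute("kafkaA", List.of("kafkaB")),
            "payment-events", new TopicRoute("kafkaB", List.of("kafkaA")));

        // A client resolves the physical cluster before connecting.
        TopicRoute route = routes.get("rider-events");
        System.out.println("produce/fetch against primary: " + route.primaryCluster());
    }
}
```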
Design Challenges
● Producer/Consumer client traffic redirection
● Aggregate and serve the topic-to-cluster mappings
● Replication of data between clusters
● Consumer offset management
Architecture Overview
1. Client fetches metadata from the Kafka Proxy
2. Metadata service manages the global metadata
3. Data cross-replicated between the clusters by uReplicator
4. Push-based offset sync between the clusters
#1 Kafka proxy for traffic redirection
● A proxy server that speaks the Kafka protocol for metadata requests
● Shares the same network implementation as Apache Kafka
● Routes the client to the right Kafka cluster for fetch and produce
● Triggers a consumer group rebalance when the primary cluster changes
#1 Kafka proxy and client interaction
The client bootstraps against the proxy fleet (bootstrap.servers: kafka-proxy-01, kafka-proxy-02) instead of a physical cluster; the proxy answers the metadata exchange:
● ApiVersionRequest → the configured API versions for the clusters
● MetadataRequest → the proxy looks up its cache of the primary cluster and returns the metadata of kafkaA (primary): brokers kafkaA-01, kafkaA-02
● (Consumer)GroupCoordinatorRequest → GroupCoordinator response pointing at kafkaA (primary); the primary cluster is cached for the client
● The client then fetches/produces from/to kafkaA directly
● On a later metadataUpdate (via getLeastLoadedNode / getRandomNode), the proxy can return the metadata of kafkaB as the new primary (kafkaB-01, kafkaB-02), and the client redirects its fetch/produce traffic to kafkaB
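A minimal client-side sketch: the only federation-specific detail is that bootstrap.servers points at the proxy fleet named on the slide (ports and the topic name are illustrative):

```java
// The proxy is just another bootstrap endpoint from the client's point of view:
// it answers the metadata requests, and fetches then go straight to the brokers
// of whichever primary cluster it returned.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FederatedConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Bootstrap against the proxy fleet, not a physical Kafka cluster.
        props.put("bootstrap.servers", "kafka-proxy-01:9092,kafka-proxy-02:9092");
        props.put("group.id", "demo-group");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("rider-events"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset());
            }
        }
    }
}
```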
#1 Kafka proxy internals
● Socket Server: serves the incoming metadata requests
● Metadata Provider: collects information from the metadata service
● Zookeeper: local metadata cache
● Cluster Manager: manages client connections to the federated clusters
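A minimal sketch of the Metadata Provider plus local cache idea: resolve the primary cluster for a topic from a cache and fall back to the metadata service on a miss. MetadataServiceClient is a hypothetical interface, not Uber's actual component (the real proxy keeps this cache in ZooKeeper):

```java
// Hypothetical primary-cluster cache inside the proxy.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrimaryClusterCache {
    public interface MetadataServiceClient {
        String fetchPrimaryCluster(String topic);  // e.g. returns "kafkaA"
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final MetadataServiceClient metadataService;

    public PrimaryClusterCache(MetadataServiceClient metadataService) {
        this.metadataService = metadataService;
    }

    /** Cluster whose brokers the proxy hands back in metadata responses. */
    public String primaryClusterFor(String topic) {
        return cache.computeIfAbsent(topic, metadataService::fetchPrimaryCluster);
    }

    /** Called when the metadata service announces a primary change; the next
     *  metadata response points clients at the new cluster, which also
     *  triggers a consumer group rebalance. */
    public void onPrimaryChanged(String topic, String newCluster) {
        cache.put(topic, newCluster);
    }
}
```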
#2 Kafka Metadata Service
● The central service that manages the topic and cluster metadata information
● Paired with a service that periodically syncs with all the physical clusters
● Exposes endpoints for setting the primary cluster
#2 Kafka Metadata Service
● Single entry point for topic metadata management
○ Topic creation/deletion
○ Partition expansion
○ Blacklist / quota control, etc.
[Diagram: a topic-creation request to the Metadata Service fans out as topic creation on both KafkaA and KafkaB, followed by replication setup between them]
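For illustration only, the fan-out could be sketched with the standard Kafka AdminClient; broker addresses, partition counts, and the replication-setup comment are placeholders, not Uber's metadata service API:

```java
// Hedged sketch: create the topic on each physical cluster, then register replication.
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class FederatedTopicCreation {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        String topic = "rider-events";
        for (String bootstrap : List.of("kafkaA-01:9092", "kafkaB-01:9092")) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrap);
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 3 -- illustrative values only.
                admin.createTopics(List.of(new NewTopic(topic, 3, (short) 3))).all().get();
            }
        }
        // The metadata service would then add uReplicator whitelists so the topic
        // is cross-replicated between the two clusters.
    }
}
```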
#3 Data replication - uReplicator
● Uber’s Kafka replication service derived from MirrorMaker
● Goals
○ Optimized and stable replication, e.g. rebalance only occurs during startup
○ Easy to operate, e.g. add/remove whitelists
○ Scalable, high throughput
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/
#3 Data replication - cont’d
Improvements for Federation
● Header-based filter to avoid cyclic replication
○ Source cluster info written into message header
○ Messages are not replicated back to their original cluster
○ Bi-directional replication becomes simple and easy
● Improved offset mapping for consumer management
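A minimal sketch of the loop filter, assuming a header key of "source-cluster" (the actual header key and format used by uReplicator are not given in the talk): a record is skipped when it originally came from the cluster it is about to be replicated into.

```java
// Hedged sketch of header-based cyclic-replication filtering.
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class ReplicationLoopFilter {
    private static final String SOURCE_HEADER = "source-cluster";  // assumed name

    /** Returns null if the record must not be replicated into destinationCluster. */
    public static ProducerRecord<byte[], byte[]> toReplica(
            ConsumerRecord<byte[], byte[]> record,
            String sourceCluster,
            String destinationCluster) {

        Header origin = record.headers().lastHeader(SOURCE_HEADER);
        String originCluster = origin == null
                ? sourceCluster  // first hop: the record originated in the source cluster
                : new String(origin.value(), StandardCharsets.UTF_8);

        if (originCluster.equals(destinationCluster)) {
            return null;  // would replicate the message back to its original cluster
        }

        ProducerRecord<byte[], byte[]> out = new ProducerRecord<>(
                record.topic(), null, record.key(), record.value(), record.headers());
        if (origin == null) {
            // Stamp the origin so downstream replicators can apply the same check.
            out.headers().add(SOURCE_HEADER, originCluster.getBytes(StandardCharsets.UTF_8));
        }
        return out;
    }
}
```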
#4 Offset Management - Solutions
● Consumers should resume after switching clusters
● They will rejoin the consumer group with the same name
● Offset solutions considered:
○ Resume from largest offset → Data Loss
○ Resume from smallest offset → Lots of Backlog & Duplicates
○ Resume by timestamp → Complicated & Not Reliable
○ Trying to make topic offsets the same → Nearly impossible
○ ✅ Offset manipulation by a dedicated service
#4 Offset Management - Offset Mapping
Goal: no data loss
● uReplicator copies data between clusters
● uReplicator knows the offset mapping between clusters
● Offset mappings are reported periodically into a DB
● Starting consumption from the mapped offset pair guarantees no data loss
#4 Offset Management - Consumer Group
Example for a specific topic partition:
1. Consumer commits offset 17 (in Kafka A)
2. Offset sync service
a. Queries the store; the closest offset pair is (13 mapped to 29)
b. Commits offset 29 into Kafka B
3. Consumer redirected to cluster B
a. Joins the consumer group with the same name
b. Resumes from offset 29 -- no loss
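A sketch of the translation in step 2, assuming the mapping store is just a sorted map of (Kafka A offset → Kafka B offset) pairs for the partition; the numbers mirror the example above, and the commit into Kafka B reuses the standard consumer API rather than whatever the offset sync service actually does internally:

```java
// Hedged sketch of offset translation and commit into the destination cluster.
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetTranslation {
    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("rider-events", 0);

        // Offset pairs (offset in Kafka A -> offset in Kafka B) reported periodically
        // by uReplicator for this partition; values are illustrative.
        TreeMap<Long, Long> mapping = new TreeMap<>(Map.of(5L, 11L, 13L, 29L, 21L, 40L));

        long committedInA = 17L;                                      // step 1: commit 17
        long translated = mapping.floorEntry(committedInA).getValue(); // step 2a: (13 -> 29)

        // Step 2b: commit the translated offset into Kafka B under the same group
        // name, before the consumers are redirected there.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafkaB-01:9092");
        props.put("group.id", "demo-group");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> kafkaB = new KafkaConsumer<>(props)) {
            kafkaB.assign(List.of(tp));
            kafkaB.commitSync(Map.of(tp, new OffsetAndMetadata(translated)));
        }
        // Step 3: when the group rejoins on Kafka B it resumes from offset 29 -- no loss.
    }
}
```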
#4 Offset Management - Efficient Update
Kafka __consumer_offsets internal topic
● Kafka's internal storage of consumer group state
● Each message is a changelog entry for a consumer group
● All offset commits are written as Kafka messages
● Can carry huge traffic (thousands of messages per second)
#4 Offset Management - Efficient Update
Offset Sync: A Streaming Job
● Reads from the __consumer_offsets topic
● Compacts offset commits into batches
● Translates each committed offset into the corresponding offset on the other cluster(s) and commits it there
The job also monitors and reports all consumer group metrics conveniently. Open sourcing is planned.
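A rough sketch of the compaction step only, assuming a plain Java consumer rather than the actual streaming framework: messages in __consumer_offsets are keyed by (group, topic, partition), so keeping the latest value per key collapses thousands of commits into one batch. Decoding the internal key/value format and the cross-cluster translation are deliberately left out.

```java
// Hedged sketch of batching/compacting __consumer_offsets traffic.
import java.nio.ByteBuffer;
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetSyncCompactor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafkaA-01:9092");
        props.put("group.id", "offset-sync-job");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("__consumer_offsets"));
            while (true) {
                // Latest commit per (group, topic, partition) key wins within the batch.
                Map<ByteBuffer, byte[]> latest = new HashMap<>();
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofSeconds(5))) {
                    if (r.key() != null && r.value() != null) {
                        latest.put(ByteBuffer.wrap(r.key()), r.value());
                    }
                }
                // The real job would decode each entry, translate the offset via the
                // uReplicator mapping store, and commit it into the other cluster(s).
            }
        }
    }
}
```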
Federation In Action (1/6 - 6/6)
[Six diagram-only slides walking through the federation in action]
Tradeoffs and limitations
● Data redundancy for higher availability: 2X replicas with 2 clusters
● Messages can arrive out of order during the failover transition
● Topic-level federation is challenging for the REST proxy, and also for consumers that subscribe to several topics or a pattern
● Consumers have to rely on the Kafka clusters to manage offsets (e.g., not friendly to some Flink consumers, which checkpoint offsets themselves)
Q&A
Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains
information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified
that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate,
or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent
necessary for consultations with authorized personnel of Uber.
Highly available Kafka at Uber: active-active
● Provide business resilience and continuity as the top priority
● Active-active in multiple regions
○ Data produced locally via REST proxy
○ Data aggregated to the agg cluster
○ Active-active consumers
● Issues
○ Failover coordination and communication required
○ Data unavailable in the regional cluster during downtime
Highly available Kafka at Uber: secondary cluster
● Provide business resilience and continuity as the top priority
● When the regional cluster is unavailable
○ Data produced to the secondary cluster
○ Then replicated to the regional cluster when it's back
● Issues
○ Unused capacity when the regional cluster is up
○ Regional cluster unavailable for consumption during downtime
Topic Migration Challenge
[Diagram-only slides illustrating the topic migration challenge]
