Kafka Cluster Federation at Uber
Yupeng Fu, Xiaoman Dong
Streaming Data Team, Uber
Apache Kafka @ Uber
[Diagram: data flows from PRODUCERS through Apache Kafka to CONSUMERS. Labels shown: Rider App, Driver App, Mobile App, API / Services, (Internal) Services, DATABASES (Cassandra, MySQL), Payment, Surge, AWS S3, Samza / Flink, Vertica / Hive, ELK, Hadoop, Real-time Analytics, Alerts, Dashboards, Applications, Data Science, Analytics, Reporting, Debugging, Ad-hoc Exploration, Etc.]
Kafka Scale at Uber
● Trillions of messages / day
● PBs of data / day
● Tens of thousands of topics
● Thousands of services
● Dozens of clusters
(figures exclude replication)
When disaster strikes...
2 AM on Sat morning
● Kafka on-call paged, service owners paged
● Emergency failover of services to another region performed
[Diagram: Region 1, Region 2]
What if ...
2 AM on Sat morning
● Redirect services’ traffic to another cluster
[Diagram: Region 1]
Cluster Federation: Cluster of Clusters
Kafka Users:
Kafka Team:
Cluster Federation: benefits
● Availability
○ Tolerate a single cluster’s downtime without user impact or a region failover
● Scalability
○ Avoid giant Kafka cluster
○ Horizontally scale out Kafka clusters without disrupting users
● Ease of Operation and management
○ Easier maintenance of critical clusters (e.g., decommissioning, rebalancing)
○ Easier topic migration from one cluster to another
○ Easier topic discovery for users without knowing the actual clusters
High-level Design Concepts
● Users view a logical cluster
● A topic has a primary cluster and
secondary cluster(s)
● Clients fetch the topic-cluster mapping and
determine which cluster to connect to
● Dynamic traffic redirection of
consumers/producers without restart
● Data replication between the physical
clusters for redundancy
● Consumer progress sync between the
clusters
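A minimal sketch of the logical-cluster idea above, not Uber's implementation. The class names, the topic name and the fail_over method are hypothetical and exist only for illustration; the sketch shows a topic with one primary and some secondary clusters, and a role swap that clients would observe on their next mapping lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicPlacement:
    """Where one logical topic lives (hypothetical structure)."""
    primary: str
    secondaries: List[str] = field(default_factory=list)


class LogicalCluster:
    """The single logical cluster that users see."""

    def __init__(self) -> None:
        self._placements: Dict[str, TopicPlacement] = {}

    def register(self, topic: str, primary: str, secondaries: List[str]) -> None:
        self._placements[topic] = TopicPlacement(primary, list(secondaries))

    def cluster_for(self, topic: str) -> str:
        # Clients connect to whichever physical cluster is currently primary.
        return self._placements[topic].primary

    def fail_over(self, topic: str, new_primary: str) -> None:
        placement = self._placements[topic]
        if new_primary not in placement.secondaries:
            raise ValueError(f"{new_primary} is not a secondary for {topic}")
        # Swap roles: the old primary becomes a secondary.
        placement.secondaries.remove(new_primary)
        placement.secondaries.append(placement.primary)
        placement.primary = new_primary


logical = LogicalCluster()
logical.register("trips", primary="kafkaA", secondaries=["kafkaB"])
logical.fail_over("trips", "kafkaB")
print(logical.cluster_for("trips"))
```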
Design Challenges
● Producer/Consumer client traffic redirection
● Aggregate and serve the topic-to-cluster mappings
● Replication of data between clusters
● Consumer offset management
Architecture Overview
1. Client fetches metadata from Kafka
Proxy
2. Metadata service manages the
global metadata
3. Data cross-replicated between the
clusters by uReplicator
4. Push-based offset sync between
the clusters
#1 Kafka proxy for traffic redirection
● A proxy server that supports the metadata requests of the Kafka protocol
● Shares the same network implementation as Apache Kafka
● Routes the client to the Kafka cluster for fetch and produce
● Triggers a consumer group rebalance when the primary cluster changes
#1 Kafka proxy and client interaction
[Sequence diagram between Kafka Client and Kafka Proxy]
● Client config: bootstrap.servers: kafka-proxy-01, kafka-proxy-02
● ApiVersionRequest → Configured API version for the clusters
● MetadataRequest → Metadata of kafkaA (primary)
○ Proxy: look up the cache of the primary cluster; cache the primary cluster to client
● (Consumer) GroupCoordinatorRequest → GroupCoordinator response: kafkaA (primary)
● Client metadata: kafkaA-01, kafkaA-02; fetch/produce from/to kafkaA
● metadataUpdate (diagram labels: getLeastLoadedNode, getRandomNode) → Metadata of kafkaB (primary)
● Client metadata: kafkaB-01, kafkaB-02; fetch/produce from/to kafkaB
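The interaction above can be imitated with a toy, in-memory model. This is a sketch under stated assumptions, not the real proxy: MetadataProxy, its methods and the topic name "trips" are hypothetical, and no Kafka wire protocol is involved. It only shows that a client bootstrapped against the proxy receives the brokers of whichever cluster is primary at the time of the metadata request.

```python
class MetadataProxy:
    """Toy stand-in for the proxy: it answers metadata lookups only."""

    def __init__(self, brokers_by_cluster, primary_by_topic):
        self._brokers = brokers_by_cluster   # cluster name -> broker hosts
        self._primary = primary_by_topic     # topic -> primary cluster name
        self._last_returned = {}             # client id -> cluster last handed out

    def metadata(self, client_id, topic):
        cluster = self._primary[topic]
        # Remember which cluster this client was pointed at.
        self._last_returned[client_id] = cluster
        return list(self._brokers[cluster])

    def set_primary(self, topic, cluster):
        self._primary[topic] = cluster


proxy = MetadataProxy(
    {"kafkaA": ["kafkaA-01", "kafkaA-02"], "kafkaB": ["kafkaB-01", "kafkaB-02"]},
    {"trips": "kafkaA"},
)
print(proxy.metadata("client-1", "trips"))  # brokers of kafkaA
proxy.set_primary("trips", "kafkaB")
print(proxy.metadata("client-1", "trips"))  # brokers of kafkaB after the switch
```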
#1 Kafka proxy internals
● Socket Server: serve the incoming metadata requests
● Metadata Provider: collect information from metadata service
● Zookeeper: local metadata cache
● Cluster Manager: manage the clients to the federated clusters
#2 Kafka Metadata Service
● The central service that manages the topic and cluster metadata
information
● Paired with a service that periodically syncs with all the physical
clusters
● Exposes endpoints for setting primary cluster
#2 Kafka Metadata Service
● Single entry point for topic metadata management
○ Topic creation/deletion
○ Partition expansion
○ Blacklist/Quota control etc
[Diagram: a Topic Creation request goes to the Metadata Service, which performs Topic Creation on KafkaA and KafkaB and the replication setup]
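A rough sketch of the single-entry-point idea, using in-memory stand-ins rather than real Kafka admin calls. FakeCluster, MetadataService and their methods are hypothetical names; the sketch assumes topic creation fans out to every federated cluster and registers replication routes in both directions.

```python
class FakeCluster:
    """In-memory stand-in for one physical Kafka cluster."""

    def __init__(self, name):
        self.name = name
        self.topics = {}  # topic -> partition count

    def create_topic(self, topic, partitions):
        self.topics[topic] = partitions


class MetadataService:
    """Hypothetical central service for topic and cluster metadata."""

    def __init__(self, clusters):
        self._clusters = {c.name: c for c in clusters}
        self.primary = {}          # topic -> primary cluster name
        self.replication = set()   # (source, destination, topic) routes

    def create_topic(self, topic, partitions, primary):
        if primary not in self._clusters:
            raise KeyError(primary)
        # Create the topic on every federated cluster.
        for cluster in self._clusters.values():
            cluster.create_topic(topic, partitions)
        # Register a replication route for every ordered cluster pair.
        for src in self._clusters:
            for dst in self._clusters:
                if src != dst:
                    self.replication.add((src, dst, topic))
        self.primary[topic] = primary

    def set_primary(self, topic, cluster):
        if cluster not in self._clusters:
            raise KeyError(cluster)
        self.primary[topic] = cluster
```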
#3 Data replication - uReplicator
● Uber’s Kafka replication service derived from MirrorMaker
● Goals
○ Optimized and stable replication, e.g. rebalance only occurs during startup
○ Operate with ease, e.g. add/remove whitelists
○ Scalable, High throughput
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/
#3 Data replication - cont’d
Improvements for Federation
● Header-based filter to avoid cyclic replication
○ Source cluster info written into message header
○ Messages will not be replicated back to their original cluster
○ Bi-directional replication becomes simple and easy
● Improved offset mapping for consumer management
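The header-based filter can be sketched as follows for the two-cluster case. This is an illustration, not uReplicator's code: the header key "source-cluster" and both function names are assumptions, and headers are modelled as a plain dict.

```python
SOURCE_HEADER = "source-cluster"  # hypothetical header key


def stamp(headers, local_cluster):
    """Return headers with the origin cluster recorded, if not already set."""
    stamped = dict(headers)
    stamped.setdefault(SOURCE_HEADER, local_cluster)
    return stamped


def should_replicate(headers, destination):
    """Skip messages whose origin is the destination cluster."""
    return headers.get(SOURCE_HEADER) != destination


# A message produced in kafkaA is copied to kafkaB ...
msg = stamp({}, "kafkaA")
print(should_replicate(msg, "kafkaB"))
# ... but the kafkaB -> kafkaA replicator does not send it back.
msg_in_b = stamp(msg, "kafkaB")
print(should_replicate(msg_in_b, "kafkaA"))
```

Because the origin is stamped only once, the same filter works unchanged in both directions.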
#4 Offset Management - Solutions
● Consumer should resume after switching cluster
● They will rejoin consumer group with the same name
● Offset Solutions
○ Resume from largest offset → Data Loss
○ Resume from smallest offset → Lots of Backlog & Duplicates
○ Resume by timestamp → Complicated & Not Reliable
○ Trying to make topic offsets the same → Nearly impossible
○ ✅ Offset manipulation by a dedicated service
#4 Offset Management - Offset Mapping
Goal: no data loss
● uReplicator copies data between clusters
● uReplicator knows the offset mapping
between clusters
● Offset mappings are reported periodically
into a DB
● Starting consumption from the mapped
offset pair can guarantee no data loss
#4 Offset Management - Consumer Group
Example for a specific topic partition
1. Consumer commits offset 17
2. Offset sync service
a. Queries the Store, closest offset pair is
(13 mapped to 29)
b. Commits offset 29 into Kafka B
3. Consumer redirected to Cluster B
a. Joins consumer group with same name
b. Resumes from offset 29 -- no loss
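The lookup in step 2 can be sketched with a binary search. The function name and the extra checkpoint pairs are hypothetical; the sketch assumes "closest" means the nearest stored pair at or below the committed offset, which is what keeps the translated offset from skipping unread messages (at the cost of re-reading a few).

```python
from bisect import bisect_right


def translate_offset(committed, checkpoints):
    """Map a committed source offset to a destination offset.

    checkpoints: list of (source_offset, destination_offset) pairs,
    sorted by source_offset.
    """
    source_offsets = [src for src, _ in checkpoints]
    idx = bisect_right(source_offsets, committed) - 1
    if idx < 0:
        return None  # no checkpoint at or below the committed offset
    return checkpoints[idx][1]


# (13 -> 29) is the pair from the slide; the other pairs are made up.
checkpoints = [(5, 21), (13, 29), (21, 37)]
print(translate_offset(17, checkpoints))  # 29
```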
#4 Offset Management - Efficient Update
Kafka __consumer_offsets internal topic
● Kafka internal storage of consumer groups
● Each message in it is a changelog of consumer groups
● All offset commits are written as Kafka messages
● Can have huge traffic (thousands of messages per second)
#4 Offset Management - Efficient Update
Offset Sync: A Streaming Job
● Reads from __consumer_offsets topic
● Compacts offset commits into batches
● Then converts and updates the committed
offset into offset of other cluster(s)
The job monitors and reports all consumer group
metrics conveniently. Open source planned.
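The compaction step can be sketched as keeping only the latest commit per group and partition within a batch. This is an assumption-laden illustration: the real __consumer_offsets records are binary-encoded, whereas this sketch takes already-decoded (group, topic, partition, offset) tuples, and compact_commits is a hypothetical name.

```python
def compact_commits(commits):
    """Keep the last committed offset per (group, topic, partition).

    commits: iterable of (group, topic, partition, offset) tuples
    in the order they were read from the changelog.
    """
    latest = {}
    for group, topic, partition, offset in commits:
        latest[(group, topic, partition)] = offset
    return latest


batch = [
    ("billing", "trips", 0, 15),
    ("billing", "trips", 0, 17),
    ("billing", "trips", 1, 8),
]
print(compact_commits(batch))
```

Only the compacted offsets then need to be converted and committed to the other cluster(s), which reduces the write traffic compared with forwarding every commit.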
[Six diagram slides: Federation In Action (1/6) through (6/6)]
Tradeoff and limitation
● Data redundancy for higher availability: 2X replicas with 2 clusters
● Message out of order during failover transition
● Topic level federation is challenging for REST Proxy, and also for consumers
that subscribe to several topics or a pattern
● Consumer has to rely on Kafka clusters to manage offsets (e.g., not friendly to
some Flink consumers)
Q&A
Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains
information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified
that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate,
or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent
necessary for consultations with authorized personnel of Uber.
Highly available Kafka at Uber: Active-active
● Provide business resilience and continuity as
the top priority
● Active-active in multiple regions
○ Data produced locally via REST proxy
○ Data aggregated to agg cluster
○ Active-active consumers
● Issues
○ Failover coordination and communication
required
○ Data unavailable in regional cluster during
downtime
Highly available Kafka at Uber: secondary cluster
● Provide business resilience and continuity as
the top priority
● When regional cluster is unavailable
○ Data produced to secondary cluster
○ Then replicated to regional when it’s back
● Issues
○ Unused capacity when regional cluster is
up
○ Regional cluster unavailable for
consumption during downtime
Topic Migration Challenge

Kafka Cluster Federation at Uber (Yupeng Fu & Xiaoman Dong, Uber), Kafka Summit SF 2019

Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
Ayan Halder
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
TaghreedAltamimi
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 

Recently uploaded (20)

Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Requirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional SafetyRequirement Traceability in Xen Functional Safety
Requirement Traceability in Xen Functional Safety
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 

Kafka Cluster Federation at Uber (Yupeng Fu & Xiaoman Dong, Uber) Kafka Summit SF 2019

  • 1. Kafka Cluster Federation at Uber Yupeng Fu, Xiaoman Dong Streaming Data Team, Uber
  • 2. Apache Kafka @ Uber (diagram labels): PRODUCERS; CONSUMERS; Real-time Analytics, Alerts, Dashboards; Samza / Flink; Applications; Data Science; Analytics; Reporting; Apache Kafka; Vertica / Hive; Rider App; Driver App; API / Services; Etc.; Ad-hoc Exploration; ELK; Debugging; Hadoop; Surge; Mobile App; Cassandra; MySQL; DATABASES; (Internal) Services; AWS S3; Payment
  • 3. Kafka Scale at Uber ● Trillions of messages / day ● PBs of data / day (excluding replication) ● Tens of thousands of topics ● Thousands of services ● Dozens of clusters
  • 4. When disaster strikes... 2 AM on Sat morning, Region 1: Kafka on-call paged, service owners paged. Emergency failover of services to another region (Region 2) performed.
  • 5. What if ... 2 AM on Sat morning, Region 1: redirect services’ traffic to another cluster
  • 6. Cluster Federation: Cluster of Clusters Kafka Users: Kafka Team:
  • 7. Cluster Federation: benefits ● Availability ○ Tolerate a single cluster's downtime without user impact or region failover ● Scalability ○ Avoid a giant Kafka cluster ○ Horizontally scale out Kafka clusters without disrupting users ● Ease of operation and management ○ Easier maintenance of critical clusters, e.g. decommissioning and rebalancing ○ Easier topic migration from one cluster to another ○ Easier topic discovery for users, without knowing the actual clusters
  • 8. High-level Design Concepts ● Users view a logical cluster ● A topic has a primary cluster and secondary cluster(s) ● Clients fetch the topic-cluster mapping and determine which cluster to connect to ● Dynamic traffic redirection of consumers/producers without restart ● Data replication between the physical clusters for redundancy ● Consumer progress sync between the clusters
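The topic-to-cluster mapping described on this slide can be pictured with a small in-memory sketch. This is only an illustration under assumptions, not Uber's implementation: the class names `TopicPlacement` and `FederationMetadata`, the method names, and the topic name `trips` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicPlacement:
    """Where one logical topic lives (hypothetical structure)."""
    primary: str
    secondaries: List[str] = field(default_factory=list)


class FederationMetadata:
    """Hypothetical in-memory topic-to-cluster mapping."""

    def __init__(self) -> None:
        self._placements: Dict[str, TopicPlacement] = {}

    def register(self, topic: str, primary: str, secondaries=()) -> None:
        self._placements[topic] = TopicPlacement(primary, list(secondaries))

    def cluster_for(self, topic: str) -> str:
        # Clients are always pointed at the topic's current primary cluster.
        return self._placements[topic].primary

    def secondaries_for(self, topic: str) -> List[str]:
        return list(self._placements[topic].secondaries)

    def fail_over(self, topic: str, new_primary: str) -> None:
        # Promote a secondary; the old primary becomes a secondary.
        placement = self._placements[topic]
        if new_primary not in placement.secondaries:
            raise ValueError(f"{new_primary} is not a secondary of {topic}")
        placement.secondaries.remove(new_primary)
        placement.secondaries.append(placement.primary)
        placement.primary = new_primary
```

For example, after `register("trips", "kafkaA", ["kafkaB"])`, a call to `fail_over("trips", "kafkaB")` makes `cluster_for("trips")` return `"kafkaB"`, which is the kind of redirection the slide describes happening without a client restart.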
  • 9. Design Challenges ● Producer/Consumer client traffic redirection ● Aggregate and serve the topic-to-cluster mappings ● Replication of data between clusters ● Consumer offset management
  • 10. Architecture Overview 1. Client fetches metadata from Kafka Proxy 2. Metadata service manages the global metadata 3. Data cross-replicated between the clusters by uReplicator 4. Push-based offset sync between the clusters
  • 12. #1 Kafka proxy for traffic redirection ● A proxy server that supports the Kafka protocol for metadata requests ● Shares the same network implementation as Apache Kafka ● Routes the client to the Kafka cluster for fetch and produce ● Triggers a consumer group rebalance when the primary cluster changes
  • 13. #1 Kafka proxy and client interaction (sequence diagram between Kafka Client and Kafka Proxy) ● Kafka Client bootstrap.servers: kafka-proxy-01, kafka-proxy-02 ● ApiVersionRequest → Configured API version for the clusters ● MetadataRequest → Metadata of kafkaA (primary); proxy looks up the cache of the primary cluster and caches the primary cluster to client ● (Consumer)GroupCoordinatorRequest → GroupCoordinator response, kafkaA (primary) ● Client metadata: kafkaA-01, kafkaA-02; fetch/produce from/to kafkaA ● metadataUpdate (getLeastLoadedNode, getRandomNode) → Metadata of kafkaB (primary) ● Client metadata: kafkaB-01, kafkaB-02; fetch/produce from/to kafkaB
  • 14. #1 Kafka proxy internals ● Socket Server: serve the incoming metadata requests ● Metadata Provider: collect information from metadata service ● Zookeeper: local metadata cache ● Cluster Manager: manage the clients to the federated clusters
  • 16. #2 Kafka Metadata Service ● The central service that manages the topic and cluster metadata information ● Paired with a service that periodically syncs with all the physical clusters ● Exposes endpoints for setting primary cluster
  • 17. #2 Kafka Metadata Service ● Single entry point for topic metadata management ○ Topic creation/deletion ○ Partition expansion ○ Blacklist/Quota control etc. (Diagram labels: Topic Creation → Metadata Service → Topic Creation on KafkaA and KafkaB, replication setup)
  • 19. #3 Data replication - uReplicator ● Uber’s Kafka replication service derived from MirrorMaker ● Goals ○ Optimized and stable replication, e.g. rebalance only occurs during startup ○ Operate with ease, e.g. add/remove whitelists ○ Scalable, High throughput ● Open sourced: https://github.com/uber/uReplicator ● Blog: https://eng.uber.com/ureplicator/
  • 20. #3 Data replication - cont’d Improvements for Federation ● Header-based filter to avoid cyclic replication ○ Source cluster info written into message header ○ Messages will not be replicated back to their original cluster ○ Bi-directional replication becomes simple and easy ● Improved offset mapping for consumer management
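A rough sketch of how a header-based filter could stop cyclic replication is shown here. It is not uReplicator code: the header key `origin-cluster`, the function name `replicate`, and the dict-based message shape are assumptions made only for illustration.

```python
from typing import Optional

# Hypothetical header key; the slides do not name the real one.
ORIGIN_HEADER = "origin-cluster"


def replicate(message: dict, source: str, destination: str) -> Optional[dict]:
    """Return the copy to write to `destination`, or None to skip it.

    `message` is modeled as a plain dict with an optional "headers" dict.
    """
    headers = dict(message.get("headers", {}))
    origin = headers.get(ORIGIN_HEADER)
    if origin == destination:
        # The message was first produced in the destination cluster,
        # so copying it back would create a replication cycle.
        return None
    if origin is None:
        # First hop: stamp the cluster the message was produced to.
        headers[ORIGIN_HEADER] = source
    copy = dict(message)
    copy["headers"] = headers
    return copy
```

With two clusters A and B replicating in both directions, a message produced to A is copied to B once with its origin stamped as A, and the B-to-A replicator then drops it because the origin equals the destination.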
  • 26. ● Consumer should resume after switching cluster ● They will rejoin consumer group with the same name ● Offset Solutions ○ Resume from largest offset → Data Loss ○ Resume from smallest offset → Lots of Backlog & Duplicates ○ Resume by timestamp → Complicated & Not Reliable ○ Trying to make topic offsets the same → Nearly impossible ○ ✅ Offset manipulation by a dedicated service #4 Offset Management - Solutions
  • 28. #4 Offset Management - Offset Mapping Goal: no data loss ● uReplicator copies data between clusters ● uReplicator knows the offset mapping between clusters ● Offset mappings are reported periodically into a DB ● Consuming starting from the mapped offset pair can guarantee no data loss
  • 32. #4 Offset Management - Consumer Group Example for a specific topic partition 1. Consumer commits offset 17 2. Offset sync service a. Queries the Store, closest offset pair is (13 mapped to 29) b. Commits offset 29 into Kafka B 3. Consumer redirected to Cluster B a. Joins consumer group with same name b. Resumes from offset 29 -- no loss
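The lookup in step 2a can be sketched as a search over stored offset pairs. This is a minimal illustration, not the actual offset sync service: it assumes "closest" means the nearest stored pair at or below the committed offset (which is what avoids data loss, at the cost of some re-reads), and the function name `translate_offset` and the extra sample pairs are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple


def translate_offset(
    checkpoints: List[Tuple[int, int]], committed: int
) -> Optional[int]:
    """Map an offset committed on the source cluster to the destination.

    `checkpoints` holds (source_offset, destination_offset) pairs sorted
    by source offset. The closest pair at or below `committed` is used,
    so the consumer may re-read a few messages but never skips any.
    Returns None when no stored pair is old enough.
    """
    source_offsets = [src for src, _ in checkpoints]
    index = bisect.bisect_right(source_offsets, committed) - 1
    if index < 0:
        return None
    return checkpoints[index][1]


# Pair (13, 29) comes from the slide; the other two pairs are made up.
pairs = [(5, 20), (13, 29), (21, 40)]
resume_at = translate_offset(pairs, 17)
```

Here `resume_at` is 29, matching the slide: the consumer that committed offset 17 on Kafka A resumes from offset 29 on Kafka B and re-reads the messages between the checkpoint and its commit rather than losing any.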
  • 33. #4 Offset Management - Efficient Update Kafka __consumer_offsets internal topic ● Kafka internal storage of consumer groups ● Each message in it is a changelog of consumer groups ● All offset commits are written as Kafka messages ● Can have huge traffic (thousands of messages per second)
  • 34. #4 Offset Management - Efficient Update Offset Sync: A Streaming Job ● Reads from __consumer_offsets topic ● Compacts offset commits into batches ● Then converts and updates the committed offset into offset of other cluster(s) The job monitors and reports all consumer group metrics conveniently. Open source planned.
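The "compacts offset commits into batches" step could look roughly like the following. This is a sketch under assumptions rather than the streaming job itself: commits are modeled as plain tuples instead of decoded `__consumer_offsets` records, and the function name `compact_commits` is hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (group, topic, partition, offset) as read in log order; a simplified
# stand-in for decoded __consumer_offsets records.
Commit = Tuple[str, str, int, int]
Key = Tuple[str, str, int]


def compact_commits(commits: Iterable[Commit]) -> Dict[Key, int]:
    """Keep only the last commit seen per (group, topic, partition).

    The last commit wins rather than the largest offset, because a
    consumer group may legitimately reset its offset backwards.
    """
    latest: Dict[Key, int] = {}
    for group, topic, partition, offset in commits:
        latest[(group, topic, partition)] = offset
    return latest
```

Compacting a window of commits this way means the sync job only has to translate and write one offset per group and partition per batch, instead of one per commit message.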
  • 41. Tradeoff and limitation ● Data redundancy for higher availability: 2X replicas with 2 clusters ● Message out of order during failover transition ● Topic level federation is challenging for REST Proxy, and also for consumers that subscribe to several topics or a pattern ● Consumer has to rely on Kafka clusters to manage offsets (e.g., not friendly to some Flink consumers)
  • 42. Q&A
  • 43. Proprietary and confidential © 2019 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.
  • 44. Highly available Kafka at Uber: Active-active ● Provide business resilience and continuity as the top priority ● Active-active in multiple regions ○ Data produced locally via REST Proxy ○ Data aggregated to agg cluster ○ Active-active consumers ● Issues ○ Failover coordination and communication required ○ Data unavailable in regional cluster during downtime
  • 45. Highly available Kafka at Uber: secondary cluster ● Provide business resilience and continuity as the top priority ● When regional cluster is unavailable ○ Data produced to secondary cluster ○ Then replicated to regional when it’s back ● Issues ○ Unused capacity when regional cluster is up ○ Regional cluster unavailable for consumption during downtime