Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Yupeng Fu
Streaming Data Team, Uber
Apr 29, 2019
About myself
● Yupeng Fu
● Staff Engineer @ Uber
● Streaming Data
● Worked at Alluxio, Palantir
● UCSD & Tsinghua
Data Infrastructure @ Uber
[Diagram: PRODUCERS (Rider App, Driver App, API / Services, Mobile App, Surge, Payment, internal services, etc.) and DATABASES (Cassandra, MySQL) feed into Apache Kafka; CONSUMERS include Samza / Flink (real-time analytics, alerts, dashboards), applications, data science, analytics and reporting via Vertica / Hive, ad-hoc exploration, debugging via ELK, Hadoop, and AWS S3.]
Apache Kafka at Uber
● General pub-sub, messaging queue (minimal sketch after this list)
● Stream processing
○ AthenaX - self-service streaming analytics platform (Apache Samza &
Apache Flink)
● Database changelog transport
○ Cassandra, MySQL, etc.
● Ingestion into data lake
○ HDFS, S3
● Logging
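For orientation, here is the pub-sub role from the list above as a minimal sketch using the standard Java client; the broker address and topic name are placeholders, not Uber's actual setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PubSubSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker address; any number of consumer groups can
        // independently subscribe to the topic written here.
        props.put("bootstrap.servers", "kafka-regional:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish an event keyed by trip id (topic/key/value are made up).
            producer.send(new ProducerRecord<>("trip-events", "trip-123",
                    "{\"status\":\"started\"}"));
        }
    }
}
```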
Scale (excluding replication)
● Trillions of messages / day
● PBs of data / day
● Tens of thousands of topics
● Thousands of services
Multi-region at Uber
● Provide business resilience and continuity as the top priority
○ Survive outages and disasters without major business impact
○ Region isolation to avoid cascading failure
● Take care of the customer experience
○ Serve user requests in a closer region
○ Data integrity and consistency matters
● Improve infrastructure flexibility and efficiency
○ Decrease compliance and policy risks
○ Leverage both on-premise and cloud partners
Considerations for apps/services
● Highly available
○ Auto and on-demand region failover
● Highly flexible
○ Stateless and mobile
○ Data sharded by Geo
● Tradeoffs in SLA
○ Local data vs aggregated view
○ Latency vs consistency
● Leverage active-active storage layer for state sharing
Considerations for Apache Kafka
● Producer
○ Data produced locally
● Data aggregation
○ Topics replicated to agg clusters
● Active-active consumers
○ Double compute (consumer sketch after this list)
○ Data ingestion
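To make the active-active pattern concrete, a minimal sketch: each region runs the identical consumer group against its own agg cluster, so both regions compute the full result independently ("double compute"). Cluster, group, and topic names are invented:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ActiveActiveConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Each region points at its own agg cluster; the other region runs
        // the same loop against r2-agg, so both hold the full computed state.
        props.put("bootstrap.servers", "r1-agg:9092");
        props.put("group.id", "surge-pricing");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("trip-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Update local pricing state here; an active-active
                    // storage layer would share results across regions.
                }
            }
        }
    }
}
```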
Active-active example: surge
● Real-time dynamic pricing
● Critical service with strict SLA
● Heavy distributed computation
● Large memory footprint
● Latency over consistency
[Diagram: dynamic pricing loop between Rider and Driver]
Active-active example: surge
[Diagram: the surge pipeline deployed active-active across two regions]
Data replication - uReplicator
● Uber’s Apache Kafka replication service (core copy loop sketched below)
● Goals
○ Stable replication, e.g. rebalance only occurs during startup
○ Operate with ease, e.g. add/remove whitelists
○ Scalable
○ High throughput
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/
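uReplicator itself is a fleet of workers managed by Apache Helix, but the essence of each worker is a consume-from-source, produce-to-destination copy loop. A simplified sketch with made-up cluster and topic names (not uReplicator's actual code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ReplicationWorkerSketch {
    public static void main(String[] args) {
        Properties src = new Properties();
        src.put("bootstrap.servers", "r1-region:9092"); // source regional cluster (illustrative)
        src.put("group.id", "replicator-sketch");
        src.put("key.deserializer", ByteArrayDeserializer.class.getName());
        src.put("value.deserializer", ByteArrayDeserializer.class.getName());

        Properties dst = new Properties();
        dst.put("bootstrap.servers", "r1-agg:9092"); // destination agg cluster (illustrative)
        dst.put("key.serializer", ByteArraySerializer.class.getName());
        dst.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(src);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(dst)) {
            consumer.subscribe(Collections.singletonList("trip-events")); // whitelisted topic
            while (true) {
                // Copy every record from the source topic to the same topic
                // on the destination cluster, preserving key and value.
                for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofSeconds(1))) {
                    producer.send(new ProducerRecord<>(r.topic(), r.key(), r.value()));
                }
            }
        }
    }
}
```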
Considerations for Apache Kafka
● Producer
○ Data produced locally
● Data aggregation
○ Topics replicated to agg clusters
● Active-active consumers
○ Double compute
○ Data ingestion
● Active-passive consumers
○ Consistency sensitive apps
○ Challenge on offset sync
Offset sync - challenges
● Requirements (neither built-in reset policy satisfies both; see the sketch after this list)
○ No data loss -> cannot resume from the largest offset
○ Reduce duplicates -> cannot resume from the smallest offset
● Constraints
○ Not all messages have timestamps
○ Messages in the agg cluster are out of order due to the merge
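For context, the two rejected resume points correspond to the consumer's `auto.offset.reset` policies; a small sketch (cluster and group names are made up):

```java
import java.util.Properties;

public class ResetPolicyTradeoff {
    // Sketch: why neither built-in auto.offset.reset policy fits failover.
    public static Properties consumerProps(boolean acceptDataLoss) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "r2-agg:9092"); // failover target (illustrative)
        props.put("group.id", "consistency-sensitive-app");
        // "latest"   = resume from the largest offset  -> skips unconsumed messages (data loss)
        // "earliest" = resume from the smallest offset -> replays the whole topic (duplicates)
        props.put("auto.offset.reset", acceptDataLoss ? "latest" : "earliest");
        return props;
    }
}
```

Hence the need for an offset-sync service that maps a committed offset in one agg cluster to a safe starting offset in the other.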
Offset sync - architecture
● uReplicator reports the src-to-dst offset mapping to the offset manager
● Offset manager (data model sketched after this list)
○ Stores the checkpoint state
○ Translates between src and dst offsets
● Sync job periodically translates the offsets and pushes the new offsets
● Internal consumer looks up the offsets
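A minimal sketch of what the checkpoints and the manager's two lookups might look like. The record shape, class names, and in-memory storage are assumptions (the talk does not specify the implementation), and the per-partition dimension is elided for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed checkpoint shape: "the source cluster has been replicated up to
// src-offset, which landed at dst-offset on the destination cluster".
record Checkpoint(String srcCluster, long srcOffset, String dstCluster, long dstOffset) {}

// Minimal in-memory stand-in for the offset manager; the real service
// persists checkpoint state and serves translations to the sync job.
class OffsetManager {
    private final List<Checkpoint> checkpoints = new ArrayList<>();

    void report(Checkpoint cp) { checkpoints.add(cp); }

    // Most recent checkpoint for (src, dst) whose dst offset is at or
    // below the consumer's committed position on dst.
    Checkpoint latestAtOrBefore(String src, String dst, long committedDstOffset) {
        Checkpoint best = null;
        for (Checkpoint cp : checkpoints) {
            if (cp.srcCluster().equals(src) && cp.dstCluster().equals(dst)
                    && cp.dstOffset() <= committedDstOffset
                    && (best == null || cp.dstOffset() > best.dstOffset())) {
                best = cp;
            }
        }
        return best;
    }

    // Largest dst offset on `dst` covering messages up to `srcOffset`
    // from `src` (i.e. how far the replicated stream had reached).
    long dstOffsetFor(String src, String dst, long srcOffset) {
        long best = 0;
        for (Checkpoint cp : checkpoints) {
            if (cp.srcCluster().equals(src) && cp.dstCluster().equals(dst)
                    && cp.srcOffset() <= srcOffset) {
                best = Math.max(best, cp.dstOffset());
            }
        }
        return best;
    }
}
```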
Offset sync - checkpoint

src-cluster  src-offset  dst-cluster  dst-offset
R1-region    1           R1-agg       1
R2-region    1           R2-agg       1
R2-region    1           R1-agg       3
R1-region    1           R2-agg       3
R1-region    3           R1-agg       5
R2-region    3           R2-agg       5
R2-region    3           R1-agg       7
R1-region    3           R2-agg       7

[Diagram: messages 11-14 from R1-region and 21-24 from R2-region are merged in different interleavings in R1-agg and R2-agg]
Offset sync - translation
● Find the mapped offset during the failover (worked sketch after this list)
○ Find the src offsets from the most recent checkpoints
○ Take the min of the checkpointed offsets on the failed-over agg cluster

[Diagram: a consumer committed at offset 6 on R1-agg; the latest checkpoints at or below it recover src offsets 3 (R1-region, via dst 5) and 1 (R2-region, via dst 3); those map to offsets 7 and 1 on R2-agg, so the consumer resumes from min(7, 1) = 1]
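Putting the checkpoint table and the rule above together, a hedged sketch of the translation, building on the `OffsetManager` sketch earlier:

```java
import java.util.List;

// Builds on the Checkpoint/OffsetManager sketch above. Translates a
// committed offset on the failed agg cluster into a safe resume offset
// on the failover agg cluster.
class OffsetTranslator {
    private final OffsetManager manager;

    OffsetTranslator(OffsetManager manager) { this.manager = manager; }

    long resumeOffset(List<String> srcClusters, String failedAgg,
                      String failoverAgg, long committedOnFailedAgg) {
        long resume = Long.MAX_VALUE;
        for (String src : srcClusters) {
            // 1. Recover the src offset from the most recent checkpoint on
            //    the failed cluster at or below the committed offset.
            Checkpoint onFailed = manager.latestAtOrBefore(src, failedAgg, committedOnFailedAgg);
            if (onFailed == null) return 0; // no checkpoint: fall back to the smallest offset
            // 2. Map that src offset onto the failover agg cluster.
            long mapped = manager.dstOffsetFor(src, failoverAgg, onFailed.srcOffset());
            // 3. Take the min across src clusters: no data loss, at the
            //    cost of re-reading some already-consumed messages.
            resume = Math.min(resume, mapped);
        }
        return resume;
    }
}
```

Taking the min is the conservative end of the tradeoff stated on the challenges slide: it guarantees no loss and merely reduces, rather than eliminates, duplicates.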
Offset sync - active-passive producer
● Find the mapped offset during the failover (variant sketched after this list)
○ Find the src offsets from the most recent checkpoints
○ Take the min of the checkpointed offsets on the failed-over agg cluster, ignoring checkpoints whose src offset is already the latest on the source (a fully caught-up source hides no unconsumed data)

[Diagram: after the producer fails over, new messages (e.g. 15, 16) are produced only in the new region, so the caught-up old source's checkpoint is excluded from the min]
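One way to express the variant, again as an assumption-laden sketch: an extra method on the `OffsetTranslator` above, where `latestOffsetOn()` is a hypothetical helper returning the source cluster's log-end offset (a real implementation could use the consumer's `endOffsets()` API):

```java
// Hypothetical addition to the OffsetTranslator sketch for the
// active-passive producer case.
long resumeOffsetActivePassive(List<String> srcClusters, String failedAgg,
                               String failoverAgg, long committedOnFailedAgg) {
    long resume = Long.MAX_VALUE;
    for (String src : srcClusters) {
        Checkpoint onFailed = manager.latestAtOrBefore(src, failedAgg, committedOnFailedAgg);
        if (onFailed == null) return 0; // no checkpoint: fall back to the smallest offset
        // A source that replication has fully caught up with can hide no
        // unconsumed data, so skip it when taking the min.
        if (onFailed.srcOffset() >= latestOffsetOn(src)) continue;
        resume = Math.min(resume, manager.dstOffsetFor(src, failoverAgg, onFailed.srcOffset()));
    }
    // Long.MAX_VALUE here means every source was caught up: the consumer
    // can resume from the failover cluster's log end.
    return resume;
}

// Hypothetical helper: latest (log-end) offset on the source cluster.
long latestOffsetOn(String srcCluster) { throw new UnsupportedOperationException("sketch"); }
```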
Q&A
Motivation - Why not MirrorMaker
● Pain points
○ Expensive rebalancing
○ Difficulty adding topics
○ Possible data loss
○ Metadata sync issues
