SlideShare a Scribd company logo
1 of 40
Download to read offline
Scaling Uber’s Real-Time Infra for
Trillion Events per Day
Ankur Bansal
Mingmin Chen
Hadoop Summit
June 14, 2017
Ankur Bansal
● Sr. Software Engineer,
Streaming Team @ Uber
○ Apache Kafka
● Staff Software Engineer @ ebay
○ Build & scale Ebay’s cloud
using openstack
● Apache Kylin: Committer,
Emeritus PMC
About the Speakers
Mingmin Chen
● Sr. Software Engineer,
Streaming Team @ Uber
○ Apache Kafka
● Distributed Data Systems @
Twitter
○ Apache Hive
○ Apache Spark
● Exadata @ Oracle
Agenda
● Use Cases & Current Scale
● How We Scaled The Infrastructure
○ Rest Proxy & Clients
○ Local Agent
○ uReplicator (Mirrormaker)
● Reliable Messaging & Tooling
○ At-Least-Once
○ Chaperone (Auditing)
○ Cluster Balancing
● Future Work
Use Cases
Real-time Driver-Rider Matching
Stream
Processing
- Driver-Rider Match
- ETA
App Views
Vehicle information
KAFKA
UberEATS - Real-Time ETAs
A bunch more...
● Fraud Detection
● Share My ETA
● Driver & Rider Signups
● Etc.
Apache Kafka is Uber’s Data Hub
Kafka - Use Cases
● General Pub-Sub
● Stream Processing
○ AthenaX - Self-Serve Platform (Samza, Flink)
● Ingestion
○ HDFS, S3
● Database Changelog Transport
○ Schemaless, Cassandra, MySQL
● Logging
Scale
* obligatory show-off slide
Trillion+ ~PBs
Messages/Day Data Volume
Scale
excluding replication
Tens of Thousands
Topics
Ecosystem @ Uber
PRODUCERS
CONSUMERS
Real-time
Analytics, Alerts,
Dashboards
Samza / Flink
Applications
Data Science
Analytics
Reporting
Kafka
Vertica / Hive
Rider App
Driver App
API / Services
Etc.
Ad-hoc Exploration
ELK
Ecosystem @ Uber
Debugging
Hadoop
Surge Mobile App
Cassandra
Schemaless
MySQL
AWS S3
Requirements
● Scale Horizontally
● API Latency (<5ms typically)
● Availability -> 99.99%
● Durability -> 99.99%; 100% -> Critical Customers
● Multi-DC Replication
● Multi-Language Support
○ Java, Go, Python, Node.js, C++
● Auditing
Kafka Pipeline
Local
Agent
uReplicator
Data Flow: Batching everywhere
1
2
3 5 7
64 8
1 1
Kafka Clusters
Local
Agent
uReplicator
Kafka Clusters
● Running Kafka 0.10
● Use Case-based
○ Data
○ Logging
○ Database Changelogs
○ Highly Isolated & Reliable e.g. Surge
○ High Value Data (e.g. Signups)
● Fallback Secondary Clusters
● Global Aggregates
Kafka Rest Proxy
Local
Agent
uReplicator
Why Kafka Rest Proxy ?
● Simplified Client API
○ Multi-lang Support
● Decouple Client With Kafka broker
○ Thin Clients = Operational Ease
○ Easier Kafka Upgrades
● Enhanced Reliability
○ Quota Management
○ Primary & Secondary Clusters
Kafka Rest Proxy: Internals
● Based on Confluent’s open sourced Rest Proxy
● Performance enhancements
○ Simple HTTP servlets on jetty instead of Jersey
○ Optimized for binary payloads.
○ Performance increase from 7K* to 45K QPS/box
● Caching of topic metadata
● Reliability improvements*
○ Support for Fallback cluster
○ Support for multiple producers (SLA-based segregation)
● Plan to contribute back to community
*Based on benchmarking & analysis done in Jun ’2015
Rest Proxy: Performance (1 box)
Message rate (K/second) at single node
End-endLatency(ms)
Producer Libraries
● High Throughput
○ Non-blocking, async, batched
○ Back off when throttled
● ‘Topic Discovery’
○ Discovers the kafka cluster a topic belongs
○ Able to multiplex to different kafka clusters
● Local Agent for Critical Data
Kafka Local Agent
Local
Agent
uReplicator
Local Agent
● Producer side persistence
○ Local storage
● Isolates clients from downstream outages, backpressure
● Controlled backfill upon recovery
○ Prevents from overwhelming a recovering cluster
Local Agent in Action
Add
Figure
uReplicator
Local
Agent
uReplicator
uReplicator
● In-house Intercluster Replication Solution
○ Apache Helix-based
○ Mirror all traffic between & within DCs
○ Lower rebalance latencies
● Running in Production ~2 Years
● Open Sourced: https://github.com/uber/uReplicator
● Uber Engineering Blog: https://eng.uber.com/ureplicator/
At-Least-Once
At-Least-Once
Application Process Kafka Proxy Server uReplicator
1
2
3 5 7
64 8
Regional Kafka Aggregate Kafka
● Most of infrastructure tuned for high throughput
○ Batching at each stage
○ Ack before being persisted (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments
At-least-once Kafka: Data Flow
Application Process
ProxyClient
Kafka Proxy Server uReplicator
1
6
2 3 7
45 8
Regional Kafka Aggregate Kafka
Auditing - Chaperone
CONFIDENTIAL
>> INSERT SCREENSHOT HERE <<
Chaperone - Track Counts
CONFIDENTIAL
>> INSERT SCREENSHOT HERE <<
Chaperone - Track Latency
Chaperone - End to End Auditing
● In-house Auditing Solution for Kafka
● Running in Production for ~2 Years
○ Audit 20k+ topics for 99.99% completeness
● Open Sourced: https://github.com/uber/chaperone
● Uber Engineering Blog: https://eng.uber.com/chaperone/
Cluster Balancing
● No Auto Rebalancing
● Manual Placement is Hard
● Auto Plan Generation
○ And execution!
Cluster Balancing
Future Work
Future Work
● Multi-zone Clusters
○ Durability during DC wide outages
● Chargebacks
● Efficiency Enhancements
○ Intelligent aggregates, automated topic GC etc..
● uReplicator Enhancements
● Open Source
● And Much More..
More open-source projects at eng.uber.com

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain confluent
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight OverviewJacques Nadeau
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorDatabricks
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain Kafka streams windowing behind the curtain
Kafka streams windowing behind the curtain
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 

Similar to Scaling Uber's Real-Time Infra for Trillion Events Daily

Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupMingmin Chen
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data AnalyticsAnkur Bansal
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaHotstar
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafkaconfluent
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniMonal Daxini
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterHostedbyConfluent
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with KubernetesRaghavendra Prabhu
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017Monal Daxini
 

Similar to Scaling Uber's Real-Time Infra for Trillion Events Daily (20)

Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Netty training
Netty trainingNetty training
Netty training
 
Netty training
Netty trainingNetty training
Netty training
 
A Tour of Apache Kafka
A Tour of Apache KafkaA Tour of Apache Kafka
A Tour of Apache Kafka
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, TwitterTwitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Scaling Uber's Real-Time Infra for Trillion Events Daily

  • 1. Scaling Uber’s Real-Time Infra for Trillion Events per Day Ankur Bansal Mingmin Chen Hadoop Summit June 14, 2017
  • 2. Ankur Bansal ● Sr. Software Engineer, Streaming Team @ Uber ○ Apache Kafka ● Staff Software Engineer @ ebay ○ Build & scale Ebay’s cloud using openstack ● Apache Kylin: Committer, Emeritus PMC About the Speakers Mingmin Chen ● Sr. Software Engineer, Streaming Team @ Uber ○ Apache Kafka ● Distributed Data Systems @ Twitter ○ Apache Hive ○ Apache Spark ● Exadata @ Oracle
  • 3. Agenda ● Use Cases & Current Scale ● How We Scaled The Infrastructure ○ Rest Proxy & Clients ○ Local Agent ○ uReplicator (Mirrormaker) ● Reliable Messaging & Tooling ○ At-Least-Once ○ Chaperone (Auditing) ○ Cluster Balancing ● Future Work
  • 5. Real-time Driver-Rider Matching Stream Processing - Driver-Rider Match - ETA App Views Vehicle information KAFKA
  • 7. A bunch more... ● Fraud Detection ● Share My ETA ● Driver & Rider Signups ● Etc.
  • 8. Apache Kafka is Uber’s Data Hub
  • 9. Kafka - Use Cases ● General Pub-Sub ● Stream Processing ○ AthenaX - Self-Serve Platform (Samza, Flink) ● Ingestion ○ HDFS, S3 ● Database Changelog Transport ○ Schemaless, Cassandra, MySQL ● Logging
  • 11. Trillion+ ~PBs Messages/Day Data Volume Scale excluding replication Tens of Thousands Topics
  • 13. PRODUCERS CONSUMERS Real-time Analytics, Alerts, Dashboards Samza / Flink Applications Data Science Analytics Reporting Kafka Vertica / Hive Rider App Driver App API / Services Etc. Ad-hoc Exploration ELK Ecosystem @ Uber Debugging Hadoop Surge Mobile App Cassandra Schemaless MySQL AWS S3
  • 14. Requirements ● Scale Horizontally ● API Latency (<5ms typically) ● Availability -> 99.99% ● Durability -> 99.99%; 100% -> Critical Customers ● Multi-DC Replication ● Multi-Language Support ○ Java, Go, Python, Node.js, C++ ● Auditing
  • 16. Data Flow: Batching everywhere 1 2 3 5 7 64 8 1 1
  • 18. Kafka Clusters ● Running Kafka 0.10 ● Use Case-based ○ Data ○ Logging ○ Database Changelogs ○ Highly Isolated & Reliable e.g. Surge ○ High Value Data (e.g. Signups) ● Fallback Secondary Clusters ● Global Aggregates
  • 20. Why Kafka Rest Proxy ? ● Simplified Client API ○ Multi-lang Support ● Decouple Client With Kafka broker ○ Thin Clients = Operational Ease ○ Easier Kafka Upgrades ● Enhanced Reliability ○ Quota Management ○ Primary & Secondary Clusters
  • 21. Kafka Rest Proxy: Internals ● Based on Confluent’s open sourced Rest Proxy ● Performance enhancements ○ Simple HTTP servlets on jetty instead of Jersey ○ Optimized for binary payloads. ○ Performance increase from 7K* to 45K QPS/box ● Caching of topic metadata ● Reliability improvements* ○ Support for Fallback cluster ○ Support for multiple producers (SLA-based segregation) ● Plan to contribute back to community *Based on benchmarking & analysis done in Jun ’2015
  • 22. Rest Proxy: Performance (1 box) Message rate (K/second) at single node End-endLatency(ms)
  • 23. Producer Libraries ● High Throughput ○ Non-blocking, async, batched ○ Back off when throttled ● ‘Topic Discovery’ ○ Discovers the kafka cluster a topic belongs ○ Able to multiplex to different kafka clusters ● Local Agent for Critical Data
  • 25. Local Agent ● Producer side persistence ○ Local storage ● Isolates clients from downstream outages, backpressure ● Controlled backfill upon recovery ○ Prevents from overwhelming a recovering cluster
  • 26. Local Agent in Action Add Figure
  • 28. uReplicator ● In-house Intercluster Replication Solution ○ Apache Helix-based ○ Mirror all traffic between & within DCs ○ Lower rebalance latencies ● Running in Production ~2 Years ● Open Sourced: https://github.com/uber/uReplicator ● Uber Engineering Blog: https://eng.uber.com/ureplicator/
  • 30. At-Least-Once Application Process Kafka Proxy Server uReplicator 1 2 3 5 7 64 8 Regional Kafka Aggregate Kafka ● Most of infrastructure tuned for high throughput ○ Batching at each stage ○ Ack before being persisted (ack’ed != committed) ● Single node failure in any stage leads to data loss ● Need a reliable pipeline for High Value Data e.g. Payments
  • 31. At-least-once Kafka: Data Flow Application Process ProxyClient Kafka Proxy Server uReplicator 1 6 2 3 7 45 8 Regional Kafka Aggregate Kafka
  • 33. CONFIDENTIAL >> INSERT SCREENSHOT HERE << Chaperone - Track Counts
  • 34. CONFIDENTIAL >> INSERT SCREENSHOT HERE << Chaperone - Track Latency
  • 35. Chaperone - End to End Auditing ● In-house Auditing Solution for Kafka ● Running in Production for ~2 Years ○ Audit 20k+ topics for 99.99% completeness ● Open Sourced: https://github.com/uber/chaperone ● Uber Engineering Blog: https://eng.uber.com/chaperone/
  • 36. Cluster Balancing ● No Auto Rebalancing ● Manual Placement is Hard ● Auto Plan Generation ○ And execution!
  • 39. Future Work ● Multi-zone Clusters ○ Durability during DC wide outages ● Chargebacks ● Efficiency Enhancements ○ Intelligent aggregates, automated topic GC etc.. ● uReplicator Enhancements ● Open Source ● And Much More..
  • 40. More open-source projects at eng.uber.com