
How Uber scaled its Real Time Infrastructure to Trillion events per day


Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.

Real-time infrastructure powers critical pieces of Uber (think Surge). In this talk we will discuss our architecture, technical challenges, and learnings, and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies has helped Uber scale.

Published in: Technology


  1. 1. Scaling Uber’s Real-Time Infra for Trillion Events per Day. Ankur Bansal, Mingmin Chen. Hadoop Summit, June 14, 2017
  2. 2. About the Speakers. Ankur Bansal ● Sr. Software Engineer, Streaming Team @ Uber ○ Apache Kafka ● Staff Software Engineer @ eBay ○ Build & scale eBay’s cloud using OpenStack ● Apache Kylin: Committer, Emeritus PMC. Mingmin Chen ● Sr. Software Engineer, Streaming Team @ Uber ○ Apache Kafka ● Distributed Data Systems @ Twitter ○ Apache Hive ○ Apache Spark ● Exadata @ Oracle
  3. 3. Agenda ● Use Cases & Current Scale ● How We Scaled The Infrastructure ○ Rest Proxy & Clients ○ Local Agent ○ uReplicator (Mirrormaker) ● Reliable Messaging & Tooling ○ At-Least-Once ○ Chaperone (Auditing) ○ Cluster Balancing ● Future Work
  4. 4. Use Cases
  5. 5. Real-time Driver-Rider Matching [diagram labels: App Views, Vehicle information, KAFKA, Stream Processing (Driver-Rider Match, ETA)]
  6. 6. UberEATS - Real-Time ETAs
  7. 7. A bunch more... ● Fraud Detection ● Share My ETA ● Driver & Rider Signups ● Etc.
  8. 8. Apache Kafka is Uber’s Data Hub
  9. 9. Kafka - Use Cases ● General Pub-Sub ● Stream Processing ○ AthenaX - Self-Serve Platform (Samza, Flink) ● Ingestion ○ HDFS, S3 ● Database Changelog Transport ○ Schemaless, Cassandra, MySQL ● Logging
  10. 10. Scale * obligatory show-off slide
  11. 11. Scale (excluding replication) ● Trillion+ Messages/Day ● ~PBs Data Volume ● Tens of Thousands Topics
  12. 12. Ecosystem @ Uber
  13. 13. Ecosystem @ Uber [diagram labels: PRODUCERS (Rider App, Driver App, API / Services, etc.); Kafka; CONSUMERS; Real-time Analytics, Alerts, Dashboards; Samza / Flink; Applications; Data Science; Analytics; Reporting; Vertica / Hive; Ad-hoc Exploration; ELK; Debugging; Hadoop; Surge; Mobile App; Cassandra; Schemaless; MySQL; AWS S3]
  14. 14. Requirements ● Scale Horizontally ● API Latency (<5ms typically) ● Availability -> 99.99% ● Durability -> 99.99%; 100% -> Critical Customers ● Multi-DC Replication ● Multi-Language Support ○ Java, Go, Python, Node.js, C++ ● Auditing
  15. 15. Kafka Pipeline Local Agent uReplicator
  16. 16. Data Flow: Batching everywhere [diagram: numbered pipeline stages]
  17. 17. Kafka Clusters Local Agent uReplicator
  18. 18. Kafka Clusters ● Running Kafka 0.10 ● Use Case-based ○ Data ○ Logging ○ Database Changelogs ○ Highly Isolated & Reliable e.g. Surge ○ High Value Data (e.g. Signups) ● Fallback Secondary Clusters ● Global Aggregates
  19. 19. Kafka Rest Proxy Local Agent uReplicator
  20. 20. Why Kafka Rest Proxy? ● Simplified Client API ○ Multi-lang Support ● Decouple Clients from Kafka Brokers ○ Thin Clients = Operational Ease ○ Easier Kafka Upgrades ● Enhanced Reliability ○ Quota Management ○ Primary & Secondary Clusters
  21. 21. Kafka Rest Proxy: Internals ● Based on Confluent’s open sourced Rest Proxy ● Performance enhancements ○ Simple HTTP servlets on Jetty instead of Jersey ○ Optimized for binary payloads ○ Performance increase from 7K* to 45K QPS/box ● Caching of topic metadata ● Reliability improvements* ○ Support for Fallback cluster ○ Support for multiple producers (SLA-based segregation) ● Plan to contribute back to community *Based on benchmarking & analysis done in June 2015
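
The "Support for Fallback cluster" item above can be pictured with a minimal sketch. This is not Uber's proxy code: `ClusterUnavailable`, `produce_with_fallback`, and the per-cluster send callables are hypothetical names introduced only for illustration, and the sketch assumes the proxy simply tries the primary cluster first and the secondary next.

```python
from typing import Callable, List, Optional, Tuple


class ClusterUnavailable(Exception):
    """Hypothetical error: a cluster could not accept the write."""


# A send function takes (topic, payload) and raises ClusterUnavailable on failure.
SendFn = Callable[[str, bytes], None]


def produce_with_fallback(topic: str, payload: bytes,
                          clusters: List[Tuple[str, SendFn]]) -> str:
    """Try clusters in priority order and return the name of the one that accepted."""
    last_error: Optional[Exception] = None
    for name, send in clusters:
        try:
            send(topic, payload)
            return name
        except ClusterUnavailable as exc:
            # Remember the failure and move on to the next (fallback) cluster.
            last_error = exc
    raise ClusterUnavailable(f"all clusters rejected topic {topic!r}") from last_error
```

Passing the clusters as an ordered list keeps the priority explicit: the primary is listed first and the secondary is only touched when the primary raises.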
  22. 22. Rest Proxy: Performance (1 box) [plot: end-to-end latency (ms) vs. message rate (K/second) at a single node]
  23. 23. Producer Libraries ● High Throughput ○ Non-blocking, async, batched ○ Back off when throttled ● ‘Topic Discovery’ ○ Discovers the Kafka cluster a topic belongs to ○ Able to multiplex to different Kafka clusters ● Local Agent for Critical Data
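
The "batched" and "back off when throttled" behavior listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not Uber's client library: `Throttled`, `BatchingProducer`, and the `send_batch` callable are hypothetical, and the sketch is single-threaded for brevity, whereas a non-blocking client would flush from a background thread.

```python
import random
import time
from typing import Callable, List


class Throttled(Exception):
    """Hypothetical signal that the proxy rejected a batch because of a quota."""


class BatchingProducer:
    """Buffers messages and sends them in batches, backing off when throttled."""

    def __init__(self, send_batch: Callable[[List[bytes]], None],
                 max_batch: int = 100, base_delay: float = 0.05,
                 max_delay: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep
        self._buffer: List[bytes] = []

    def produce(self, message: bytes) -> None:
        """Queue a message; flush once the batch is full."""
        self._buffer.append(message)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def pending(self) -> int:
        """Number of messages still waiting to be sent."""
        return len(self._buffer)

    def flush(self, max_attempts: int = 5) -> bool:
        """Send the buffered batch, retrying with jittered exponential backoff."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for attempt in range(max_attempts):
            try:
                self._send_batch(batch)
            except Throttled:
                # Double the delay ceiling on each throttled attempt, up to max_delay.
                delay = min(self._max_delay, self._base_delay * 2 ** attempt)
                self._sleep(random.uniform(0, delay))
                continue
            del self._buffer[:len(batch)]
            return True
        return False  # messages stay buffered for a later flush
```

Messages are only dropped from the buffer after a successful send, so a throttled batch is retried rather than lost.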
  24. 24. Kafka Local Agent Local Agent uReplicator
  25. 25. Local Agent ● Producer side persistence ○ Local storage ● Isolates clients from downstream outages, backpressure ● Controlled backfill upon recovery ○ Avoids overwhelming a recovering cluster
  26. 26. Local Agent in Action [figure not included]
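
A minimal sketch of the local agent idea above: accept writes locally, then drain them downstream at a bounded rate. The names `LocalAgent`, `forward`, and `backfill_per_tick` are hypothetical, and an in-memory deque stands in for the local storage mentioned on the slide.

```python
from collections import deque
from typing import Callable, Deque


class LocalAgent:
    """Spools messages locally and forwards them downstream at a bounded rate."""

    def __init__(self, forward: Callable[[bytes], bool],
                 backfill_per_tick: int = 2) -> None:
        self._forward = forward          # returns True if downstream accepted
        self._limit = backfill_per_tick  # cap on messages forwarded per tick
        # Assumption: a deque stands in for on-disk storage in this sketch.
        self._spool: Deque[bytes] = deque()

    def accept(self, message: bytes) -> None:
        """Always succeeds for the client, even when downstream is unavailable."""
        self._spool.append(message)

    def backlog(self) -> int:
        """Number of messages waiting to be forwarded."""
        return len(self._spool)

    def tick(self) -> int:
        """Forward up to the per-tick limit, stopping at the first failure."""
        sent = 0
        while self._spool and sent < self._limit:
            if not self._forward(self._spool[0]):
                break  # downstream still unhealthy; retry on the next tick
            self._spool.popleft()
            sent += 1
        return sent
```

The per-tick cap is what makes the backfill "controlled": after an outage the backlog drains over several ticks instead of hitting the recovering cluster all at once.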
  27. 27. uReplicator Local Agent uReplicator
  28. 28. uReplicator ● In-house Intercluster Replication Solution ○ Apache Helix-based ○ Mirror all traffic between & within DCs ○ Lower rebalance latencies ● Running in Production ~2 Years ● Open Sourced: https://github.com/uber/uReplicator ● Uber Engineering Blog: https://eng.uber.com/ureplicator/
  29. 29. At-Least-Once
  30. 30. At-Least-Once [diagram: Application Process, Kafka Proxy Server, Regional Kafka, uReplicator, Aggregate Kafka; steps numbered 1-8] ● Most of the infrastructure is tuned for high throughput ○ Batching at each stage ○ Ack before being persisted (ack’ed != committed) ● Single node failure in any stage leads to data loss ● Need a reliable pipeline for High Value Data e.g. Payments
  31. 31. At-least-once Kafka: Data Flow [diagram: Application Process, ProxyClient, Kafka Proxy Server, Regional Kafka, uReplicator, Aggregate Kafka; steps numbered 1-8]
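
To make the "ack’ed != committed" point concrete, here is a small sketch of at-least-once delivery in general, not of Uber's pipeline. The sender treats a message as done only after the downstream stage confirms it was persisted, and retries otherwise, which means duplicates are possible. `deliver_at_least_once`, `persist`, and `dedupe` are hypothetical names.

```python
from typing import Callable, Hashable, Iterable, List


def deliver_at_least_once(message: bytes, persist: Callable[[bytes], None],
                          max_attempts: int = 3) -> bool:
    """Retry until the downstream stage confirms persistence.

    `persist` is assumed to return only after the message is durably stored,
    and to raise TimeoutError when no confirmation arrives in time.
    """
    for _ in range(max_attempts):
        try:
            persist(message)
        except TimeoutError:
            continue  # no confirmation: the write may or may not have landed
        return True
    return False


def dedupe(messages: Iterable[Hashable]) -> List[Hashable]:
    """Drop repeated messages while preserving first-seen order."""
    seen = set()
    unique = []
    for message in messages:
        if message not in seen:
            seen.add(message)
            unique.append(message)
    return unique
```

If a write lands but its confirmation is lost, the retry stores the message twice. That is the trade-off of at-least-once: no loss, but consumers need to tolerate or remove duplicates.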
  32. 32. Auditing - Chaperone
  33. 33. Chaperone - Track Counts [screenshot not included]
  34. 34. Chaperone - Track Latency [screenshot not included]
  35. 35. Chaperone - End to End Auditing ● In-house Auditing Solution for Kafka ● Running in Production for ~2 Years ○ Audit 20k+ topics for 99.99% completeness ● Open Sourced: https://github.com/uber/chaperone ● Uber Engineering Blog: https://eng.uber.com/chaperone/
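
One way to picture count-based completeness auditing is sketched below. It is not Chaperone's implementation: the function names and the 600-second window are assumptions made for illustration. Each pipeline stage counts messages per topic and time window, and the counts from two stages are then compared.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 600  # assumed window size, chosen only for this sketch

Bucket = Tuple[str, int]  # (topic, window start in epoch seconds)


def window_start(timestamp: float, window: int = WINDOW_SECONDS) -> int:
    """Round a message timestamp down to the start of its window."""
    ts = int(timestamp)
    return ts - ts % window


def audit_counts(events: Iterable[Tuple[str, float]],
                 window: int = WINDOW_SECONDS) -> Counter:
    """Count (topic, timestamp) events per topic and window at one stage."""
    counts: Counter = Counter()
    for topic, timestamp in events:
        counts[(topic, window_start(timestamp, window))] += 1
    return counts


def completeness(upstream: Counter, downstream: Counter) -> Dict[Bucket, float]:
    """Fraction of upstream messages seen downstream, per bucket."""
    return {bucket: downstream.get(bucket, 0) / count
            for bucket, count in upstream.items() if count}
```

Bucketing by the timestamp carried in the message, rather than by arrival time, lets different stages agree on which window a message belongs to even when it arrives late.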
  36. 36. Cluster Balancing ● No Auto Rebalancing ● Manual Placement is Hard ● Auto Plan Generation ○ And execution!
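
The slides do not say how plans are generated, so the following is only a sketch of one simple approach: greedy placement, heaviest partition first, onto the least-loaded brokers. `generate_plan` and its load figures are hypothetical. The output dictionary follows the JSON layout that Kafka's partition reassignment tool accepts.

```python
from typing import Dict, List, Tuple


def generate_plan(partition_load: Dict[Tuple[str, int], float],
                  brokers: List[int],
                  replication_factor: int = 2) -> Tuple[dict, Dict[int, float]]:
    """Greedily place replicas on the least-loaded brokers.

    Returns the reassignment plan and the resulting per-broker load.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds broker count")
    load = {broker: 0.0 for broker in brokers}
    assignments = []
    # Place the heaviest partitions first so lighter ones can even things out.
    ordered = sorted(partition_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for (topic, partition), weight in ordered:
        # Pick distinct brokers with the lowest current load (ties by broker id).
        replicas = sorted(brokers, key=lambda b: (load[b], b))[:replication_factor]
        for broker in replicas:
            load[broker] += weight
        assignments.append({"topic": topic, "partition": partition,
                            "replicas": replicas})
    return {"version": 1, "partitions": assignments}, load
```

A plan like this only covers generation. Executing it safely, for example by throttling the data movement, is a separate concern that the sketch leaves out.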
  37. 37. Cluster Balancing
  38. 38. Future Work
  39. 39. Future Work ● Multi-zone Clusters ○ Durability during DC-wide outages ● Chargebacks ● Efficiency Enhancements ○ Intelligent aggregates, automated topic GC, etc. ● uReplicator Enhancements ● Open Source ● And Much More
  40. 40. More open-source projects at eng.uber.com
