Scaling up uber's real time data analytics

Scaling up Uber’s Real-time Data Analytics
Xiang Fu
James Shao

Agenda
● Use Cases
● Streaming Data Infrastructure
● Streaming Processing Platform
● Streaming Analytics Platform
● Future Work

Real-time Driver-Rider Matching
[Diagram: App Views and Vehicle information → KAFKA → Stream Processing (Driver-Rider Match, ETA)]

UberEATS - Real-Time Analytics

A bunch more...
● Fraud Detection
● Share My ETA
● Safety
● Etc.

Scale
● Trillion+ messages/day
● ~PBs data volume
● Tens of thousands of topics
(excluding replication)

Requirements
● High Throughput
● Low Latency for most use cases (< 1 ms)
● Reliability - At least 99.99%, and 100% for critical use cases
● At-least-once/Cross-DC pipeline for business-critical use cases
● Multi-Language Support (Go/Java/Python/C++)
● Tens of thousands of simultaneous clients
● Reliable data replication across DCs

Ecosystem @ Uber
[Diagram: PRODUCERS → Kafka → CONSUMERS]
● PRODUCERS: Rider App, Driver App, API / Services, etc.
● DATABASES: Cassandra, Schemaless, MySQL
● CONSUMERS:
○ Real-time Analytics, Alerts, Dashboards (Samza / Flink)
○ Applications, Data Science
○ Analytics, Reporting (Vertica / Hive)
○ Ad-hoc Exploration, Debugging (ELK)
○ Hadoop
○ Payment / Payment processing
○ AWS S3

Kafka Pipeline
[Diagram: two data centers, DC1 and DC2, with the same layout. In each: Applications [ProxyClient] → Kafka REST Proxy → Regional Kafka → uReplicator → Aggregate Kafka. A Local Agent and an Offset Sync Service are also shown.]

Regular Kafka Data Flow
● Provides a 99.99% data durability guarantee, latency < 1 ms
● Relies heavily on batching/buffering
● Cost-effective storage, with guaranteed performance for critical components
● Target majority of use cases:
○ ETA
○ Logging
○ Business events
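A minimal sketch of the batching/buffering idea above, in Python. The class name BatchingBuffer, its parameters (max_batch_size, max_linger_s), and the send_batch callback are hypothetical and are not part of Kafka or of Uber's REST proxy. The sketch only illustrates the trade-off: append() returns before anything reaches a broker, which keeps produce latency low but leaves a window in which buffered messages can be lost.

```python
import time
from typing import Callable, List, Optional


class BatchingBuffer:
    """Buffers messages and hands them to `send_batch` in groups (illustrative only)."""

    def __init__(self, send_batch: Callable[[List[bytes]], None],
                 max_batch_size: int = 100, max_linger_s: float = 0.005,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send_batch = send_batch          # hypothetical downstream sender
        self._max_batch_size = max_batch_size  # flush when this many messages are buffered
        self._max_linger_s = max_linger_s      # or when the oldest message is this old
        self._clock = clock
        self._pending: List[bytes] = []
        self._oldest: Optional[float] = None

    def append(self, message: bytes) -> None:
        """Returns right away; the message is buffered, not yet durable."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(message)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def poll(self) -> None:
        """Call periodically: flushes once the oldest message exceeds the linger time."""
        if self._pending and self._clock() - self._oldest >= self._max_linger_s:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as a single batch."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._oldest = None
            self._send_batch(batch)
```

Larger batches and longer linger times raise throughput at the cost of more data at risk if the process dies before a flush.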

At-least-once/XDC Kafka Data Flow
● At-least-once Kafka cluster
○ 100% durability, 10-20 ms produce latency
○ Expensive to operate (harder to scale)
○ Use case: critical business events, driver/rider signup, DB change-log
● Cross-DC Kafka cluster
○ Cluster consists of machines from multiple data centers
○ 100% durability even if one DC is lost
○ Most expensive to operate
○ Use case: payments, insurance sign-up, etc.
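To make the at-least-once guarantee concrete, here is a small Python sketch under stated assumptions. produce_at_least_once, FlakyBroker, and deduplicate are hypothetical names, not Kafka or Uber APIs. The producer resends until it sees an acknowledgement, so a lost ack produces a duplicate rather than a lost message, and the consumer side has to tolerate duplicates.

```python
from typing import Callable, List


def produce_at_least_once(message: str, send: Callable[[str], bool],
                          max_attempts: int = 5) -> int:
    """Resend until `send` reports an ack. Returns the number of attempts.

    Raises instead of silently dropping the message if no ack ever arrives.
    """
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    raise RuntimeError(f"no ack after {max_attempts} attempts: {message!r}")


class FlakyBroker:
    """Hypothetical broker: every write succeeds, but the first ack is lost."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self._ack_lost_once = False

    def send(self, message: str) -> bool:
        self.log.append(message)       # the write itself is durable
        if not self._ack_lost_once:
            self._ack_lost_once = True
            return False               # ack lost, so the producer will retry
        return True


def deduplicate(log: List[str]) -> List[str]:
    """Consumer-side cleanup: keep the first occurrence of each message."""
    seen = set()
    unique = []
    for message in log:
        if message not in seen:
            seen.add(message)
            unique.append(message)
    return unique
```

In this sketch a single signup event ends up in the broker log twice after one lost ack, which is the behavior "at-least-once" permits.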

Auditing - Chaperone
● Small embedded client in each layer of Kafka components
● Collect and aggregate data for each Kafka topic
● Provide report on data completeness and latency
● Alert developers if completeness/latency metrics fall below SLA
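The auditing idea can be sketched as counting messages per topic and time window at two tiers and comparing the counts. This is a simplified Python illustration, not Chaperone's actual code: the function names, the 600-second window, and the 0.9999 threshold are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Bucket = Tuple[str, int]  # (topic, window start in seconds)


def count_by_window(events: Iterable[Tuple[str, float]],
                    window_s: int = 600) -> Counter:
    """Count messages per (topic, tumbling window) from (topic, event_time) pairs."""
    counts: Counter = Counter()
    for topic, event_time in events:
        window_start = int(event_time // window_s) * window_s
        counts[(topic, window_start)] += 1
    return counts


def completeness(upstream: Counter, downstream: Counter) -> Dict[Bucket, float]:
    """Fraction of upstream messages that were also seen downstream, per bucket."""
    return {bucket: downstream.get(bucket, 0) / n for bucket, n in upstream.items()}


def below_sla(report: Dict[Bucket, float], threshold: float = 0.9999) -> List[Bucket]:
    """Buckets whose completeness is under the (assumed) SLA threshold."""
    return sorted(bucket for bucket, ratio in report.items() if ratio < threshold)
```

Because buckets are keyed by event time rather than arrival time, the two tiers can be compared even when messages arrive late or out of order.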

Uber’s Business is Real-Time

Challenges
Infrastructure
● 100s of billions of messages/day
● At-least-once processing
● Exactly-once state processing
● 99.99% SLA on availability
● 99.99% SLA on latency
Operation
● ~200+ Streaming jobs
● Multiple Data Centers
Productivity
● Target Audience
○ Ops
○ Data Scientists
○ Engineers
● Integration
○ Logging
○ Backend Services
○ Storage Systems
○ Data Management
○ Monitoring
○ Reporting

Streaming Job Lifecycle
[Diagram: two phases, Job Definition and Job Deployment and Maintenance. Modules shown: Job Resource Estimation, Streaming Job Config, Job Metadata Config, Job Profiling, Monitoring and Alerts, Logging, Business Logic, Deployment, Maintenance, Upgrade, All Active/Failovers, Security, Testing & Debugging. Legend: Blue = Job Specific, Orange = Common Modules.]

SQL to be the savior
60-70% of jobs could be expressed as SQL

AthenaX Approach
Write SQL to build streaming applications

Why Flink
● Apache Calcite (SQL) Integration
● Easy to manage and scale
● Stateful and fault tolerant
● Accurate (Exactly Once Semantics)
● HDFS integration
● Not dependent on Kafka
● Active Community

Predict the ETD
● Key metric: time to prepare a meal (t_prep)
● Periodically learn a function f: (order status) → t_prep
● Predict the ETD for current orders using f
● AthenaX extracts features for both learning and prediction
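A toy Python sketch of the learn-then-predict loop above. The slides do not say what form f takes, so this example assumes the simplest possible model: f maps each order status to the mean of the preparation times observed for it. The names learn_f and predict_t_prep and the fallback default are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, Iterable, Tuple


def learn_f(history: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Learn f: order status -> mean observed preparation time (assumed model)."""
    by_status = defaultdict(list)
    for status, t_prep in history:
        by_status[status].append(t_prep)
    return {status: mean(values) for status, values in by_status.items()}


def predict_t_prep(f: Dict[Hashable, float], status: Hashable,
                   default: float) -> float:
    """Predict t_prep for a current order, falling back to `default` for unseen statuses."""
    return f.get(status, default)
```

Re-running learn_f on a fresh window of history corresponds to the "periodically" in the bullet above. A full ETD would add further components on top of the predicted preparation time.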

Job Definition
[Figure: a job definition with User Defined Functions, Window based aggregation, Input Connector, Output Connector, Environments, Job Resource Estimation]

Job Validation & Resource Estimation
[Diagram: AthenaX components: UI, Job Validation, Resource Estimation, Job Generator, Deployment, WatchDog]
● Validations
○ SQL Validation
■ Syntax
■ Semantics
○ Input Source Validation
○ Destination Validation
● Resource Estimation
○ Kafka input rate
○ Kafka peak rate
○ Kafka partitions
○ Type of Query
○ Output connector type

Executing AthenaX Applications
Compile SQLs to Flink Job
● Compilation & Job Generation
○ Compiler: SQL -> Logical plan -> Flink app
○ Optimizer: Flink app -> Optimized Logical plan -> Physical plan -> Job Graph
SQL:
SELECT AVG(meal_prep_time) FROM eats_order
GROUP BY HOP(proctime(), INTERVAL '1' MINUTE, INTERVAL '15' MINUTE)
Flink app:
val eats = getEatsOrder()
eats.window(Slide.over("15.minutes").every("1.minute"))
  .avg("meal_prep_time")
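To show what the hopping (sliding) window in the query computes, here is a plain-Python sketch. It is not Flink code and does not reproduce what AthenaX generates. It also assumes explicit event timestamps in seconds, whereas the query above uses processing time. Each event belongs to every window of length size_s whose start is a multiple of slide_s and whose interval [start, start + size_s) contains the event.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hop_window_starts(ts: int, slide_s: int, size_s: int) -> List[int]:
    """Starts of all windows [start, start + size_s) that contain timestamp ts."""
    last_start = ts - ts % slide_s          # latest window that has already opened
    return list(range(last_start, ts - size_s, -slide_s))


def hop_avg(events: Iterable[Tuple[int, float]],
            slide_s: int = 60, size_s: int = 900) -> Dict[int, float]:
    """Average value per hopping window, keyed by window start."""
    acc = defaultdict(lambda: [0.0, 0])     # window start -> [sum, count]
    for ts, value in events:
        for start in hop_window_starts(ts, slide_s, size_s):
            acc[start][0] += value
            acc[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(acc.items())}
```

With a 1-minute slide and a 15-minute size, every event contributes to 15 overlapping windows, so a new 15-minute average is produced each minute.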

AthenaX Deployment
● Job Data store (MySQL)
○ Job Instances
○ Job Config
○ Instance Config
● Resource Management
○ Isolation
○ Validation
○ Utilization
● Job Promotion
○ Self-Serve Flink on YARN
HDFS

WatchDog
Operational Work
● Monitoring and Alerting
● Auto Scaling
○ Organic growth
○ Bounded Resources increase
● Failover handling
● Failure recovery
100s of jobs - Operational nightmare

Conclusion
● AthenaX: write SQL to build streaming applications
○ Treat table as a generic concept
○ Productivity: development -> production in hours
● The AthenaX Approach
○ SQL on streams as a platform
○ Self-serving production support end-to-end

Agenda
● Use Cases & Scale
● Streaming Data Infrastructure
● Streaming Processing Platform
● Streaming Analytics Platform
● Future Work

Real-Time Analytics Use Cases - Dashboarding
● Target Users:
○ CityOps
○ Executives
● Ingestion latency
○ secs to mins
● Query latency
○ < 1s
● QPS: medium

Use Cases - Adhoc Queries
● Target Users:
○ Data Scientists
○ CityOps
● Ingestion latency
○ mins
● Query latency
○ A few seconds
● QPS: low

Use Cases - Machine Decisions
● Target Users:
○ Applications
● Ingestion latency
○ secs to mins
● Query latency
○ ms
● QPS: high

Challenges
Infrastructure
● 100+TB Storage
● Multi-tenancy
● 99.99% SLA on availability
● 99.9% SLA on data accuracy
● ms to sec level query latency
● sec to min level ingestion latency
● Geo-spatial query
● GDPR
Accessibility
● Query Language
● Table DDL
● Table SLA
Operation
● 100+ Tables
● Multiple Data Centers
● Schema Evolution
● Data Backfill
Productivity
● Target Audience
○ Ops
○ Data Scientists
○ Engineers
● Integration
○ Data Management
○ Dashboarding
○ Reporting
○ Monitoring

RTA Query
● Adhoc
○ Presto as Federation layer
○ Joins
● Pre-defined
○ Optimization
○ Multi-tenancy
○ Rate-Limiting
○ Caching

Facets of Analytical Data
Data Freshness
Query Latency
Data Retention
Accuracy
Cost
Primary Facets
Secondary Facet

Facets of Analytical Data
Cost
Fresh Data
+
Accuracy
+
High Retention

RTA Storage
Pinot
● Columnar OLAP store open-sourced by LinkedIn
● Intended for low QPS, large data volume with low-to-medium query latency
● Use cases: ad-hoc queries that are not highly latency sensitive
gForceDB
● GPU-based analytical database built in-house
● Intended for high QPS, low data volume with very low query latency
● Use cases: predefined queries that are latency sensitive

RTA UMS (Unified Metadata Service)
[Diagram: Logical Schema / Logical DDL mapped to Pinot Schema / Pinot DDL and gForceDB Schema / gForceDB DDL]
● Onboarding
● Query Routing
● Federation
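One way to picture the query-routing role of a metadata service is the Python sketch below. It is not the UMS implementation or API: QueryKind, LogicalTable, route, and the preference order are assumptions made for illustration, loosely following the storage split described earlier (Pinot for ad-hoc queries, gForceDB for predefined, latency-sensitive ones).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class QueryKind(Enum):
    ADHOC = "adhoc"
    PREDEFINED = "predefined"


# Assumed preference order per query kind; the fallback order is hypothetical.
PREFERRED_BACKENDS = {
    QueryKind.ADHOC: ("Pinot", "gForceDB"),
    QueryKind.PREDEFINED: ("gForceDB", "Pinot"),
}


@dataclass(frozen=True)
class LogicalTable:
    name: str
    physical_tables: Dict[str, str]  # backend name -> physical table name


def route(catalog: Dict[str, LogicalTable], table: str,
          kind: QueryKind) -> Tuple[str, str]:
    """Resolve a logical table to (backend, physical table) for this query kind."""
    entry = catalog[table]  # raises KeyError if the table was never onboarded
    for backend in PREFERRED_BACKENDS[kind]:
        if backend in entry.physical_tables:
            return backend, entry.physical_tables[backend]
    raise LookupError(f"{table} has no physical table in any backend")
```

Callers only ever name the logical table, which is the point of hiding the storage layer behind one metadata service.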

RTA Ingestion
● Leverage existing AthenaX framework
● One SQL for Streaming & Batch

Conclusion
● One onboarding story
○ Unified ingestion pipeline
○ SQL as only query language
○ Hide storage complexity from the end users
● Cost efficiency

Future Work
● Multi-Zone
● Streaming-Batch Unification
● Open Source

Links
Blog Posts
● uReplicator: Uber Engineering’s Robust Kafka Replicator
● Introducing Chaperone: How Uber Engineering Audits Kafka End-to-End
● Introducing AthenaX, Uber Engineering’s Open Source Streaming Analytics Platform
● Engineering Restaurant Manager, our UberEATS Analytics Dashboard
Open Source
● Kafka uReplicator open sourced in Aug 2016
● Kafka Chaperone open sourced in Dec 2016
● AthenaX open sourced in Oct 2017

Thank you
More open-source projects at eng.uber.com
