AthenaX is Uber's unified stream and batch processing platform built on Apache Flink SQL. It allows users to declaratively express both streaming and batch logic using SQL, enabling real-time analytics and reprocessing of data from data lakes. AthenaX provides self-service tools for composing, deploying, and managing jobs that can scale to over 1,000 production jobs processing over 1 trillion messages per day. Future work includes contributing to Flink's unified catalog and security APIs.
AthenaX: Unified Stream & Batch Processing using Flink SQL at Uber
1. AthenaX: Unified Stream & Batch
Processing using Flink SQL at Uber
Peter Huang
Senior Engineer | Uber Streaming Platform
Feb 21, 2019
2. ● Use Cases
● AthenaX Overview
● System Deep Dive
● Future Work
Agenda
3. Feature Computation
Real-time Feature Extraction
● Extract estimated delivery time across different time dimensions
Offline Learning
● Features are ingested into the data lake for offline learning
● Online data is fed into the offline-learned model for better predictions
Use Cases
● Uber Eats, surge multiplier
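As a sketch of what such a feature-extraction job can look like in Flink SQL (the table `eats_orders` and its columns are hypothetical, not from the talk), a group-window aggregate computes a delivery-time feature per restaurant per minute; wider time dimensions follow the same pattern with a different `INTERVAL`:

```sql
-- Hypothetical feature-extraction query: average estimated delivery time
-- per restaurant over 1-minute event-time windows.
SELECT
  restaurant_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(delivery_seconds) AS avg_delivery_seconds
FROM eats_orders
GROUP BY
  restaurant_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```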
4. Dashboarding
Pre-window Aggregation
● Pre-aggregation reduces query time
Push to OLAP
● Real-time data and offline data are combined to serve queries
Backfill
● Backfill data to OLAP engines to improve the data compression rate
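A minimal sketch of pre-window aggregation feeding an OLAP store (table and column names are illustrative assumptions, not from the talk): raw events are rolled up into coarse buckets before the push, so dashboard queries scan pre-aggregated rows instead of raw events:

```sql
-- Illustrative pre-aggregation: roll raw trip events up into 1-minute
-- buckets per city before pushing them to the OLAP sink.
INSERT INTO olap_trip_counts
SELECT
  city_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS bucket_start,
  COUNT(*) AS trip_count
FROM trips
GROUP BY
  city_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```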
7. Development Pain Points
Hard to Bootstrap
● Schema and Catalog
● Logic composition
● Integration with the ecosystem
Duplicated Logic
● Stream Logic in Streaming Engines
● Batch Logic in Batch Engines
Hard to Maintain
● Logic must evolve on both the stream and batch sides
9. Challenges of Building the Platform
Scale
● > 1 trillion real-time messages per day
● 1,000+ streaming jobs
Requirements
● At-least-once processing
● Exactly-once processing
● 99.99% SLA on availability
● 99.99% SLA on sub-second latency
Audience
● Operations staff
● Data scientists
● Engineers
Integration
● Logging
● Backend services
● OLAP engines
● Data management
Development
● Self-service
● Backfill support
● Unified processing
Operation
● Multiple data centers (DCs)
● Monitoring and alerts
● Failure recovery
● Auto scaling
● DC failover
15. Self-Service Job Deployment
● Sandbox
○ Functional correctness
○ Experiment with SQL
● Staging
○ System-generated estimation
○ Production traffic
● Production
○ Metrics and alerts are managed
16. Batch Support
● On-demand Backfill
○ Specify the start time and end time
○ Handles buggy code and transient sink failures
● Periodic Reprocessing
○ Integrated with an external scheduler system
○ Customers provide the schedule period and the number of days of data to reprocess
○ Periodically reprocesses data from Hive with the same SQL
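The "same SQL" point can be sketched as follows (the table names and the partition column `datestr` are assumptions for illustration): in a backfill, the query body is unchanged and only the source switches from the live stream to a bounded slice of the Hive-backed table:

```sql
-- Illustrative backfill: the same aggregation as the streaming job,
-- but reading a bounded date range from a Hive-backed table instead
-- of the live stream.
INSERT INTO olap_trip_counts
SELECT
  city_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS bucket_start,
  COUNT(*) AS trip_count
FROM hive_trips
WHERE datestr BETWEEN '2019-02-01' AND '2019-02-07'  -- backfill window
GROUP BY
  city_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```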
22. Flink Batch Data Locality
● The Apache Flink JobManager distributes input splits with data location taken into account
● AthenaX backfill jobs launch in the Apache Hadoop cluster
23. Batch Runtime Optimization
[Execution graph: Source → FlatMap → Sort/Combine → GroupReduce → Sink, replicated across Task Manager A and Task Manager B, with a hash partition on the aggregation key between the combine and group-reduce stages]
● The intermediate result can be reduced: rows sharing the same aggregation keys (e.g. Key1 = a, Key2 = b) are pre-combined on the map side before being shuffled to the reducers
24. Operation Experience
Memory Tuning
● Use less direct memory for internal operator communication
● taskmanager.memory.off-heap: true,
taskmanager.memory.fraction: 0.5 (from experience)
Resource Estimation
● 20K+ messages per second throughput per Task Manager
● Bounded by the throughput of sink services
Rate Limiting
● SQL with window aggregation generates output bursts at the sink
● A generic rate limiter interface for all batch table sinks
25. AthenaX
Unified Processing with Flink SQL
● Declaratively specifies business logic
● Delivers business impact in real time
● Reprocesses data from Hive
Self-Service
● Interactive UI for job creation / modification
● Dashboard UI for deployment and workflow
Centralized Management
● Scales to 1,000+ jobs in production
● Operational automation
Productivity
● Customer onboarding to production takes hours
Standard SQL | Easy to Use | Fully Managed | Real-time | Auto Elasticity
27. Future Work
● Actively work with the community on FLIP-30: Unified Catalog APIs
● Contribute to the unified APIs for batch and stream processing
● Contribute to Flink security improvements