AthenaX is Uber's unified stream and batch processing platform built on Apache Flink SQL. It allows users to declaratively express both streaming and batch logic using SQL, enabling real-time analytics and reprocessing of data from data lakes. AthenaX provides self-service tools for composing, deploying, and managing jobs that can scale to over 1,000 production jobs processing over 1 trillion messages per day. Future work includes contributing to Flink's unified catalog and security APIs.
AthenaX: Unified Stream & Batch Processing using Flink SQL at Uber
1. AthenaX: Unified Stream & Batch
Processing using Flink SQL at Uber
Peter Huang
Senior Engineer | Uber Streaming Platform
Feb 21, 2019
2. ● Use Cases
● AthenaX Overview
● System Deep Dive
● Future Work
Agenda
3. Feature Computation
Real-time Feature Extraction
● Extract estimated delivery time across different time dimensions
Offline Learning
● Features are ingested into the data lake for offline learning
● Online data is fed into the offline-learned model for better predictions
Use Cases
● Uber Eats, surge multiplier
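As a sketch of what such a feature-extraction job can look like in Flink SQL (the table `eats_orders` and its columns are hypothetical, not from the talk), a group-window aggregate computes a delivery-time feature per restaurant per minute; wider time dimensions follow the same pattern with a different `INTERVAL`:

```sql
-- Hypothetical feature-extraction query: average estimated delivery time
-- per restaurant over 1-minute event-time windows.
SELECT
  restaurant_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(delivery_seconds) AS avg_delivery_seconds
FROM eats_orders
GROUP BY
  restaurant_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```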
4. Dashboarding
Pre-window Aggregation
● Pre-aggregation reduces query time
Push to OLAP
● Real-time data and offline data are combined to serve queries
Backfill
● Backfill data to OLAP engines to improve the data compression rate
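A minimal sketch of pre-window aggregation feeding an OLAP store (table and column names are illustrative assumptions, not from the talk): raw events are rolled up into coarse buckets before the push, so dashboard queries scan pre-aggregated rows instead of raw events:

```sql
-- Illustrative pre-aggregation: roll raw trip events up into 1-minute
-- buckets per city before pushing them to the OLAP sink.
INSERT INTO olap_trip_counts
SELECT
  city_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS bucket_start,
  COUNT(*) AS trip_count
FROM trips
GROUP BY
  city_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```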
7. Development Pain Points
Hard to Bootstrap
● Schema and Catalog
● Logic composition
● Integration with the ecosystem
Duplicated Logic
● Stream Logic in Streaming Engines
● Batch Logic in Batch Engines
Hard to Maintain
● Logic must evolve on both the stream and batch sides
9. Challenges of Building the Platform
Scale
● > 1 trillion real-time messages per day
● 1,000+ streaming jobs
Requirements
● At-least-once processing
● Exactly-once processing
● 99.99% SLA on availability
● 99.99% SLA on sub-second latency
Audience
● Operations staff
● Data scientists
● Engineers
Integration
● Logging
● Backend services
● OLAP engines
● Data management
Development
● Self-service
● Backfill support
● Unified processing
Operation
● Multiple data centers (DCs)
● Monitoring and alerts
● Failure recovery
● Auto scaling
● DC failover
15. Self-Service Job Deployment
● Sandbox
○ Functional correctness
○ Experiment with SQL
● Staging
○ System-generated estimation
○ Production traffic
● Production
○ Metrics and alerts are managed
16. Batch Support
● On-demand Backfill
○ Specify the start time and end time
○ Handles buggy code and transient sink failures
● Periodic Reprocessing
○ Integrated with an external scheduler system
○ Customers provide the schedule period and the number of days of data to reprocess
○ Periodically reprocesses data from Hive with the same SQL
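The "same SQL" point can be sketched as follows (the table names and the partition column `datestr` are assumptions for illustration): in a backfill, the query body is unchanged and only the source switches from the live stream to a bounded slice of the Hive-backed table:

```sql
-- Illustrative backfill: the same aggregation as the streaming job,
-- but reading a bounded date range from a Hive-backed table instead
-- of the live stream.
INSERT INTO olap_trip_counts
SELECT
  city_id,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS bucket_start,
  COUNT(*) AS trip_count
FROM hive_trips
WHERE datestr BETWEEN '2019-02-01' AND '2019-02-07'  -- backfill window
GROUP BY
  city_id,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
```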
22. Flink Batch Data Locality
● The Apache Flink JobManager distributes input splits with data location taken into account
● AthenaX backfill jobs launch in the Apache Hadoop cluster
23. Batch Runtime Optimization
[Execution graph: Source → FlatMap → Sort/Combine → GroupReduce → Sink, replicated across Task Manager A and Task Manager B, with a hash partition on the aggregation key between the combine and group-reduce stages]
● The intermediate result can be reduced: rows sharing the same aggregation keys (e.g. Key1 = a, Key2 = b) are pre-combined on the map side before being shuffled to the reducers
24. Operation Experience
Memory Tuning
● Use less direct memory for internal operator communication
● taskmanager.memory.off-heap: true,
taskmanager.memory.fraction: 0.5 (from experience)
Resource Estimation
● 20K+ messages per second throughput per Task Manager
● Bounded by the throughput of sink services
Rate Limiting
● SQL with window aggregation generates output bursts at the sink
● A generic rate limiter interface for all batch table sinks
25. AthenaX
Unified Processing with Flink SQL
● Declaratively specifies business logic
● Delivers business impact in real time
● Reprocesses data from Hive
Self-Service
● Interactive UI for job creation / modification
● Dashboard UI for deployment and workflow
Centralized Management
● Scales to 1,000+ jobs in production
● Operational automation
Productivity
● Customer onboarding to production takes hours
Standard SQL | Easy to Use | Fully Managed | Real-time | Auto Elasticity
27. Future Work
● Actively work with the community on FLIP-30: Unified Catalog APIs
● Contribute to the unified APIs for batch and stream processing
● Contribute to Flink security improvements