Runtime Improvements in Blink for Large Scale Streaming at Alibaba
Feng	Wang
Zhijiang Wang
April 2017
Outline
1 Blink Introduction
2 Improvements to Flink Runtime
3 Future Plans
Blink Introduction
Section 1
Blink – Alibaba’s version of Flink
✓ Started looking into Flink two years ago
• the best choice for a unified computing engine
• a few issues in Flink that can be problems for large-scale applications
✓ Started the Blink project
• aimed to make Flink work reliably and efficiently at the very large scale at Alibaba
✓ Made various improvements to the Flink runtime
• runs natively on YARN clusters
• failover optimizations for fast recovery
• incremental checkpoints for large state
• async operator for high throughput
✓ Working with the Flink community to contribute changes back since last August
• several key improvements
• hundreds of patches
Blink Ecosystem in Alibaba
[Diagram: the Blink runtime engine, with its DataStream API, DataSet API, Table API and SQL layers, runs on Hadoop YARN (resource management) and HDFS (persistent storage) and powers Alibaba applications such as Search, Recommendation, BI, Security, Ads and the Machine Learning Platform.]
Blink in Alibaba Production
✓ In production for almost one year
✓ Runs on thousands of nodes
• hundreds of jobs
• the biggest cluster has more than 1,000 nodes
• the biggest job has tens of TBs of state and thousands of subtasks
✓ Supported key production services on last Nov 11th, China Singles' Day
• Singles' Day is by far the biggest shopping holiday in China, similar to Black Friday in the US
• last year it recorded $17.8 billion of gross merchandise volume in one day
• Blink is used for real-time machine learning and increased conversion by around 30%
Blink Architecture
[Diagram: the Flink client submits the job to a YARN Application Master that hosts the Resource Manager, Job Manager and Web Monitor. The Job Manager handles task scheduling and checkpoint coordination and requests/launches Task Managers; the Task Managers run tasks with a RocksDB state backend that checkpoints incrementally to HDFS (persistent storage). Connectors read from and write to the Alibaba data lake, a metric reporter sends metrics to the Alibaba monitoring system, and the Web Monitor is used for debugging. The components are split between Apache Flink and Alibaba Blink additions.]
Improvements to Flink Runtime
Section 2
Improvements to Flink Runtime
✓ Native integration with Resource Management
• taking YARN as an example
✓ Performance Improvements
• Incremental Checkpoint
• Asynchronous Operator
✓ Failover Optimization
• Fine-grained Recovery from Task Failures
• Allocation Reuse for Task Recovery
• Non-disruptive JobManager Failure Recovery
Native integration with Resource Management
✓ Background
• cluster resources are allocated upfront, so resource utilization can be inefficient
• a single JobManager handles all the jobs, which limits the scale of the cluster
Native integration with Resource Management
[Diagram: per-job slot allocation]
1. The JobManager (one per job) requests a slot from the ResourceManager (standalone, YARN, Mesos, …) with a resource spec.
2. The ResourceManager launches a TaskManager.
3. The TaskManager registers with the ResourceManager.
4. The ResourceManager requests a slot from the TaskManager.
5. The TaskManager offers the slot to the JobManager.
Native integration with YARN
[Diagram: per-job deployment on a YARN cluster]
1. The client is asked to execute the job (JobGraph and libraries).
2. The client submits the job (JobGraph, libraries) to the YARN cluster.
3. YARN launches the application master with the JobGraph and libraries.
4. The application master starts the JobManager.
5. The application master starts the ResourceManager.
6. The JobManager requests slots (resource spec) from the ResourceManager.
7. The ResourceManager requests containers from the YARN ResourceManager.
8. YARN launches the TaskManager containers (with the libraries).
9. Each TaskManager registers with the ResourceManager.
10. The ResourceManager requests a slot from the TaskManager.
11. The TaskManager offers the slot to the JobManager.
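Below is a minimal, purely illustrative Java sketch of the slot handshake in steps 6-11; the class and method names are hypothetical stand-ins, not Flink's actual runtime classes.

```java
// Illustrative sketch of the per-job slot request/offer handshake (steps 6-11).
import java.util.ArrayDeque;
import java.util.Queue;

public class SlotHandshakeSketch {

    /** Resources asked for in a slot request (step 6). */
    static final class ResourceSpec {
        final double cpuCores;
        final int memoryMb;
        ResourceSpec(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** Per-job ResourceManager: turns slot requests into container requests (step 7). */
    static final class ResourceManager {
        private final Queue<ResourceSpec> pendingRequests = new ArrayDeque<>();

        void requestSlot(ResourceSpec spec) {
            pendingRequests.add(spec);
            System.out.println("RM: requesting a YARN container for " + spec.memoryMb + " MB");
        }

        ResourceSpec pollPendingRequest() {
            return pendingRequests.poll();
        }
    }

    /** TaskManager launched in a YARN container (step 8). */
    static final class TaskManager {
        void registerAndOfferSlot(ResourceManager rm) {
            ResourceSpec spec = rm.pollPendingRequest();   // steps 9-10
            if (spec != null) {
                System.out.println("TM: offering a slot with " + spec.memoryMb
                        + " MB to the JobManager");        // step 11
            }
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.requestSlot(new ResourceSpec(1.0, 4096));   // the JobManager's slot request
        new TaskManager().registerAndOfferSlot(rm);
    }
}
```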
Incremental Checkpoint
✓ Background
• the job state can be very large (many TBs)
• the state size of an individual task can be many GBs
✓ The problems of full checkpoints
• all the states are materialized to the persistent store at each checkpoint
• as the states get bigger, materialization may take too long to finish
• one of the biggest blocking issues in large-scale production
✓ Benefits of incremental checkpoints
• only the states modified since the last checkpoint need to be materialized
• checkpoints become faster and more efficient
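For reference, the sketch below shows how incremental checkpoints with the RocksDB state backend can be enabled through the open-source Apache Flink API (available from Flink 1.3 on); the checkpoint path and interval are placeholders, and Blink's internal configuration may differ.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // The second constructor argument enables incremental checkpoints:
        // only new or changed RocksDB sst files are uploaded per checkpoint.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... define sources, transformations and sinks, then:
        // env.execute("incremental-checkpoint-job");
    }
}
```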
Incremental Checkpoint – How It Works
[Diagram: incremental checkpoint flow between the Checkpoint Coordinator (with its State Registry of delta reference counts), a Task, and the persistent state storage]
1. The Checkpoint Coordinator triggers a checkpoint on the Task.
2. The Task takes a snapshot of its state; the snapshot consists of deltas that already exist (and are already materialized) plus new deltas.
3. The new deltas are materialized asynchronously to the persistent state storage.
4. The Task sends the state handles to the pending checkpoint on the Checkpoint Coordinator.
5. After the checkpoint completes, the State Registry increases the reference count of every delta it references (e.g. Delta-A: 2->3; new deltas such as Delta-E start at 1).
6. After the checkpoint is subsumed by a newer one, the reference counts of its deltas are decreased.
7. A delta is deleted from the persistent storage once its reference count reaches 0.
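A minimal sketch of the reference counting performed in steps 5-7, with hypothetical class and method names rather than the actual coordinator code:

```java
// Tracks how many completed checkpoints still reference each state delta and
// deletes a delta from persistent storage once its reference count reaches 0.
import java.util.HashMap;
import java.util.Map;

public class DeltaStateRegistry {

    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Step 5: a newly completed checkpoint references this delta. */
    public synchronized void register(String deltaId) {
        referenceCounts.merge(deltaId, 1, Integer::sum);
    }

    /** Steps 6-7: a subsumed checkpoint releases the delta. */
    public synchronized void unregister(String deltaId) {
        int remaining = referenceCounts.merge(deltaId, -1, Integer::sum);
        if (remaining <= 0) {
            referenceCounts.remove(deltaId);
            deleteFromPersistentStorage(deltaId);
        }
    }

    private void deleteFromPersistentStorage(String deltaId) {
        // Placeholder for removing the delta file from HDFS or another store.
        System.out.println("deleting " + deltaId);
    }
}
```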
Incremental Checkpoint - RocksDB State Backend Implementation
[Diagram: the RocksDB instance in the TaskManager for KeyGroups [0-m] stores its data as .sst files. Checkpoint chk-1 references 1.sst, 2.sst and 3.sst; checkpoint chk-2 references 2.sst, 3.sst and 4.sst, so only 4.sst has to be uploaded for chk-2. The files are stored under HDFS://user/JobID/OperatorID/KeyGroup_[0-m] as 1.sst.chk-1, 2.sst.chk-1, 3.sst.chk-1 and 4.sst.chk-2, and the JobManager keeps the state handles of completed checkpoints 1 and 2.]
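The core of the incremental upload decision can be illustrated with a small, hypothetical helper: only the sst files that the previous checkpoint did not already reference are materialized to HDFS.

```java
// Computes which RocksDB sst files are new relative to the previous checkpoint.
import java.util.HashSet;
import java.util.Set;

public class SstDeltaComputer {

    /** Files in the current snapshot that the previous checkpoint did not reference. */
    public static Set<String> filesToUpload(Set<String> currentSstFiles,
                                            Set<String> previouslyUploaded) {
        Set<String> delta = new HashSet<>(currentSstFiles);
        delta.removeAll(previouslyUploaded);
        return delta;
    }

    public static void main(String[] args) {
        Set<String> chk1 = Set.of("1.sst", "2.sst", "3.sst");
        Set<String> chk2 = Set.of("2.sst", "3.sst", "4.sst");
        // Prints [4.sst]: only the new file is uploaded for checkpoint 2.
        System.out.println(filesToUpload(chk2, chk1));
    }
}
```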
Asynchronous Operator
✓ Background
• a Flink task uses a single thread to process events
• Flink tasks sometimes need to access external services (HBase, Redis, …)
✓ Problem
• the high latency of external calls can block event processing
• throughput can be limited by latency
✓ Benefits of the asynchronous operator
• decouple the throughput of the task from the latency of the external system
• improve the CPU utilization of the Flink tasks
• simplify resource specification for Flink jobs
Asynchronous Operator – How It Works
[Diagram: the AsyncOperator (ordered/unordered) with its AsyncQueue, the user's AsyncFunction and async client, and an external system such as HBase, Cassandra or Redis]
1. processElement(InputRecord) is called on the AsyncOperator.
2. A StreamRecordQueueEntry for the input record is added to the AsyncQueue (ordered or unordered); the entry implements AsyncCollector and will later hold the OutputRecord.
3. asyncInvoke(InputRecord, StreamRecordQueueEntry) is called on the AsyncFunction.
4. The AsyncFunction issues an asynchronous request through the user's async client.
5. The client sends the request to the external system.
6. The client returns a ListenableFuture.
7. A listener is added to the future.
8. The asynchronous response arrives.
9. The listener is notified.
10. The listener calls StreamRecordQueueEntry.collect(OutputRecord).
11. The emit thread is notified.
12. The emit thread emits the OutputRecord downstream.
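As a concrete illustration, the sketch below enriches a stream of keys against an external key-value store using the async I/O API as it looked in open-source Apache Flink 1.2 (later versions replaced AsyncCollector with ResultFuture); InMemoryAsyncClient is a hypothetical placeholder for a real non-blocking client such as an async HBase or Redis client.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class EnrichWithExternalStore extends RichAsyncFunction<String, String> {

    /** Hypothetical stand-in for a real non-blocking client. */
    static final class InMemoryAsyncClient {
        CompletableFuture<String> get(String key) {
            return CompletableFuture.completedFuture("value-of-" + key);
        }
    }

    private transient InMemoryAsyncClient client;

    @Override
    public void open(Configuration parameters) {
        client = new InMemoryAsyncClient();
    }

    @Override
    public void asyncInvoke(String key, AsyncCollector<String> collector) {
        // Fire the request without blocking the task thread (steps 3-5 above).
        client.get(key).whenComplete((value, error) -> {
            if (error != null) {
                collector.collect(error);
            } else {
                // Completes the queue entry; the emit thread outputs it later (steps 10-12).
                collector.collect(Collections.singletonList(key + "," + value));
            }
        });
    }

    public static DataStream<String> enrich(DataStream<String> keys) {
        // Up to 100 in-flight requests, 1 second timeout, results in arrival order.
        return AsyncDataStream.unorderedWait(
                keys, new EnrichWithExternalStore(), 1000, TimeUnit.MILLISECONDS, 100);
    }
}
```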
Asynchronous Operator – How It Manages State
[Diagram: snapshot and recovery of the AsyncQueue in the Task]
Snapshot:
1. snapshotState is called on the AsyncOperator.
2. The AsyncQueue is scanned.
3. The uncompleted InputRecords are persisted to the state backend.
Recovery:
1. open is called on the AsyncOperator.
2. The uncompleted InputRecords are read back from the state backend.
3. asyncInvoke is called again for each recovered InputRecord.
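The sketch below captures the same idea in plain Java (illustrative only, not the operator's actual code): at snapshot time only the entries whose result has not been emitted yet are persisted, and after recovery each restored input is submitted to asyncInvoke again.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

public class AsyncQueueSketch<IN> {

    /** One in-flight element: the input plus a flag set once its result was emitted. */
    static final class Entry<T> {
        final T input;
        boolean completed;
        Entry(T input) { this.input = input; }
    }

    private final Queue<Entry<IN>> queue = new ArrayDeque<>();

    /** processElement: track the record until its async result has been emitted. */
    public Entry<IN> add(IN input) {
        Entry<IN> entry = new Entry<>(input);
        queue.add(entry);
        return entry;
    }

    /** snapshotState: scan the queue and persist only the uncompleted inputs. */
    public List<IN> snapshotUncompleted() {
        List<IN> pending = new ArrayList<>();
        for (Entry<IN> entry : queue) {
            if (!entry.completed) {
                pending.add(entry.input);
            }
        }
        return pending;
    }

    /** open after recovery: re-issue asyncInvoke for every restored input. */
    public void recover(List<IN> restoredInputs, Consumer<IN> asyncInvoke) {
        for (IN input : restoredInputs) {
            add(input);
            asyncInvoke.accept(input);
        }
    }
}
```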
Fine-grained Recovery from Task Failures
✓ Status of batch jobs
• large scale, up to thousands of nodes
• node failures are common in large clusters
• non-pipelined mode is preferred due to limited resources
✓ Problems
• one task failure requires restarting the entire execution graph
• this is especially critical for batch jobs
✓ Benefit
• make recovery more efficient by restarting only what needs to be restarted
Fine-grained Recovery from Task Failures – Restart-all Strategy
[Diagram: an ExecutionGraph with subtasks A1..F1 and A2..F2 connected by pipelined and cached results. With the restart-all FailoverStrategy, a single task-level failure (C1 fails) is turned into job-level fail/cancel commands, so every task in the graph is cancelled and restarted.]
Fine-grained Recovery from Task Failures – Region-based Strategy
[Diagram: the same ExecutionGraph split by the region-based FailoverStrategy into failover regions along cached-result boundaries: {A1, B1}, {C1, D1, E1, F1}, {A2, B2} and {C2, D2, E2, F2}. On a task failure (C1 fails), only the region {C1, D1, E1, F1} is restarted; on a data failure (the intermediate result produced by B1 is corrupt), the producing region {A1, B1} is restarted as well.]
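The following is a hedged sketch of the idea behind computing failover regions (a hypothetical helper, not the real scheduler code): vertices connected by pipelined results must fail over together and are merged into one region with a union-find structure, while cached (blocking) results form the region boundaries.

```java
import java.util.HashMap;
import java.util.Map;

public class FailoverRegions {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String vertex) {
        parent.putIfAbsent(vertex, vertex);
        String root = parent.get(vertex);
        if (!root.equals(vertex)) {
            root = find(root);
            parent.put(vertex, root);   // path compression
        }
        return root;
    }

    /** Call for every pipelined edge; cached/blocking edges are not merged. */
    public void connectPipelined(String producer, String consumer) {
        parent.put(find(producer), find(consumer));
    }

    /** Two vertices share a failover region iff they have the same root. */
    public String regionOf(String vertex) {
        return find(vertex);
    }

    public static void main(String[] args) {
        FailoverRegions regions = new FailoverRegions();
        regions.connectPipelined("C1", "D1");   // pipelined results within one region
        regions.connectPipelined("C1", "E1");
        regions.connectPipelined("D1", "F1");
        // B1 -> C1 is a cached result, so B1 stays in a different region.
        System.out.println(regions.regionOf("C1").equals(regions.regionOf("F1"))); // true
        System.out.println(regions.regionOf("B1").equals(regions.regionOf("C1"))); // false
    }
}
```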
Allocation Reuse for Task Recovery
✓ Background
• the job state can be very big at Alibaba, hence RocksDB is used as the state backend
• state restore with the RocksDB backend involves copying data from HDFS
✓ Problem
• it is expensive to restore state from HDFS during task recovery
✓ Benefits of allocation reuse
• deploy the restarted task in its previous allocation to speed up recovery
• restore state from the local RocksDB instance to avoid copying data from HDFS
Allocation Reuse for Task Recovery – How It Works
[Diagram: inside the JobManager, an ExecutionVertex keeps a list of its prior executions. When execution attempt 1 fails, its slot is returned to the SlotPool (1. return allocated slot, 2. remove it from the allocated slots, 3. add it to the available slots, which are kept in a Map<TMLocation, Set<AllocatedSlot>>). The failed execution is added to the prior-execution list (4), the new execution attempt 2 reads its preferred location from that list (5) and allocates a slot with (TMLocation, ResourceProfile) (6); the SlotPool matches the request against the available slots on that TaskManager and returns the matching slot (7).]
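A simplified sketch of the SlotPool lookup above (with hypothetical names): available slots are indexed by TaskManager location, so a recovering task can be matched to a slot on the machine that still holds its local RocksDB files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class SlotPoolSketch {

    static final class AllocatedSlot {
        final String tmLocation;
        AllocatedSlot(String tmLocation) { this.tmLocation = tmLocation; }
    }

    // The Map<TMLocation, Set<AllocatedSlot>> from the diagram.
    private final Map<String, Set<AllocatedSlot>> availableSlots = new HashMap<>();

    /** Steps 1-3: a slot released by a failed execution becomes available again. */
    public void returnSlot(AllocatedSlot slot) {
        availableSlots.computeIfAbsent(slot.tmLocation, location -> new HashSet<>()).add(slot);
    }

    /** Steps 5-7: prefer a slot on the task's previous TaskManager, if one is free. */
    public AllocatedSlot allocate(String preferredLocation) {
        Set<AllocatedSlot> slotsOnPreferredTm = availableSlots.get(preferredLocation);
        if (slotsOnPreferredTm != null && !slotsOnPreferredTm.isEmpty()) {
            Iterator<AllocatedSlot> it = slotsOnPreferredTm.iterator();
            AllocatedSlot slot = it.next();
            it.remove();
            return slot;    // local state restore, no HDFS download needed
        }
        return null;        // no local slot: fall back to allocating elsewhere
    }
}
```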
Non-disruptive JobManager Failures via Reconciliation
✓ Background
• the job runs at large scale, on thousands of nodes
• the job state can be very big, at the TB level
✓ Problems
• all the tasks are restarted when the JobManager fails
• the cost of recovering the states is high
✓ Benefit
• improve job stability through automatic reconciliation, avoiding task restarts
Non-disruptive JobManager Failures via Reconciliation – How It Works
[Diagram: after a leader change, the new JobManagerRunner checks the RunningJobsRegistry. If isJobRunning = true, the ExecutionGraph is created in the RECONCILING state; the TaskManagers register with the new JobManager, offer their slots and report the status of their running tasks, with a timer bounding the reconciliation. Once reconciliation finishes, the ExecutionGraph transitions to RUNNING (or FINISHED/FAILING) without restarting the tasks. If isJobRunning = false, the JobManager simply schedules the job.]
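A high-level sketch of the decision the new JobManager makes after a leader change, as described above; all names are illustrative, not the actual runtime classes.

```java
public class ReconciliationSketch {

    enum JobStatus { RECONCILING, RUNNING, FAILING }

    /** Records whether the job was already running before the leader change. */
    interface RunningJobsRegistry {
        boolean isJobRunning(String jobId);
    }

    static JobStatus onLeaderChange(String jobId,
                                    RunningJobsRegistry registry,
                                    boolean allTasksReconciledBeforeTimeout) {
        if (!registry.isJobRunning(jobId)) {
            // The job was not running yet: schedule it normally.
            return JobStatus.RUNNING;
        }
        // The job was already running: enter RECONCILING and wait for the
        // TaskManagers to register, offer slots and report task status.
        if (allTasksReconciledBeforeTimeout) {
            return JobStatus.RUNNING;   // adopt the running tasks, no restart
        }
        return JobStatus.FAILING;       // reconciliation failed, fall back to restart
    }
}
```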
Future Plans
Section 3
Future Plans
✓ Blink is already popular in streaming scenarios
• more and more streaming applications will run on Blink
✓ Make batch applications run in production
• increase the resource utilization of the clusters
✓ Blink as a service
• Alibaba Group wide
✓ The cluster is growing very fast
• the cluster size will double
• thousands of jobs will run in production
Thanks
We are Hiring!
aliserach-recruiting@list.alibaba-inc.com
