Runtime Improvements in Blink for Large Scale Streaming at Alibaba
Feng	Wang
Zhijiang Wang
April 2017
Outline
1 Blink Introduction
2 Improvements to Flink Runtime
3 Future Plans
Blink Introduction
Section 1
Blink – Alibaba’s version of Flink
✓ Started looking into Flink two years ago
• the best choice for a unified computing engine
• a few issues in Flink that can be problems for large-scale applications
✓ Started the Blink project
• aimed to make Flink work reliably and efficiently at the very large scale at Alibaba
✓ Made various improvements to the Flink runtime
• runs natively on YARN clusters
• failover optimizations for fast recovery
• incremental checkpoints for large state
• async operator for high throughput
✓ Working with the Flink community to contribute changes back since last August
• several key improvements
• hundreds of patches
Blink Ecosystem in Alibaba
[Diagram: the Blink runtime engine, with its DataStream API, DataSet API, Table API and SQL layers, runs on Hadoop YARN (resource management) and HDFS (persistent storage) and powers Alibaba applications such as Search, Recommendation, BI, Security, Ads and the Machine Learning Platform.]
Blink in Alibaba Production
✓ In production for almost one year
✓ Runs on thousands of nodes
• hundreds of jobs
• the biggest cluster has more than 1,000 nodes
• the biggest job has tens of TBs of state and thousands of subtasks
✓ Supported key production services on last Nov 11th, China Singles' Day
• Singles' Day is by far the biggest shopping holiday in China, similar to Black Friday in the US
• last year it recorded $17.8 billion of gross merchandise volume in one day
• Blink is used for real-time machine learning and increased conversion by around 30%
Blink Architecture
[Diagram: the Flink client submits the job to a YARN Application Master that hosts the Resource Manager, Job Manager and Web Monitor. The Job Manager handles task scheduling and checkpoint coordination and requests/launches Task Managers; the Task Managers run tasks with a RocksDB state backend that checkpoints incrementally to HDFS (persistent storage). Connectors read from and write to the Alibaba data lake, a metric reporter sends metrics to the Alibaba monitoring system, and the Web Monitor is used for debugging. The components are split between Apache Flink and Alibaba Blink additions.]
Improvements to Flink Runtime
Section 2
Improvements to Flink Runtime
✓ Native integration with Resource Management
• taking YARN as an example
✓ Performance Improvements
• Incremental Checkpoint
• Asynchronous Operator
✓ Failover Optimization
• Fine-grained Recovery from Task Failures
• Allocation Reuse for Task Recovery
• Non-disruptive JobManager Failure Recovery
Native integration with Resource Management
✓ Background
• cluster resources are allocated upfront, so resource utilization can be inefficient
• a single JobManager handles all the jobs, which limits the scale of the cluster
Native integration with Resource Management
[Diagram: per-job slot allocation]
1. The JobManager (one per job) requests a slot from the ResourceManager (standalone, YARN, Mesos, …) with a resource spec.
2. The ResourceManager launches a TaskManager.
3. The TaskManager registers with the ResourceManager.
4. The ResourceManager requests a slot from the TaskManager.
5. The TaskManager offers the slot to the JobManager.
Native integration with YARN
[Diagram: per-job deployment on a YARN cluster]
1. The client is asked to execute the job (JobGraph and libraries).
2. The client submits the job (JobGraph, libraries) to the YARN cluster.
3. YARN launches the application master with the JobGraph and libraries.
4. The application master starts the JobManager.
5. The application master starts the ResourceManager.
6. The JobManager requests slots (resource spec) from the ResourceManager.
7. The ResourceManager requests containers from the YARN ResourceManager.
8. YARN launches the TaskManager containers (with the libraries).
9. Each TaskManager registers with the ResourceManager.
10. The ResourceManager requests a slot from the TaskManager.
11. The TaskManager offers the slot to the JobManager.
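Below is a minimal, purely illustrative Java sketch of the slot handshake in steps 6-11; the class and method names are hypothetical stand-ins, not Flink's actual runtime classes.

```java
// Illustrative sketch of the per-job slot request/offer handshake (steps 6-11).
import java.util.ArrayDeque;
import java.util.Queue;

public class SlotHandshakeSketch {

    /** Resources asked for in a slot request (step 6). */
    static final class ResourceSpec {
        final double cpuCores;
        final int memoryMb;
        ResourceSpec(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** Per-job ResourceManager: turns slot requests into container requests (step 7). */
    static final class ResourceManager {
        private final Queue<ResourceSpec> pendingRequests = new ArrayDeque<>();

        void requestSlot(ResourceSpec spec) {
            pendingRequests.add(spec);
            System.out.println("RM: requesting a YARN container for " + spec.memoryMb + " MB");
        }

        ResourceSpec pollPendingRequest() {
            return pendingRequests.poll();
        }
    }

    /** TaskManager launched in a YARN container (step 8). */
    static final class TaskManager {
        void registerAndOfferSlot(ResourceManager rm) {
            ResourceSpec spec = rm.pollPendingRequest();   // steps 9-10
            if (spec != null) {
                System.out.println("TM: offering a slot with " + spec.memoryMb
                        + " MB to the JobManager");        // step 11
            }
        }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.requestSlot(new ResourceSpec(1.0, 4096));   // the JobManager's slot request
        new TaskManager().registerAndOfferSlot(rm);
    }
}
```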
Incremental Checkpoint
✓ Background
• the job state can be very large (many TBs)
• the state size of an individual task can be many GBs
✓ The problems of full checkpoints
• all the states are materialized to the persistent store at each checkpoint
• as the states get bigger, materialization may take too long to finish
• one of the biggest blocking issues in large-scale production
✓ Benefits of incremental checkpoints
• only the states modified since the last checkpoint need to be materialized
• checkpoints become faster and more efficient
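For reference, the sketch below shows how incremental checkpoints with the RocksDB state backend can be enabled through the open-source Apache Flink API (available from Flink 1.3 on); the checkpoint path and interval are placeholders, and Blink's internal configuration may differ.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // The second constructor argument enables incremental checkpoints:
        // only new or changed RocksDB sst files are uploaded per checkpoint.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... define sources, transformations and sinks, then:
        // env.execute("incremental-checkpoint-job");
    }
}
```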
Incremental Checkpoint – How It Works
[Diagram: incremental checkpoint flow between the Checkpoint Coordinator (with its State Registry of delta reference counts), a Task, and the persistent state storage]
1. The Checkpoint Coordinator triggers a checkpoint on the Task.
2. The Task takes a snapshot of its state; the snapshot consists of deltas that already exist (and are already materialized) plus new deltas.
3. The new deltas are materialized asynchronously to the persistent state storage.
4. The Task sends the state handles to the pending checkpoint on the Checkpoint Coordinator.
5. After the checkpoint completes, the State Registry increases the reference count of every delta it references (e.g. Delta-A: 2->3; new deltas such as Delta-E start at 1).
6. After the checkpoint is subsumed by a newer one, the reference counts of its deltas are decreased.
7. A delta is deleted from the persistent storage once its reference count reaches 0.
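A minimal sketch of the reference counting performed in steps 5-7, with hypothetical class and method names rather than the actual coordinator code:

```java
// Tracks how many completed checkpoints still reference each state delta and
// deletes a delta from persistent storage once its reference count reaches 0.
import java.util.HashMap;
import java.util.Map;

public class DeltaStateRegistry {

    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Step 5: a newly completed checkpoint references this delta. */
    public synchronized void register(String deltaId) {
        referenceCounts.merge(deltaId, 1, Integer::sum);
    }

    /** Steps 6-7: a subsumed checkpoint releases the delta. */
    public synchronized void unregister(String deltaId) {
        int remaining = referenceCounts.merge(deltaId, -1, Integer::sum);
        if (remaining <= 0) {
            referenceCounts.remove(deltaId);
            deleteFromPersistentStorage(deltaId);
        }
    }

    private void deleteFromPersistentStorage(String deltaId) {
        // Placeholder for removing the delta file from HDFS or another store.
        System.out.println("deleting " + deltaId);
    }
}
```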
Incremental Checkpoint - RocksDB State Backend Implementation
[Diagram: the RocksDB instance in the TaskManager for KeyGroups [0-m] stores its data as .sst files. Checkpoint chk-1 references 1.sst, 2.sst and 3.sst; checkpoint chk-2 references 2.sst, 3.sst and 4.sst, so only 4.sst has to be uploaded for chk-2. The files are stored under HDFS://user/JobID/OperatorID/KeyGroup_[0-m] as 1.sst.chk-1, 2.sst.chk-1, 3.sst.chk-1 and 4.sst.chk-2, and the JobManager keeps the state handles of completed checkpoints 1 and 2.]
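The core of the incremental upload decision can be illustrated with a small, hypothetical helper: only the sst files that the previous checkpoint did not already reference are materialized to HDFS.

```java
// Computes which RocksDB sst files are new relative to the previous checkpoint.
import java.util.HashSet;
import java.util.Set;

public class SstDeltaComputer {

    /** Files in the current snapshot that the previous checkpoint did not reference. */
    public static Set<String> filesToUpload(Set<String> currentSstFiles,
                                            Set<String> previouslyUploaded) {
        Set<String> delta = new HashSet<>(currentSstFiles);
        delta.removeAll(previouslyUploaded);
        return delta;
    }

    public static void main(String[] args) {
        Set<String> chk1 = Set.of("1.sst", "2.sst", "3.sst");
        Set<String> chk2 = Set.of("2.sst", "3.sst", "4.sst");
        // Prints [4.sst]: only the new file is uploaded for checkpoint 2.
        System.out.println(filesToUpload(chk2, chk1));
    }
}
```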
Asynchronous Operator
✓ Background
• a Flink task uses a single thread to process events
• Flink tasks sometimes need to access external services (HBase, Redis, …)
✓ Problem
• the high latency of external calls can block event processing
• throughput can be limited by latency
✓ Benefits of the asynchronous operator
• decouple the throughput of the task from the latency of the external system
• improve the CPU utilization of the Flink tasks
• simplify resource specification for Flink jobs
Asynchronous Operator – How It Works
[Diagram: the AsyncOperator (ordered/unordered) with its AsyncQueue, the user's AsyncFunction and async client, and an external system such as HBase, Cassandra or Redis]
1. processElement(InputRecord) is called on the AsyncOperator.
2. A StreamRecordQueueEntry for the input record is added to the AsyncQueue (ordered or unordered); the entry implements AsyncCollector and will later hold the OutputRecord.
3. asyncInvoke(InputRecord, StreamRecordQueueEntry) is called on the AsyncFunction.
4. The AsyncFunction issues an asynchronous request through the user's async client.
5. The client sends the request to the external system.
6. The client returns a ListenableFuture.
7. A listener is added to the future.
8. The asynchronous response arrives.
9. The listener is notified.
10. The listener calls StreamRecordQueueEntry.collect(OutputRecord).
11. The emit thread is notified.
12. The emit thread emits the OutputRecord downstream.
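As a concrete illustration, the sketch below enriches a stream of keys against an external key-value store using the async I/O API as it looked in open-source Apache Flink 1.2 (later versions replaced AsyncCollector with ResultFuture); InMemoryAsyncClient is a hypothetical placeholder for a real non-blocking client such as an async HBase or Redis client.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class EnrichWithExternalStore extends RichAsyncFunction<String, String> {

    /** Hypothetical stand-in for a real non-blocking client. */
    static final class InMemoryAsyncClient {
        CompletableFuture<String> get(String key) {
            return CompletableFuture.completedFuture("value-of-" + key);
        }
    }

    private transient InMemoryAsyncClient client;

    @Override
    public void open(Configuration parameters) {
        client = new InMemoryAsyncClient();
    }

    @Override
    public void asyncInvoke(String key, AsyncCollector<String> collector) {
        // Fire the request without blocking the task thread (steps 3-5 above).
        client.get(key).whenComplete((value, error) -> {
            if (error != null) {
                collector.collect(error);
            } else {
                // Completes the queue entry; the emit thread outputs it later (steps 10-12).
                collector.collect(Collections.singletonList(key + "," + value));
            }
        });
    }

    public static DataStream<String> enrich(DataStream<String> keys) {
        // Up to 100 in-flight requests, 1 second timeout, results in arrival order.
        return AsyncDataStream.unorderedWait(
                keys, new EnrichWithExternalStore(), 1000, TimeUnit.MILLISECONDS, 100);
    }
}
```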
Asynchronous Operator – How It Manages State
[Diagram: snapshot and recovery of the AsyncQueue in the Task]
Snapshot:
1. snapshotState is called on the AsyncOperator.
2. The AsyncQueue is scanned.
3. The uncompleted InputRecords are persisted to the state backend.
Recovery:
1. open is called on the AsyncOperator.
2. The uncompleted InputRecords are read back from the state backend.
3. asyncInvoke is called again for each recovered InputRecord.
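The sketch below captures the same idea in plain Java (illustrative only, not the operator's actual code): at snapshot time only the entries whose result has not been emitted yet are persisted, and after recovery each restored input is submitted to asyncInvoke again.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

public class AsyncQueueSketch<IN> {

    /** One in-flight element: the input plus a flag set once its result was emitted. */
    static final class Entry<T> {
        final T input;
        boolean completed;
        Entry(T input) { this.input = input; }
    }

    private final Queue<Entry<IN>> queue = new ArrayDeque<>();

    /** processElement: track the record until its async result has been emitted. */
    public Entry<IN> add(IN input) {
        Entry<IN> entry = new Entry<>(input);
        queue.add(entry);
        return entry;
    }

    /** snapshotState: scan the queue and persist only the uncompleted inputs. */
    public List<IN> snapshotUncompleted() {
        List<IN> pending = new ArrayList<>();
        for (Entry<IN> entry : queue) {
            if (!entry.completed) {
                pending.add(entry.input);
            }
        }
        return pending;
    }

    /** open after recovery: re-issue asyncInvoke for every restored input. */
    public void recover(List<IN> restoredInputs, Consumer<IN> asyncInvoke) {
        for (IN input : restoredInputs) {
            add(input);
            asyncInvoke.accept(input);
        }
    }
}
```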
Fine-grained Recovery from Task Failures
✓ Status of batch jobs
• large scale, up to thousands of nodes
• node failures are common in large clusters
• non-pipelined mode is preferred due to limited resources
✓ Problems
• one task failure requires restarting the entire execution graph
• this is especially critical for batch jobs
✓ Benefit
• make recovery more efficient by restarting only what needs to be restarted
Fine-grained Recovery from Task Failures – Restart-all Strategy
[Diagram: an ExecutionGraph with subtasks A1..F1 and A2..F2 connected by pipelined and cached results. With the restart-all FailoverStrategy, a single task-level failure (C1 fails) is turned into job-level fail/cancel commands, so every task in the graph is cancelled and restarted.]
Fine-grained Recovery from Task Failures – Region-based Strategy
[Diagram: the same ExecutionGraph split by the region-based FailoverStrategy into failover regions along cached-result boundaries: {A1, B1}, {C1, D1, E1, F1}, {A2, B2} and {C2, D2, E2, F2}. On a task failure (C1 fails), only the region {C1, D1, E1, F1} is restarted; on a data failure (the intermediate result produced by B1 is corrupt), the producing region {A1, B1} is restarted as well.]
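The following is a hedged sketch of the idea behind computing failover regions (a hypothetical helper, not the real scheduler code): vertices connected by pipelined results must fail over together and are merged into one region with a union-find structure, while cached (blocking) results form the region boundaries.

```java
import java.util.HashMap;
import java.util.Map;

public class FailoverRegions {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String vertex) {
        parent.putIfAbsent(vertex, vertex);
        String root = parent.get(vertex);
        if (!root.equals(vertex)) {
            root = find(root);
            parent.put(vertex, root);   // path compression
        }
        return root;
    }

    /** Call for every pipelined edge; cached/blocking edges are not merged. */
    public void connectPipelined(String producer, String consumer) {
        parent.put(find(producer), find(consumer));
    }

    /** Two vertices share a failover region iff they have the same root. */
    public String regionOf(String vertex) {
        return find(vertex);
    }

    public static void main(String[] args) {
        FailoverRegions regions = new FailoverRegions();
        regions.connectPipelined("C1", "D1");   // pipelined results within one region
        regions.connectPipelined("C1", "E1");
        regions.connectPipelined("D1", "F1");
        // B1 -> C1 is a cached result, so B1 stays in a different region.
        System.out.println(regions.regionOf("C1").equals(regions.regionOf("F1"))); // true
        System.out.println(regions.regionOf("B1").equals(regions.regionOf("C1"))); // false
    }
}
```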
Allocation Reuse for Task Recovery
✓ Background
• the job state can be very big at Alibaba, hence RocksDB is used as the state backend
• state restore with the RocksDB backend involves copying data from HDFS
✓ Problem
• it is expensive to restore state from HDFS during task recovery
✓ Benefits of allocation reuse
• deploy the restarted task in its previous allocation to speed up recovery
• restore state from the local RocksDB instance to avoid copying data from HDFS
Allocation Reuse for Task Recovery – How It Works
[Diagram: inside the JobManager, an ExecutionVertex keeps a list of its prior executions. When execution attempt 1 fails, its slot is returned to the SlotPool (1. return allocated slot, 2. remove it from the allocated slots, 3. add it to the available slots, which are kept in a Map<TMLocation, Set<AllocatedSlot>>). The failed execution is added to the prior-execution list (4), the new execution attempt 2 reads its preferred location from that list (5) and allocates a slot with (TMLocation, ResourceProfile) (6); the SlotPool matches the request against the available slots on that TaskManager and returns the matching slot (7).]
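A simplified sketch of the SlotPool lookup above (with hypothetical names): available slots are indexed by TaskManager location, so a recovering task can be matched to a slot on the machine that still holds its local RocksDB files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class SlotPoolSketch {

    static final class AllocatedSlot {
        final String tmLocation;
        AllocatedSlot(String tmLocation) { this.tmLocation = tmLocation; }
    }

    // The Map<TMLocation, Set<AllocatedSlot>> from the diagram.
    private final Map<String, Set<AllocatedSlot>> availableSlots = new HashMap<>();

    /** Steps 1-3: a slot released by a failed execution becomes available again. */
    public void returnSlot(AllocatedSlot slot) {
        availableSlots.computeIfAbsent(slot.tmLocation, location -> new HashSet<>()).add(slot);
    }

    /** Steps 5-7: prefer a slot on the task's previous TaskManager, if one is free. */
    public AllocatedSlot allocate(String preferredLocation) {
        Set<AllocatedSlot> slotsOnPreferredTm = availableSlots.get(preferredLocation);
        if (slotsOnPreferredTm != null && !slotsOnPreferredTm.isEmpty()) {
            Iterator<AllocatedSlot> it = slotsOnPreferredTm.iterator();
            AllocatedSlot slot = it.next();
            it.remove();
            return slot;    // local state restore, no HDFS download needed
        }
        return null;        // no local slot: fall back to allocating elsewhere
    }
}
```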
Non-disruptive JobManager Failures via Reconciliation
✓ Background
• the job runs at large scale, on thousands of nodes
• the job state can be very big, at the TB level
✓ Problems
• all the tasks are restarted when the JobManager fails
• the cost of recovering the states is high
✓ Benefit
• improve job stability through automatic reconciliation, avoiding task restarts
Non-disruptive JobManager Failures via Reconciliation – How It Works
[Diagram: after a leader change, the new JobManagerRunner checks the RunningJobsRegistry. If isJobRunning = true, the ExecutionGraph is created in the RECONCILING state; the TaskManagers register with the new JobManager, offer their slots and report the status of their running tasks, with a timer bounding the reconciliation. Once reconciliation finishes, the ExecutionGraph transitions to RUNNING (or FINISHED/FAILING) without restarting the tasks. If isJobRunning = false, the JobManager simply schedules the job.]
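A high-level sketch of the decision the new JobManager makes after a leader change, as described above; all names are illustrative, not the actual runtime classes.

```java
public class ReconciliationSketch {

    enum JobStatus { RECONCILING, RUNNING, FAILING }

    /** Records whether the job was already running before the leader change. */
    interface RunningJobsRegistry {
        boolean isJobRunning(String jobId);
    }

    static JobStatus onLeaderChange(String jobId,
                                    RunningJobsRegistry registry,
                                    boolean allTasksReconciledBeforeTimeout) {
        if (!registry.isJobRunning(jobId)) {
            // The job was not running yet: schedule it normally.
            return JobStatus.RUNNING;
        }
        // The job was already running: enter RECONCILING and wait for the
        // TaskManagers to register, offer slots and report task status.
        if (allTasksReconciledBeforeTimeout) {
            return JobStatus.RUNNING;   // adopt the running tasks, no restart
        }
        return JobStatus.FAILING;       // reconciliation failed, fall back to restart
    }
}
```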
Future Plans
Section 3
Future Plans
✓ Blink is already popular in streaming scenarios
• more and more streaming applications will run on Blink
✓ Make batch applications run in production
• increase the resource utilization of the clusters
✓ Blink as a service
• Alibaba Group wide
✓ The cluster is growing very fast
• the cluster size will double
• thousands of jobs will run in production
Thanks
We are Hiring!
aliserach-recruiting@list.alibaba-inc.com
