Fault Tolerance and Job Recovery in Apache Flink™
Till Rohrmann
trohrmann@apache.org
@stsffap
Better be safe than sorry
§  Failures will happen
§  EMC estimated $1.7 billion in costs due to
data loss and system downtime
§  Recovery will save you time and costs
§  Switch between algorithms
§  Live upgrade of your system
Fault Tolerance
Fault tolerance guarantees
§  At most once
•  No guarantees at all
§  At least once
•  Sufficient for many
applications
§  Exactly once
§  Flink provides all guarantees
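The difference between at-least-once and exactly-once can be illustrated with a toy model in plain Java. This is a sketch under stated assumptions, not Flink code: the class `DeliveryGuaranteeSketch`, the method `sumWithRecovery`, and the parameters `offset` and `extra` are all hypothetical. The model assumes that recovery restores a state snapshot and replays the source from the checkpointed position; if the snapshot already contains the effect of records after that position, those records are counted twice.

```java
/** Toy model (illustration only): a running sum that fails once and recovers from a checkpoint. */
class DeliveryGuaranteeSketch {

    /**
     * @param records the full input stream
     * @param offset  source position stored in the checkpoint; replay starts here
     * @param extra   number of records after that position whose effect is already
     *                contained in the snapshot (0 means the snapshot is consistent);
     *                offset + extra must not exceed records.length
     */
    static long sumWithRecovery(long[] records, int offset, int extra) {
        // State as captured by the snapshot.
        long snapshot = 0;
        for (int i = 0; i < offset + extra; i++) {
            snapshot += records[i];
        }
        // Recovery: restore the snapshot, then replay everything from the checkpointed offset.
        long sum = snapshot;
        for (int i = offset; i < records.length; i++) {
            sum += records[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] records = {1, 2, 3, 4, 5, 6};
        // Consistent snapshot: every record contributes exactly once.
        System.out.println("exactly once:  " + sumWithRecovery(records, 3, 0)); // 21
        // Inconsistent snapshot: records 4 and 5 contribute twice.
        System.out.println("at least once: " + sumWithRecovery(records, 3, 2)); // 30
    }
}
```

With a consistent snapshot the result equals the true sum of the input; with an inconsistent one no record is lost, but some are counted more than once.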
Checkpoints
§  Consistent snapshots of distributed data
stream and operator state
Barriers
§  Markers for checkpoints
§  Injected in the data flow
§  Alignment for multi-input operators
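Barrier alignment for a multi-input operator can be sketched in plain Java as follows. This is a minimal illustration, not Flink's implementation: the class `BarrierAlignmentSketch` and its methods are hypothetical, records are plain strings, and the "snapshot" is only a log entry. The idea shown is that once a barrier has arrived on one input, records from that input are held back until the barrier has arrived on every input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical multi-input operator that aligns checkpoint barriers (illustration only). */
class BarrierAlignmentSketch {

    private final int numInputs;
    private final Set<Integer> blockedInputs = new HashSet<>(); // inputs whose barrier has arrived
    private final Queue<String> buffered = new ArrayDeque<>();  // records held back during alignment
    private final List<String> log = new ArrayList<>();         // what the operator did, in order

    BarrierAlignmentSketch(int numInputs) {
        this.numInputs = numInputs;
    }

    /** A regular record arrives on the given input. */
    void onRecord(int input, String record) {
        if (blockedInputs.contains(input)) {
            buffered.add(record);         // arrived after the barrier: hold it back
        } else {
            log.add("process " + record); // arrived before the barrier: process right away
        }
    }

    /** A checkpoint barrier arrives on the given input. */
    void onBarrier(int input) {
        blockedInputs.add(input);
        if (blockedInputs.size() == numInputs) {
            log.add("snapshot");          // state now reflects exactly the pre-barrier records
            blockedInputs.clear();
            while (!buffered.isEmpty()) {
                log.add("process " + buffered.poll());
            }
        }
    }

    /** Runs a small two-input scenario and returns the resulting action log. */
    static List<String> demo() {
        BarrierAlignmentSketch op = new BarrierAlignmentSketch(2);
        op.onRecord(0, "a1");
        op.onBarrier(0);      // input 0 is blocked from here on
        op.onRecord(0, "a2"); // buffered
        op.onRecord(1, "b1"); // input 1 has not seen its barrier yet: processed
        op.onBarrier(1);      // alignment complete: snapshot, then drain the buffer
        return op.log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [process a1, process b1, snapshot, process a2]
    }
}
```

Record `a2` is processed only after the snapshot, so the snapshot contains no effect of any post-barrier record.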
Operator State
§  Stateless operators
§  System state
§  User defined state
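// Stateless operator
ds.filter(_ != 0)
// System state (window contents)
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
// User defined state
import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.OperatorState;
import org.apache.flink.configuration.Configuration;

public class CounterSum extends RichReduceFunction<Long> {
    // Counts how often reduce() has been called; included in checkpoints
    private OperatorState<Long> counter;

    @Override public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override public void open(Configuration config) throws Exception {
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}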
Advantages
§  Separation of app logic from recovery
•  Checkpointing interval is just a config
parameter
§  High throughput
•  Controllable checkpointing overhead
§  Low impact on latency
Cluster High Availability
Without high availability
[Diagram: JobManager, TaskManager]
With high availability
[Diagram: JobManager, TaskManager, stand-by JobManager, Apache ZooKeeper™; caption: "KEEP GOING"]
Persisting jobs
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1.  Submit job
2.  Persist execution graph
3.  Write handle to ZooKeeper
4.  Deploy tasks
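The steps above can be sketched in plain Java. This is an illustration under assumptions, not Flink code: the class `JobPersistenceSketch`, the handle format, and the ZooKeeper paths are made up, and two in-memory maps stand in for durable storage and ZooKeeper. The sketch also assumes that a newly elected JobManager recovers a job by following the handle stored in ZooKeeper back to the persisted execution graph.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: in-memory maps stand in for durable storage and ZooKeeper. */
class JobPersistenceSketch {

    // Hypothetical stand-in for durable storage (for example a distributed file system).
    private final Map<String, String> durableStorage = new HashMap<>();
    // Hypothetical stand-in for ZooKeeper: it holds only small handles, not the graphs.
    private final Map<String, String> zooKeeper = new HashMap<>();

    /** Steps 2 and 3, run by the active JobManager after the client submitted the job (step 1). */
    void persistJob(String jobId, String executionGraph) {
        String handle = "storage://jobs/" + jobId;  // made-up handle format
        durableStorage.put(handle, executionGraph); // 2. persist execution graph
        zooKeeper.put("/jobs/" + jobId, handle);    // 3. write handle to ZooKeeper
        // 4. deploying the tasks to the TaskManagers would follow here
    }

    /** What a newly elected JobManager could do after a failover (assumed behavior). */
    String recoverJob(String jobId) {
        String handle = zooKeeper.get("/jobs/" + jobId);
        return handle == null ? null : durableStorage.get(handle);
    }

    public static void main(String[] args) {
        JobPersistenceSketch cluster = new JobPersistenceSketch();
        cluster.persistJob("job-42", "source -> map -> sink");
        System.out.println(cluster.recoverJob("job-42")); // source -> map -> sink
    }
}
```

Keeping only the handle in ZooKeeper keeps the coordination service small, while the potentially large execution graph lives in storage built for bulk data.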
Handling checkpoints
[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1.  Take snapshots
2.  Persist snapshots
3.  Send handles to JM
4.  Create global checkpoint
5.  Persist global checkpoint
6.  Write handle to ZooKeeper
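The six steps above can be sketched in plain Java as well. Again this is a hedged illustration rather than Flink's implementation: the class `CheckpointHandlingSketch`, the handle format, and the path `/checkpoints/latest` are hypothetical, in-memory maps stand in for durable storage and ZooKeeper, and only one checkpoint is in flight at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: a JobManager collecting task snapshots into one global checkpoint. */
class CheckpointHandlingSketch {

    private final int numTasks;
    private final Map<String, String> durableStorage = new HashMap<>(); // stand-in for durable storage
    private final Map<String, String> zooKeeper = new HashMap<>();      // stand-in for ZooKeeper
    private final Map<String, String> pendingHandles = new TreeMap<>(); // task name -> snapshot handle

    CheckpointHandlingSketch(int numTasks) {
        this.numTasks = numTasks;
    }

    /** Steps 1 to 6 for one task snapshot; returns true once the global checkpoint is complete. */
    boolean onTaskSnapshot(long checkpointId, String task, String state) {
        String handle = "storage://chk-" + checkpointId + "/" + task; // made-up handle format
        durableStorage.put(handle, state);  // 1.+2. take and persist the snapshot
        pendingHandles.put(task, handle);   // 3. send the handle to the JobManager
        if (pendingHandles.size() < numTasks) {
            return false;                   // still waiting for other tasks
        }
        String globalCheckpoint = pendingHandles.toString();          // 4. create global checkpoint
        String globalHandle = "storage://chk-" + checkpointId + "/global";
        durableStorage.put(globalHandle, globalCheckpoint);           // 5. persist global checkpoint
        zooKeeper.put("/checkpoints/latest", globalHandle);           // 6. write handle to ZooKeeper
        pendingHandles.clear();
        return true;
    }

    /** Handle of the latest completed global checkpoint, or null if there is none yet. */
    String latestCheckpointHandle() {
        return zooKeeper.get("/checkpoints/latest");
    }

    /** Reads whatever was persisted under the given handle. */
    String read(String handle) {
        return durableStorage.get(handle);
    }

    public static void main(String[] args) {
        CheckpointHandlingSketch jm = new CheckpointHandlingSketch(2);
        jm.onTaskSnapshot(1, "source", "offset=3");
        jm.onTaskSnapshot(1, "sum", "sum=6");
        System.out.println(jm.read(jm.latestCheckpointHandle()));
    }
}
```

The handle reaches ZooKeeper only after every task has acknowledged, so a stand-by JobManager never sees a partially completed checkpoint.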
Conclusion
TL;DR
§  Job recovery mechanism with low latency
and high throughput
§  Exactly-once processing semantics
§  No single point of failure
→ Flink will always keep processing
your data
flink.apache.org
@ApacheFlink

Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015