Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015

Fault Tolerance and Job
Recovery in Apache Flink™
Till Rohrmann
trohrmann@apache.org
@stsffap

Better be safe than sorry
§  Failures will happen
§  EMC estimated costs of $1.7 billion due to
data loss and system downtime
§  Recovery will save you time and costs
§  Switch between algorithms
§  Live upgrade of your system

Fault tolerance guarantees
§  At most once
•  No guarantees at all
§  At least once
•  Sufficient for many applications
§  Exactly once
§  Flink provides all guarantees
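
The difference between the three guarantees can be illustrated with a toy model. The sketch below is not Flink code: the class `GuaranteeDemo`, its `Mode` enum and the `run` method are hypothetical names, and the failure model (one failure, one checkpoint, a summing operator) is a simplifying assumption chosen only to make the effects visible.

```java
import java.util.Arrays;
import java.util.List;

// Toy model (not the Flink API): a summing operator that fails once.
class GuaranteeDemo {
    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    // Sums the records. A checkpoint is taken after `checkpointAt` records,
    // and a single failure strikes once `failAt` records have been processed.
    static long run(Mode mode, List<Long> records, int checkpointAt, int failAt) {
        long sum = 0;             // operator state
        long checkpointedSum = 0; // state stored in the last checkpoint
        boolean failed = false;
        int pos = 0;              // read position in the source
        while (pos < records.size()) {
            if (!failed && pos == failAt) {
                failed = true;
                switch (mode) {
                    case AT_MOST_ONCE:  // state is lost, nothing is replayed
                        sum = 0;
                        break;
                    case AT_LEAST_ONCE: // source is rewound, state is not rolled back
                        pos = checkpointAt;
                        break;
                    case EXACTLY_ONCE:  // state and source are both reset to the checkpoint
                        sum = checkpointedSum;
                        pos = checkpointAt;
                        break;
                }
                continue;
            }
            sum += records.get(pos);
            pos++;
            if (pos == checkpointAt) {
                checkpointedSum = sum;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> records = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + run(mode, records, 2, 4));
        }
    }
}
```

With records 1 to 6, a checkpoint after two records and a failure after four, the model yields 11 (records lost), 28 (records 3 and 4 counted twice) and 21 (the correct sum) for the three modes.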

Checkpoints
§  Consistent snapshots of distributed data
stream and operator state

Barriers
§  Markers for checkpoints
§  Injected in the data flow
§  Alignment for multi-input operators
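
Barrier alignment for a two-input operator can be sketched as follows. This is a simplified simulation, not Flink's implementation: the class `BarrierAlignmentDemo`, the `BARRIER` sentinel and the round-robin reading order are assumptions made for illustration. A channel that has delivered its barrier is blocked until the barrier has arrived on every channel; only then is the state snapshotted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model (not the Flink API): a two-input summing operator with alignment.
class BarrierAlignmentDemo {
    static final long BARRIER = Long.MIN_VALUE; // sentinel standing in for a checkpoint barrier

    // Returns {state at snapshot time, final state}.
    static long[] run(List<Long> inputA, List<Long> inputB) {
        List<Deque<Long>> channels = new ArrayList<>();
        channels.add(new ArrayDeque<>(inputA));
        channels.add(new ArrayDeque<>(inputB));
        boolean[] blocked = new boolean[channels.size()];
        int barriersSeen = 0;
        long sum = 0;       // operator state
        long snapshot = -1; // stays -1 if no checkpoint completes
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (int i = 0; i < channels.size(); i++) {
                Deque<Long> channel = channels.get(i);
                if (blocked[i] || channel.isEmpty()) {
                    continue; // blocked channels keep their records queued
                }
                progressed = true;
                long event = channel.poll();
                if (event != BARRIER) {
                    sum += event;
                    continue;
                }
                blocked[i] = true; // wait for the barrier on the other channel
                barriersSeen++;
                if (barriersSeen == channels.size()) {
                    snapshot = sum;              // all barriers arrived: snapshot the state
                    Arrays.fill(blocked, false); // resume reading from all channels
                    barriersSeen = 0;
                }
            }
        }
        return new long[] {snapshot, sum};
    }

    public static void main(String[] args) {
        long[] result = run(Arrays.asList(1L, 2L, 4L, BARRIER, 10L),
                            Arrays.asList(3L, BARRIER, 20L));
        System.out.println("snapshot=" + result[0] + ", final=" + result[1]);
    }
}
```

In the example run, the record 20 on the second input is held back until the first input delivers its barrier, so the snapshot contains only pre-barrier records from both inputs (1 + 2 + 4 + 3 = 10).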

Operator State
§  Stateless operators
ds.filter(_ != 0)
§  System state
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
§  User-defined state
public class CounterSum extends RichReduceFunction<Long> {
    // User-defined state: counts how often reduce() has been called
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        // Register the state under the name "counter" with default value 0
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}
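
What checkpointing does with such user-defined state can be sketched without any Flink dependency. The class `CountingReducer` and its `snapshotState`/`restoreState` methods below are hypothetical stand-ins, assumed only for illustration: the state is copied out at checkpoint time and written back into a fresh instance after a failure.

```java
// Toy model (not the Flink API): a reducer with a call counter as its state.
class CountingReducer {
    private long counter; // user-defined state: number of reduce() calls

    long reduce(long v1, long v2) {
        counter++;
        return v1 + v2;
    }

    long snapshotState() {             // called when a checkpoint is taken
        return counter;
    }

    void restoreState(long snapshot) { // called on recovery
        counter = snapshot;
    }

    long calls() {
        return counter;
    }

    public static void main(String[] args) {
        CountingReducer reducer = new CountingReducer();
        long acc = 0;
        for (long v = 1; v <= 3; v++) {
            acc = reducer.reduce(acc, v);
        }
        long checkpoint = reducer.snapshotState();
        reducer.reduce(acc, 4); // progress made after the checkpoint

        // Simulated failure: a new instance starts from the checkpointed state.
        CountingReducer recovered = new CountingReducer();
        recovered.restoreState(checkpoint);
        System.out.println("calls after recovery: " + recovered.calls());
    }
}
```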

Advantages
§  Separation of app logic from recovery
•  Checkpointing interval is just a config
parameter
§  High throughput
•  Controllable checkpointing overhead
§  Low impact on latency

Without high availability
[Figure: JobManager, TaskManager]

With high availability
[Figure: JobManager, TaskManager, stand-by JobManager, Apache ZooKeeper™; caption "KEEP GOING"]

Persisting jobs
[Figure: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1.  Submit job
2.  Persist execution graph
3.  Write handle to ZooKeeper
4.  Deploy tasks
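
The four steps above can be sketched as a small simulation. This is not Flink's implementation: the class `JobPersistenceDemo` is hypothetical, plain maps stand in for durable storage and ZooKeeper, and the handle format is an assumption. The point is that only a small handle goes to ZooKeeper, which lets a newly elected JobManager find the persisted execution graph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (not the Flink API) of job submission with persistence.
class JobPersistenceDemo {
    final Map<String, String> durableStorage = new HashMap<>(); // stand-in for a durable store
    final Map<String, String> zooKeeper = new HashMap<>();      // stand-in for ZooKeeper
    final List<String> deployedJobs = new ArrayList<>();

    // Step 1: the client submits the job to the JobManager.
    void submitJob(String jobId, String executionGraph) {
        String handle = "graphs/" + jobId;          // hypothetical handle format
        durableStorage.put(handle, executionGraph); // step 2: persist execution graph
        zooKeeper.put(jobId, handle);               // step 3: write handle to ZooKeeper
        deployedJobs.add(jobId);                    // step 4: deploy tasks
    }

    // What a stand-by JobManager would do after taking over leadership.
    String recoverJob(String jobId) {
        String handle = zooKeeper.get(jobId);
        return handle == null ? null : durableStorage.get(handle);
    }

    public static void main(String[] args) {
        JobPersistenceDemo cluster = new JobPersistenceDemo();
        cluster.submitJob("job-1", "source -> map -> sink");
        System.out.println("recovered graph: " + cluster.recoverJob("job-1"));
    }
}
```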

Handling checkpoints
[Figure: Client, JobManager, TaskManagers, Apache ZooKeeper™]
1.  Take snapshots
2.  Persist snapshots
3.  Send handles to JM
4.  Create global checkpoint
5.  Persist global checkpoint
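
The checkpoint steps above can be sketched in the same spirit. The class `CheckpointCoordinatorDemo` below is a hypothetical, simplified stand-in for the JobManager side, assuming a single pending checkpoint at a time: it collects one snapshot handle per task and creates and persists the global checkpoint only once every expected task has reported.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not the Flink API) of the JobManager collecting snapshot handles.
class CheckpointCoordinatorDemo {
    final Set<String> expectedTasks;
    final Map<String, String> pendingHandles = new HashMap<>();
    // Stand-in for the persisted global checkpoints, keyed by checkpoint id.
    final Map<Long, Map<String, String>> persistedCheckpoints = new HashMap<>();

    CheckpointCoordinatorDemo(String... tasks) {
        expectedTasks = new HashSet<>(Arrays.asList(tasks));
    }

    // Step 3: a task sends the handle of its persisted snapshot to the JM.
    // Returns true once the global checkpoint is complete.
    boolean acknowledge(long checkpointId, String task, String handle) {
        if (!expectedTasks.contains(task)) {
            return false; // ignore acknowledgements from unknown tasks
        }
        pendingHandles.put(task, handle);
        if (!pendingHandles.keySet().containsAll(expectedTasks)) {
            return false; // still waiting for other tasks
        }
        // Steps 4 and 5: create the global checkpoint and persist it.
        persistedCheckpoints.put(checkpointId, new HashMap<>(pendingHandles));
        pendingHandles.clear();
        return true;
    }

    public static void main(String[] args) {
        CheckpointCoordinatorDemo jm = new CheckpointCoordinatorDemo("task-1", "task-2");
        System.out.println(jm.acknowledge(1L, "task-1", "snapshots/1/task-1"));
        System.out.println(jm.acknowledge(1L, "task-2", "snapshots/1/task-2"));
        System.out.println(jm.persistedCheckpoints);
    }
}
```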

TL;DR
§  Job recovery mechanism with low latency
and high throughput
§  Exactly-once processing semantics
§  No single point of failure
→ Flink will always keep processing
your data

flink.apache.org
@ApacheFlink
