This talk starts with a brief introduction to stream processing and to Flink itself. Next, we take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing,
the end-to-end exactly-once processing guarantee, and network latency optimizations. We'll discuss real problems that Flink's users were facing and how they were addressed by the community and data Artisans.
8. Stateful Stream Processing
    var sum = 0
    def map(element):
        sum += element
        return sum
[Figure: parallel processes, each running "Your Code"]
● Embedded local state backend
● State co-partitioned with the input stream by key
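To make the co-partitioning idea concrete, here is a minimal sketch in plain Python (not the Flink API). It assumes a hypothetical parallelism of two and a toy hash function; `partition_for`, `SumOperator`, and `process` are illustrative names, not anything from Flink. Each parallel instance keeps a running sum only for the keys routed to it.

```python
from collections import defaultdict

NUM_PARTITIONS = 2  # hypothetical parallelism, chosen for illustration


def partition_for(key: str) -> int:
    """Route a key to a partition; the same key always lands on the same one."""
    # A deterministic toy hash (sum of code points) keeps the example reproducible.
    return sum(ord(c) for c in key) % NUM_PARTITIONS


class SumOperator:
    """One parallel instance holding only the state of the keys routed to it."""

    def __init__(self):
        self.state = defaultdict(int)  # embedded local state: key -> running sum

    def map(self, key, value):
        self.state[key] += value
        return key, self.state[key]


operators = [SumOperator() for _ in range(NUM_PARTITIONS)]


def process(key, value):
    # State access is always local: the record is sent to where its state lives.
    return operators[partition_for(key)].map(key, value)


events = [("a", 1), ("b", 10), ("a", 2), ("b", 5)]
results = [process(k, v) for k, v in events]
```

Because records and state are partitioned by the same key, no instance ever needs to read another instance's state.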
9. State fault tolerance
Fault tolerance concerns for a stateful stream processor:
● No guarantees vs. at-least-once vs. exactly-once
● How to ensure exactly-once semantics for the state?
● How to create consistent snapshots of distributed state?
● More importantly, how to do it efficiently without interrupting computation?
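The difference between the guarantees can be illustrated with a toy simulation in plain Python. This is a sketch under simplifying assumptions: a single operator, a replayable in-memory source, and a hypothetical `run` helper. A checkpoint stores the input position together with a copy of the state. Restoring both gives exactly-once state; replaying input against state that was not rolled back counts some records twice.

```python
import copy

log = [1, 2, 3, 4, 5]  # replayable source, standing in for a durable log


def run(log, start_offset, state, fail_at=None, checkpoint_at=None):
    """Process log[start_offset:]; return (state, checkpoint or None, finished)."""
    checkpoint = None
    for offset in range(start_offset, len(log)):
        if offset == checkpoint_at:
            # Snapshot the state and the input position together.
            checkpoint = {"offset": offset, "state": copy.deepcopy(state)}
        if offset == fail_at:
            return state, checkpoint, False  # simulated crash
        state["sum"] += log[offset]
    return state, checkpoint, True


# First attempt: checkpoint before offset 2, crash before offset 4.
state_at_crash, ckpt, finished = run(log, 0, {"sum": 0}, fail_at=4, checkpoint_at=2)

# Exactly-once state: roll back BOTH the state and the input position.
exactly_once, _, _ = run(log, ckpt["offset"], copy.deepcopy(ckpt["state"]))

# At-least-once effect: input is replayed, but the state was not rolled back.
at_least_once, _, _ = run(log, ckpt["offset"], dict(state_at_crash))
```

In this run the exactly-once result equals `sum(log)`, while the at-least-once result includes the records at offsets 2 and 3 twice.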
14. Apache Flink Stack
● Libraries
● DataStream API: stream processing
● DataSet API: batch processing
● Runtime: distributed streaming data flow
Streaming and batch as first-class citizens.
27. Exactly-once
● End-to-end exactly-once requires transactional write support from
external systems
● Kafka has supported transactional writes only since version 0.11 (released in
mid-2017)
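Why transactional writes matter can be sketched with a hypothetical sink in plain Python. `TransactionalSink` and its methods are illustrative names, not a Flink or Kafka API. Writes stay invisible until the checkpoint that covers them completes, and they are discarded on recovery, so replayed records do not show up twice downstream.

```python
class TransactionalSink:
    """Hypothetical sink that buffers writes in an open transaction."""

    def __init__(self):
        self.committed = []  # visible to downstream readers
        self.pending = []    # writes of the currently open transaction

    def write(self, record):
        self.pending.append(record)

    def commit(self):
        # Called once the checkpoint covering these writes has completed.
        self.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        # Called on recovery: drop everything written after the last checkpoint.
        self.pending = []


sink = TransactionalSink()
plain_output = []  # a non-transactional sink, for comparison

for record in (1, 2):
    sink.write(record)
    plain_output.append(record)
sink.commit()  # checkpoint completes after records 1 and 2

sink.write(3)
plain_output.append(3)
sink.abort()   # failure before the next checkpoint; the source rewinds to record 3

for record in (3, 4):  # replay from the last checkpoint
    sink.write(record)
    plain_output.append(record)
sink.commit()
```

The transactional sink ends with each record exactly once, while the plain output contains record 3 twice.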
32. Flink State and Distributed Snapshots
"Asynchronous Snapshotting"
[Figure: Source and Stateful Operation]
● Take state snapshot: synchronously trigger the state snapshot (e.g.
copy-on-write)
34. Asynchronous Checkpoints
● Minimize pipeline stall time while taking the snapshot
● Keep overhead (memory, CPU,…) as low as possible while writing
the snapshot
● Support multiple parallel checkpoints
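The first goal above can be sketched in plain Python: a short synchronous step captures the state, then a background thread persists it while processing continues. This is a simplified illustration, not Flink's implementation. An eager `dict` copy stands in for a real copy-on-write snapshot, and the `durable` dictionary stands in for a distributed file system.

```python
import json
import threading

state = {"a": 1, "b": 2}
durable = {}  # stand-in for durable storage such as a DFS


def write_snapshot(checkpoint_id, snapshot):
    # Asynchronous part: serialization and I/O happen off the processing thread.
    durable[checkpoint_id] = json.dumps(snapshot, sort_keys=True)


def trigger_checkpoint(checkpoint_id):
    # Synchronous part: the only moment the pipeline is paused.
    snapshot = dict(state)
    writer = threading.Thread(target=write_snapshot, args=(checkpoint_id, snapshot))
    writer.start()
    return writer


writer = trigger_checkpoint(1)
state["a"] += 10  # processing continues while the snapshot is being written
writer.join()
```

The persisted snapshot reflects the state at the moment the checkpoint was triggered, not the later update.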
35. Incremental checkpointing
● Large state with small changes over time
● Efficiently detect the (minimal) set of state changes between two
checkpoints
○ copy on write
● Persist only the difference
○ snapshots are faster to take
○ slower recovery (replaying differences)
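The trade-off in the bullets above can be shown with a small plain-Python sketch. It assumes change tracking through a set of dirty keys, which is a simplification chosen for illustration and not how Flink's state backends detect changes. Deletions are ignored, and `IncrementalState` and `restore` are hypothetical names.

```python
class IncrementalState:
    """Key-value state that remembers which keys changed since the last checkpoint."""

    def __init__(self):
        self.data = {}
        self.dirty = set()

    def put(self, key, value):
        self.data[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        # Persist only the difference since the previous checkpoint.
        delta = {key: self.data[key] for key in self.dirty}
        self.dirty.clear()
        return delta


def restore(deltas):
    # Recovery replays every delta in order, which is why it is slower.
    data = {}
    for delta in deltas:
        data.update(delta)
    return data


state = IncrementalState()
state.put("a", 1)
state.put("b", 2)
first = state.checkpoint()   # full content: nothing was persisted before

state.put("a", 5)
second = state.checkpoint()  # only the changed key

recovered = restore([first, second])
```

The second checkpoint contains a single entry even though the state holds two keys, and recovery has to apply both deltas to rebuild the full state.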
36. Local recovery - checkpointing
[Figure: Source, Stateful Operation, copy-on-write snapshots, DFS. Checkpointing state of the operators.]
38. Local recovery - restoring
[Figure: Source, Stateful Operation, DFS. Restoring state of the operators.]
● Recover from DFS by default
● Recover from local state if possible
39. Local recovery
● Higher cost of asynchronous snapshotting
○ If the system is not overloaded, latency and throughput are not affected
● Faster recovery
○ Shorter downtime
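The restore preference behind local recovery can be sketched in plain Python. This is a simplified illustration, not Flink's implementation: two dictionaries stand in for the local disk and the distributed file system, and `checkpoint` and `restore` are hypothetical helpers. Writing a second copy is the extra snapshotting cost; reading it back locally is what shortens recovery.

```python
dfs = {}         # stand-in for the distributed file system (always written)
local_disk = {}  # stand-in for the task's local disk (extra copy)


def checkpoint(task, checkpoint_id, state):
    # The extra local write is the added cost of asynchronous snapshotting.
    dfs[(task, checkpoint_id)] = dict(state)
    local_disk[(task, checkpoint_id)] = dict(state)


def restore(task, checkpoint_id):
    """Return (state, origin), preferring the local copy when it still exists."""
    key = (task, checkpoint_id)
    if key in local_disk:
        return local_disk[key], "local"
    return dfs[key], "dfs"


checkpoint("map-0", 7, {"sum": 42})

state_after_restart, origin_after_restart = restore("map-0", 7)

local_disk.clear()  # simulate losing the machine and its local disk
state_after_loss, origin_after_loss = restore("map-0", 7)
```

Both paths return the same state; only the origin, and therefore the recovery time, differs.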
41. Challenges in Streaming
● Streaming is very different from batching and micro-batching
● Providing high throughput with low latency is difficult
45. TL;DR
● Stateful stream processing as a paradigm for continuous
data
● Apache Flink is a sophisticated and battle-tested
stateful stream processor with a comprehensive set of
features
● Efficiency, management, and operational issues for state
are taken very seriously