State Management in Apache Flink: Consistent Stateful Distributed Stream Processing

An overview of the state management techniques employed in Apache Flink, including pipelined consistent snapshots and their intuitive uses for reconfiguration, as presented at VLDB 2017.


  1. State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing (@vldb17). Paris Carbone <parisc@kth.se>, KTH Royal Institute of Technology; Stephan Ewen <stephan@data-artisans.com>, data Artisans; Gyula Fóra <gyula.fora@king.com>, King Digital Entertainment Ltd; Seif Haridi <haridi@kth.se>, KTH Royal Institute of Technology; Stefan Richter <s.richter@data-artisans.com>, data Artisans; Kostas Tzoumas <kostas@data-artisans.com>, data Artisans.
  2. Overview: • The Apache Flink System Architecture • Pipelined Consistent Snapshots • Operations with Snapshots • Large Scale Deployments and Evaluation
  3. The Apache Flink Framework (stack diagram, top to bottom). Libraries: SQL, Table, CEP, Graphs, ML. Core API: DataStream, DataSet. Runner: Dataflow Runtime. Setup: Cluster, Backend, Metrics.
  11. Distributed Architecture. Client: submits the optimised logical graph. Job Manager: • scheduling • state partitioning • snapshot coordination. Task Managers: • memory management • local snapshot execution • flow control; they run physical long-running tasks with locally managed state. Zookeeper: • passive failover • snapshot metadata. External Snapshot Store (e.g., HDFS): partial snapshots.
  12. Snapshots: 1. End-to-End Guarantees 2. Reconfiguration 3. Version Control 4. Isolation
  18. Stateful Processing. Tasks are invoked per input record and read/write managed state through logical operations (collections): state = f(input). The Local State Backend performs the physical operations, either In-Memory (Heap) or in an Embedded Off-heap+Disk Key-Value Store (RocksDB).
  24. A stream processor consumes input streams and keeps local states. Divide computation into epochs; capture all local states after completing an epoch; input and state can then be rolled back to a captured point in the past.
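The epoch idea on this slide can be illustrated with a minimal single-task sketch. It is not Flink code: the class `EpochedCounter`, the fixed `epoch_size`, and the in-memory list standing in for a replayable input stream are all assumptions made for illustration.

```python
from copy import deepcopy


class EpochedCounter:
    """Toy stream processor: counts words, snapshots at epoch boundaries."""

    def __init__(self, epoch_size):
        self.epoch_size = epoch_size   # records per epoch (assumed fixed)
        self.state = {}                # local state: word -> count
        self.offset = 0                # position in the replayable input
        self.snapshots = []            # (input offset, state copy) per epoch

    def run(self, source, fail_at=None):
        while self.offset < len(source):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated task failure")
            word = source[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % self.epoch_size == 0:
                # Epoch complete: capture the input position and local state.
                self.snapshots.append((self.offset, deepcopy(self.state)))

    def rollback(self):
        # Restore the latest captured point, or the start if none exists.
        offset, state = self.snapshots[-1] if self.snapshots else (0, {})
        self.offset = offset
        self.state = deepcopy(state)


source = ["a", "b", "a", "c", "a", "b", "c"]
proc = EpochedCounter(epoch_size=3)
try:
    proc.run(source, fail_at=5)   # fails in the middle of the second epoch
except RuntimeError:
    proc.rollback()               # back to the end of the first epoch
proc.run(source)                  # replay the remaining input
print(proc.state)
```

Because both the input offset and the state are restored together, the records of the failed epoch are counted once, not twice.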
  31. A Synchronous Approach: the master has the pipeline drain each epoch in turn (drain epoch 1, then drain epoch 2) and copy states to the Snapshot Store.
  32. Synchronous Snapshots • In use: Storm Trident and Spark Streaming • A conservative approach, equivalent to batching • Can cause unnecessary latency (master coordination) • Processing is no longer continuous • Forces many tasks to be idle • Instead, in Apache Flink snapshots are pipelined
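  42. Pipelined Snapshots: insert markers into the streams; tasks A to E perform epoch alignment on the markers and copy their state asynchronously to the Snapshot Store (copies arrive in the order B, A, C, D, E); the snapshot completes once every task's state has been copied.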
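Epoch alignment at a task with several input channels can be sketched as follows. This is an illustrative toy, not Flink's implementation: `MARKER`, `AligningTask`, and the synchronous in-memory "snapshot" are assumptions, and the asynchronous state copy shown on the slide is not modelled.

```python
MARKER = object()  # hypothetical epoch marker injected at the sources


class AligningTask:
    """Toy task that sums integers arriving on several FIFO input channels."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0                                  # local state
        self.blocked = set()                            # channels whose marker arrived
        self.buffers = [[] for _ in range(num_channels)]
        self.snapshots = []                             # state captured per epoch
        self.output = []                                # records and markers sent downstream

    def receive(self, channel, item):
        if channel in self.blocked:
            # Record of the next epoch on an already aligned channel: hold it back.
            self.buffers[channel].append(item)
            return
        if item is not MARKER:
            self.total += item
            self.output.append(self.total)
            return
        self.blocked.add(channel)
        if len(self.blocked) == self.num_channels:
            # All markers arrived: snapshot, forward the marker, unblock.
            self.snapshots.append(self.total)
            self.output.append(MARKER)
            self.blocked.clear()
            # Detach the held records before replaying them, so that items
            # re-buffered during the replay are not processed twice.
            held = [list(buf) for buf in self.buffers]
            for buf in self.buffers:
                buf.clear()
            for ch, items in enumerate(held):
                for held_item in items:
                    self.receive(ch, held_item)


task = AligningTask(num_channels=2)
task.receive(0, 1)
task.receive(0, MARKER)   # channel 0 has finished epoch 1
task.receive(0, 10)       # epoch 2 record on channel 0: buffered
task.receive(1, 2)        # epoch 1 record on channel 1: processed
task.receive(1, MARKER)   # alignment completes; the snapshot holds 1 + 2
print(task.snapshots, task.total)
```

The snapshot contains exactly the effects of the records that preceded the markers on every channel, while processing on unaligned channels continues without a global pause.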
  45. Pipelined Snapshots (cycles). Problem: we cannot wait indefinitely for records in cycles. Solution: log the in-flight records within a cycle as part of the snapshot, and replay them upon recovery.
  46. Technique Highlights • Offers exactly-once processing guarantees • Issued periodically/externally by the user • Naturally respects flow control mechanisms • Channel state logging limited to cycles only • Multiple epoch snapshots can be pipelined • Can offer weaker at-least-once processing guarantees by simply dropping alignment (aligning vs. no alignment cost)
  47. Snapshots Usages: 1. End-to-End Guarantees 2. Reconfiguration 3. Version Control 4. Isolation
  48. Exactly-Once: Input and Processing. Important Assumptions: • Input streams are persisted with offset indexes (e.g., Kafka, Kinesis) • Data Channels are FIFO and reliable (no loss). Each epoch either completes or repeats.
  49. Exactly-Once Output • Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores). • Transactional Processing ~ requires a two-phase coordination: a snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*). Epoch states in the diagram: epoch n in-progress, epoch n-1 pending, epoch n-2 pending, epoch n-3 committed.
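The transactional variant can be sketched as a sink that exposes output only after the corresponding snapshot completes. This is a simplified illustration, not the RollingSink mentioned on the slide: the class `TransactionalSink` and its method names are hypothetical.

```python
class TransactionalSink:
    """Toy sink: the output of an epoch becomes visible only after commit."""

    def __init__(self):
        self.in_progress = []   # records of the current epoch
        self.pending = {}       # epoch -> records awaiting snapshot completion
        self.committed = []     # externally visible output

    def write(self, record):
        self.in_progress.append(record)

    def pre_commit(self, epoch):
        # Phase 1: the epoch's marker reached the sink; seal its records.
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def notify_snapshot_complete(self, epoch):
        # Phase 2: the snapshot completed; commit this and any older epoch.
        for e in sorted(e for e in self.pending if e <= epoch):
            self.committed.extend(self.pending.pop(e))

    def recover(self, last_completed_epoch):
        # Drop output that is not covered by a completed snapshot; it will be
        # produced again when the input of those epochs is replayed.
        self.in_progress = []
        self.pending = {e: r for e, r in self.pending.items()
                        if e <= last_completed_epoch}
        self.notify_snapshot_complete(last_completed_epoch)


sink = TransactionalSink()
sink.write("a"); sink.write("b"); sink.pre_commit(1)
sink.write("c"); sink.pre_commit(2)
sink.notify_snapshot_complete(1)        # epoch 1 becomes visible
sink.write("d")                         # epoch 3, in progress
sink.recover(last_completed_epoch=1)    # failure: epochs 2 and 3 are dropped
sink.write("c"); sink.pre_commit(2)     # epoch 2 is replayed
sink.notify_snapshot_complete(2)
print(sink.committed)
```

After the rollback and replay, each record appears in the committed output once, which matches the in-progress, pending, committed progression on the slide.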
  55. Dataflow Reconfiguration: stop the pipeline, change parallelism, and restore from one of its snapshots (snap-1, snap-2, snap-3, …). Problem: How is state repartitioned from a snapshot?
  60. Reconfiguration: The Issue. The snapshot stores state by location (0x100: bob, …, 0x449: alice). Case I, full scan: Scan Remote Storage for Responsible Keys (too slow). Case II: Include Key Locations in Snapshot Metadata (bob: 0x100, carol: 0x344, …, alice: 0x449, chuck: 0x630, …) (too much metadata).
  63. Reconfiguration: Key Groups. Pre-partition state in hash(K) space, into key-groups. • Snapshot Metadata: Contains a reference per stored Key-Group (less metadata) • Reconfiguration: Contiguous key-group allocation to available tasks (less IO). Note: the number of key groups controls the trade-off between the metadata to keep and reconfiguration speed.
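A minimal sketch of key-group assignment, under stated assumptions: the hash function (CRC32), the function names, and the range formula below are illustrative choices, not necessarily what Flink uses. The point is that a key's group never changes, and only the mapping from contiguous group ranges to tasks depends on the parallelism.

```python
import zlib


def key_group(key, num_key_groups):
    # Stable hash of the key, folded into the fixed key-group space.
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


def task_for_key_group(group, num_key_groups, parallelism):
    # Contiguous allocation: neighbouring groups land on the same task.
    return group * parallelism // num_key_groups


def key_groups_of_task(task, num_key_groups, parallelism):
    return [g for g in range(num_key_groups)
            if task_for_key_group(g, num_key_groups, parallelism) == task]


NUM_KEY_GROUPS = 8
for parallelism in (2, 4):
    layout = [key_groups_of_task(t, NUM_KEY_GROUPS, parallelism)
              for t in range(parallelism)]
    print(parallelism, layout)
```

When the parallelism changes from 2 to 4, each new task reads a contiguous slice of the stored key-groups, so the snapshot metadata needs one reference per key-group and not one per key.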
  70. Version Control: fork and update (Pipeline v.1, Pipeline v.2, Pipeline v.3).
  73. Isolation Levels. An external query (select from facebook.userID, clients.name … inner join clients on …) can read state at one of two levels: read-committed (snapshot) or read-uncommitted (dirty read on latest state).
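The two isolation levels can be illustrated with a small sketch in which queries read either the state of the last completed snapshot or the latest live state. The class `QueryableState` and its methods are hypothetical and only model the distinction drawn on the slide.

```python
from copy import deepcopy


class QueryableState:
    """Toy keyed state that external queries can read at two isolation levels."""

    def __init__(self):
        self.live = {}        # latest state, updated per record
        self.committed = {}   # state as of the last completed snapshot

    def update(self, key, value):
        self.live[key] = value

    def complete_snapshot(self):
        self.committed = deepcopy(self.live)

    def rollback(self):
        self.live = deepcopy(self.committed)

    def read(self, key, isolation="read-committed"):
        if isolation == "read-committed":
            return self.committed.get(key)
        if isolation == "read-uncommitted":
            return self.live.get(key)
        raise ValueError(f"unknown isolation level: {isolation}")


state = QueryableState()
state.update("alice", 1)
state.complete_snapshot()
state.update("alice", 2)                               # not yet snapshotted
committed_view = state.read("alice")                   # snapshot value
dirty_view = state.read("alice", "read-uncommitted")   # latest value
state.rollback()                                       # failure before the next snapshot
after_rollback = state.read("alice", "read-uncommitted")
print(committed_view, dirty_view, after_rollback)
```

The dirty read observes a value that the rollback later discards, which is the cost of reading the latest state instead of a snapshot.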
  79. Large Scale Deployment at King. [Plot: Total Snapshotting Time (sec, 0 to 250) vs. Global State Size (GB, 100 to 500); total time / snapshot (alignment + async copies), ~runtime overhead.] [Plot: Total Alignment Time (msec, 0 to 1400) vs. Parallelism (30, 50, 70) for PROC, WIN, OUT; alignment cost, which depends on • #shuffles (keyby) • parallelism.]
  80. Teaser: More paper highlights • We can use the same technique to coordinate externally managed state with snapshots. • Epoch markers can act as on-the-fly reconfiguration points. • Internals of asynchronous and incremental snapshots.