Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling stream data pipelines"

Scaling stream data pipelines
Flavio Junqueira, Pravega - Dell EMC
Till Rohrmann, Data Artisans

Motivation
Flink Forward - San Francisco, 2018 2

Flink Forward - San Francisco, 2018
Social networks
Online shopping
Streams ahoy!
Stream of user events
• Status updates
• Online transactions
3

Social networks
Online shopping
Server monitoring
• Status updates
Stream of server events
• CPU, memory, disk utilization
Streams ahoy!
4

Social networks
Online shopping
Server monitoring
Sensors (IoT)
• Status updates
Stream of server events
• CPU, memory, disk utilization
Stream of sensor events
• Temperature samples
• Samples from radar and image sensors in cars
Streams ahoy!
5

Workload cycles and seasonal spikes
Daily cycles
NYC Yellow Taxi Trip Records, March 2015
http://www.nyc.gov/html/tlc/html/about/trip
_record_data.shtml
Seasonal spikes
https://www.slideshare.net/iwmw/building-
highly-scalable-web-applications/7-
Seasonal_Spikes

Workload cycles and spikes
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Seasonal spikes
0:00
2:00
4:00
6:00
8:00
10:00
12:00
14:00
16:00
18:00
20:00
22:00
1:00
3:00
5:00
7:00
9:00
11:00
13:00
15:00
17:00
19:00
21:00
23:00
Daily cycles
0
2
4
6
8
10
12
14
Weekly cycles
Unplanned

Overprovisioning… what if we don’t want to
overprovision?

Event processing
Processor 1Source
Source emits 2
events/second
Processor processes
3 events/second
Append-only Log
Colors represent event keys

Event processing
Source
Processor processes
3 events/second
Processor 1
Source emits 2
events/second
Append-only Log

Event processing
Source
ü Source rate
increases
ü New rate: 4
events/second
ü Processor still processes 3
events/second
ü Can’t keep up with the
source rate
Processor 1
Append-only Log

Event processing
Source
ü Source rate
increases
ü New rate: 4
events/second
Processor 1
Append-only Log
Processor 2
ü Add a second processor
ü Each processor processes 3
events/second
ü Can keep up with the rate

Event processing
Source
ü Source rate
increases
ü New rate: 4
events/second
Processor 1
Append-only Log
Processor 2
events/second
Problem: Per-key order

Event processing
Source
Processor 1
Processor 2
ü Source rate
increases
ü New rate: 4
events/second
events/second
Split the input and
add processors
Append-only Log

Event processing
Source
Processor 1
Processor 2
ü Source rate
increases
ü New rate: 4
events/second
events/second
Split the input and
add processors
Append-only Log
Problem: Per-key order

Event processing
Source
Processor 1
Processor 2
ü Source rate
increases
ü New rate: 4
events/second
events/second
Split the input and
add processors
Processor 2 only starts once earlier
events have been processed

What about the order of events?
What happens if the rate increases again?
What if it drops?

Scaling in Pravega

Pravega
• Storing data streams
• Young project, under active development
• Open source
http://pravega.io
http://github.com/pravega/pravega
19Flink Forward - San Francisco, 2018

Time
PresentRecent
Past
Distant
Past
Anatomy of a stream
20

Messaging
Pub-sub
Bulk store
Time
PresentRecent
Past
Distant
Past
Anatomy of a stream
21

Time
PresentRecent
Past
Distant
Past
Anatomy of a stream
22
Pravega

Time
PresentRecent
Past
Distant
Past
Anatomy of a stream
Unbounded
amount of data
Ingestion rate
might vary
23
Pravega

Pravega aims to be a stream store able to:
• Store stream data permanently
• Preserve order
• Accommodate unbounded streams
• Adapt to varying workloads automatically
• Low-latency from append to read

Pravega and Streams
….. 01110110 01100001 01101100
….. 01001010 01101111 01101001
Pravega
01000110
01110110
Append Read
01000110
01110110
Ingest stream data Process stream data
25

Pravega and Streams
01000110
01110110
Append Read
Event writer
Event writer
Event reader
Event reader
Group
• Load balance
• Grow and shrink
Pravega
Ingest stream data Process stream data

Segments in Pravega
01000111
01110110
11000110
01000111
01110110
11000110
Pravega
Stream Composition of
Segment:
• Stream unit
• Append only
• Sequence of bytes
27

Parallelism

Segments in Pravega
Pravega
01000110
01110110
Segments
Append Read
01000110
01110110
01101111
01101001 01101001
01101111
Segments
• Segments are sequences of bytes
• Use routing keys to determine segment
〈key, 01101001 〉
Routing
key
….. 01110110 01100001 01101100
….. 01001010 01101111 01101001
29

Segments can be sealed

Segments in Pravega
….. 01110110 01100001 01101100
….. 01101001 01110110 01001010
Pravega
01000110
01110110
Segments
Append Read
01000110
01110110
01101111
01101001 01101001
01101111
Segments
Once sealed, a segment
can’t be appended to any
longer.
E.g., ad clicks
31

How is sealing segments useful?

Segments in Pravega
Pravega
01000110
Segments
Segments
01101111
01000110
01000110
01000110
01101111
01101111
01101111
01101111
01000110
01000110
0110111101101111
01000110
01101111
Stream
Compose to form a stream

Segments in Pravega
01000110
Segments
Segments
01101111
01000110
01000110
01000110
01101111
01101111
01101111
01101111
01000110
01000110
0110111101101111
01000110
01101111
Stream
• Each segment can live in a different server
• Not limited to the capacity of a single server
• Unbounded streams
00101111 01101001
34
Pravega

Segments in Pravega
01000110
Segments
Segments
01101111
01000110
01000110
01000110
01101111
01101111
01101111
01101111
01000110
01000110
01101111
01000110
01101111
Stream
01101111
Pravega

Stream scaling

01000110
Scaling a stream
….. 01110110 01100001 01101100 01000110
• Stream has one
segment
1
….. 01110110 01100001 01101100
• Seal current
segment
• Create new ones
2
01000110
01000110
• Say input load has increased
• Need more parallelism
• Auto or manual scaling

Routing
key space
0.0
1.0
Time
Split Split Merge
0.5
0.75
Segment 1 Segment 2
Segment 3
Segment 4
Segment 5
Segment 6
t0 t1
t2

Routing
key space
0.0
1.0
Time
0.5
0.75
Segment 1 Segment 2
Segment 3
Segment 4
Segment 5
Segment 6
t0 t1
t2
Key ranges are not statically
assigned to segments
Split Split Merge

Daily Cycles
Peak rate is 10x higher than lowest rate
4:00 AM
9:00 AM
NYC Yellow Taxi Trip Records, March 2015
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Pravega Auto Scaling
Merge Split

Source: Virtual cluster - Nautilus Platform

Source: Virtual cluster - Nautilus PlatformScale up
Scale down

How do I control scaling?

Scaling policies
• Configured on a per stream basis
• Specifies a policy for the stream
• Policies
• Fixed
• Set of segments is fixed
• Bytes per second
• Scales up and down according to volume of data
• Target data rate
• Events per second
• Scales up and down according to volume of events
• Target event rate

Auto-Scaling: Triggering a scaling event
• By byte and event rates
• Target T per segment
• Reports every 2 minutes
ü 2-min rate (2M)
ü 5-min rate (5M)
ü 10-min rate (10M)
ü 20-min rate (20M)
Scale up
x x + 2 min x + 4 min x + 6 min time
• Scaling down
∧ 2M, 5M, 10M < T
∧ 20M < T / 2
2M = 60
5M = 56
10M = 46
T = 50
2M = 60
5M = 60
10M = 48
T = 50
2M = 60
5M = 60
10M = 5
T = 50
2M = 60
5M = 60
10M = 52
T = 50
Scale down
x x + 2 min x + 4 min x + 6 min time
2M = 20
5M = 20
10M = 20
20M = 27
T = 50
• Scaling up
∨ 2M > 5 x T
∨ 5M > 2 x T
∨ 10M > T
2M = 20
5M = 20
10M = 20
20M = 26
T = 50
2M = 20
5M = 20
10M = 20
20M = 25
T = 50
2M = 20
5M = 20
10M = 20
20M = 24
T = 50

Read order

Reader groups + Scaling
Pravega
Segment 2
Segment 1
Reader
Reader
1
Pravega
Segment 2
Segment 1
Reader
Reader
2
Segment 3
Segment 4
Scale up!

Reader groups + Scaling
Pravega
Segment 2
Segment 1
Reader
Reader
3
Segment 3
Segment 4
• Hit end of segment
• Get successors
• Update reader group state
Pravega
Reader
Reader
4
Segment 4
Segment 2
Segment 3
Pravega
Reader {3}
Reader {2, 4}
5
Segment 4
Segment 2
Segment 3

Building pipelines –
Scaling downstream

Scaling pipelines
Stage 1 Stage 2Source
All stages can handle the load induced by the source

Scaling pipelines
Scaled
Stage 1 Stage 2Big source
Stage 2 can’t cope with
the load change
Load coming from
source increases
Stage 1 scales and
adapts to the load
change

Scaling signals
Pravega AppBig source
• Pravega won’t scale
the application

Scaling signals
Pravega AppBig source
• Pravega won’t scale the
application downstream
• … but it can signal
• E.g., more segments
• E.g., number of unread
bytes is growing
Signals from Pravega

Reader group notifier
• Listener API
• Register a listener to react to changes
• E.g., changes to the number of segments
ReaderGroupManager groupManager = new ReaderGroupManagerImpl(SCOPE, controller,
clientFactory, connectionFactory);
ReaderGroup readerGroup = groupManager.createReaderGroup(GROUP_NAME,
ReaderGroupConfig.builder().build(), Collections.singleton(STREAM));
readerGroup.getSegmentNotifier(executor).registerListener(segmentNotification -> {
int numOfReaders = segmentNotification.getNumOfReaders();
int segments = segmentNotification.getNumOfSegments();
if (numOfReaders < segments) {
//Scale up number of readers based on application capacity
} else {
//More readers available time to shut down some
}
});

Reader group: listener and metrics
• Listener API
• Register a listener to react to changes
• E.g., changes to the number of segments
• Metrics
• Reports specific values of interest
• E.g., number of unread bytes in a stream

Consuming Pravega streams
with Apache Flink

How to read Pravega streams with Flink?
Task Manager
ReaderPravega Stream
• FlinkPravegaReader
• ReaderGroup
• Assignment of segments
• Rebalance
• Key to automatic scaling Task Manager
Task Manager
Task Manager
Reader
https://github.com/pravega/flink-connectors

How to react to segment changes?
Pravega Stream
Task Manager
Job Manager
Reader Rescaling
Policy
Task Manager Task Manager
(2) Segment change
notification
(1) Register segment
listener
(3) Rescale job
(4) Take savepoint
(5) Redeploy & resume tasks
Reader

Scaling signals
61
• Latency
• Throughput
• Resource utilization
• Connector signals

Rescaling Flink applications

Scaling stateless jobs
63
Scale Up Scale Down
Source
Mapper
Sink
• Scale up: Deploy new tasks
• Scale down: Cancel running tasks

Scaling stateful jobs
64
?
• Problem: Which state to assign to new task?

Different state types in Flink

Keyed vs. operator state
66
• State bound to a key
• E.g. Keyed UDF and window state
• State bound to a subtask
• E.g. Source state
Keyed Operator

Repartitioning keyed state
• Similar to consistent hashing
• Split key space into key groups
• Assign key groups to tasks
67
Key space
Key group #1 Key group #2
Key group #3Key group #4

Repartitioning keyed state contd.
• Rescaling changes key group
assignment
• Maximum parallelism defined by
#key groups
68Flink Forward - San Francisco, 2018

Repartitioning operator state
• Breaking operator state up into finer
granularity
• State has to contain multiple entries
• Automatic repartitioning wrt granularity
69
#1 #2
#3

Acquiring New Resources –
Resource Elasticity

Flink’s Revamped Distributed Architecture
• Motivation
• Resource elasticity
• Support for different deployments
• REST interface for client-cluster
communication
• Introduce generic building blocks
• Compose blocks for different scenarios

The Building Blocks
72
• ClusterManager-specific
• May live across jobs
• Manages available Containers/TaskManagers
• Used to acquire / release resources
ResourceManager
TaskManagerJobManager
• Registers at ResourceManager
• Gets tasks from one or more JobManagers
• Single job only, started per job
• Thinks in terms of "task slots"
• Deploys and monitors job/task execution
Dispatcher
• Lives across jobs
• Touch-point for job submissions
• Spawns JobManagers

The Building Blocks
73
ResourceManager
(3) Request slots
TaskManager
JobManager
(4) Start TaskManagers
(5) Register
(7) Deploy Tasks
Dispatcher
Client
(1) Submit Job
(2) Start
JobManager
(6) Offer slots

Building Flink-on-YARN
74
YARN
ResourceManager
YARN Cluster
YARN Cluster
Client
(1) Submit YARN App.
(JobGraph / JARs)
Application Master
Flink-YARN
ResourceManager
JobManager TaskManager
TaskManager
TaskManager
(2) Spawn
Application Master
(4) Start
TaskManagers
(6) Deploy
Tasks
(5) Register
(3) Request slots

Does It Actually Work?

Demo Topology
76
Pravega
Source Sink
FILE.out
• Executed on Yarn to support dynamic resource allocation
time
Event rate

Wrap Up

Wrap up
• Pravega
• Stream store
• Scalable ingestion of continuously generated data
• Stream scaling
• Apache Flink
• Stateful job scaling
• Full resource elasticity
• Operator rescaling policies work in progress
• Pravega + Apache Flink
• End-to-end scalable data pipelines

Questions?
http://pravega.io
http://github.com/pravega/pravega
http://flink.apache.org
http://github.com/pravega/flink-connectors
https://github.com/tillrohrmann/flink/tree/rescalingPolicy
E-mail: fpj@apache.org, trohrmann@apache.org
Twitter: @fpjunqueira, @stsffap

Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling stream data pipelines"

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling stream data pipelines"

Similar to Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling stream data pipelines" (20)

More from Flink Forward

More from Flink Forward (20)

Recently uploaded

Recently uploaded (20)

Flink Forward San Francisco 2018: Till Rohrmann & Flavio Junqueira - "Scaling stream data pipelines"