Flink Forward Berlin 2018: Nico Kruber - "Improving throughput and latency with Flink's network stack"

NICO KRUBER
SOLUTION ARCHITECT / SOFTWARE ENGINEER @ DATA ARTISANS,
APACHE FLINK COMMITTER
IMPROVING THROUGHPUT AND LATENCY
WITH FLINK’S NETWORK STACK

© 2018 data Artisans
FLINK DATA TRANSPORT (LOGICAL)
Abstraction over:
• Subtask output
‒ pipelined-bounded
‒ pipelined-unbounded
‒ blocking
• Scheduling type
‒ all at once
‒ next stage on complete output
‒ next stage on first output
• Transport
‒ high throughput via buffers
‒ low latency via buffer timeout
[Diagram: Subtasks 1 to 4 connected by stream partitions.]

FLINK DATA TRANSPORT (PHYSICAL)
[Diagram: TaskManager 1 and TaskManager 2 linked by one TCP connection. Subtasks 1 and 2 each queue buffers for receivers 3 and 4; Subtasks 3 and 4 each hold input queues for senders 1 and 2. Every subtask has its own buffer pool. Legend: empty buffer; buffer with data in queue.]

[Diagram, next two builds: the same four subtasks and buffer pools, now with backpressure marked on the shared TCP connection; the last build zooms in on a single sender/receiver pair.]

CREDIT-BASED FLOW CONTROL

CREDIT-BASED FLOW CONTROL (FLINK 1.5+)
[Diagram: Subtask 2 sends to Subtask 4 over a TCP connection. Labels: backlog on the sender side; exclusive buffers and floating buffers (from the buffer pool) on the receiver side.]

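[Diagram: credit-based flow control between Subtask 2 (sender) and Subtask 4 (receiver) over the TCP connection. The sender sends buffers and announces its backlog size (4 in the example); the receiver asks the buffer pool for floating buffers and announces credit. Counters shown: channel credit 2, unannounced credit 0.]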

• Never blocks the TCP connection
➢ Better resource utilization with data skew in multiplexed connections
• Avoids overloading slow receivers (direct control over the amount of buffered data)
➢ Improves checkpoint alignment
• Cost: additional announce messages (piggybacked), potential round-trip latency
[Chart: checkpoint duration, without flow control vs. with flow control.]
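
The credit exchange described on these slides can be illustrated with a toy, single-threaded model in plain Java. This is a sketch under stated assumptions, not Flink's implementation: the class and method names are invented, received buffers are treated as processed instantly, and the policy "borrow as many floating buffers as the announced backlog, if the pool has them" is a simplification chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CreditFlowSketch {
    /**
     * Simulates draining `backlog` buffers over one channel and returns how many
     * buffers the sender ships in each credit round trip.
     */
    static List<Integer> simulate(int backlog, int exclusive, int floating) {
        if (exclusive + floating <= 0) {
            throw new IllegalArgumentException("receiver needs at least one buffer");
        }
        List<Integer> sentPerRound = new ArrayList<>();
        int unannouncedCredit = exclusive; // receiver: free buffers not yet announced
        int channelCredit = 0;             // sender: buffers it may still send
        while (backlog > 0) {
            // Receiver -> sender: announce credit.
            channelCredit += unannouncedCredit;
            unannouncedCredit = 0;

            // Sender -> receiver: send buffers (one credit each) and announce backlog.
            int sent = Math.min(channelCredit, backlog);
            channelCredit -= sent;
            backlog -= sent;
            sentPerRound.add(sent);

            // Receiver: in this toy model all received buffers are processed at once,
            // so the exclusive buffers are free again. It then borrows floating
            // buffers to cover the announced backlog (assumed policy).
            int borrowed = Math.min(floating, backlog);
            unannouncedCredit = exclusive + borrowed;
        }
        return sentPerRound;
    }

    public static void main(String[] args) {
        System.out.println(simulate(10, 2, 8)); // with floating buffers
        System.out.println(simulate(10, 2, 0)); // exclusive buffers only
    }
}
```

In this model, a backlog of 10 buffers drains in two rounds (2, then 8) when 8 floating buffers are available, and in five rounds of 2 with exclusive buffers only. The sender never ships more than the credit it holds, which is the property that keeps the shared TCP connection from blocking.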

LOW LATENCY IMPROVEMENTS

NETWORK STACK (EXTENDED)
[Diagram: on the sending side, Subtasks 1 and 2 write through a RecordWriter into buffer pools served by the NettyServer; a TCP connection leads to the NettyClient on the receiving side, whose buffer pools feed the RecordReaders of Subtasks 3 and 4. The next slides zoom in on the sender.]

FROM RECORD TO NETWORK
[Diagram: in Subtask 1, the StreamRecordWriter gets a new buffer from the buffer pool, writes data and updates the writer index, and notifies the NettyServer of new data; the NettyServer takes the data and removes the buffer.]

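FROM RECORD TO NETWORK
[Diagram: same view with an Output Flusher added. The StreamRecordWriter writes data and updates the writer index; the Output Flusher triggers a flush, which notifies the NettyServer of new data; the NettyServer takes the data and updates the reader index.]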

BUFFER BUILDER & CONSUMER
• Producer-Consumer structure with lightweight synchronization
• BufferBuilder (volatile int writePosition): append(), commit(), finish()
• BufferConsumer (int readPosition): build() → Buffer.readOnlySlice()
[Diagram: BufferBuilder and BufferConsumer share one MemorySegment; build() returns a Buffer wrapping the MemorySegment.]
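
A minimal sketch of this producer-consumer idea in plain Java follows. It is not Flink's BufferBuilder or BufferConsumer: both roles are folded into one invented class (`BufferSketch`), a byte array stands in for the MemorySegment, finish() is omitted, and build() copies the bytes where the slide shows a read-only slice. It assumes exactly one producer thread and one consumer thread.

```java
import java.util.Arrays;

class BufferSketch {
    private final byte[] segment;        // stands in for the MemorySegment
    private int pendingWritePosition;    // producer-only: appended but not yet committed
    private volatile int writePosition;  // committed bytes, published to the consumer
    private int readPosition;            // consumer-only

    BufferSketch(int size) {
        segment = new byte[size];
    }

    /** Producer: copy as much of data as fits, return the number of bytes taken. */
    int append(byte[] data) {
        int n = Math.min(data.length, segment.length - pendingWritePosition);
        System.arraycopy(data, 0, segment, pendingWritePosition, n);
        pendingWritePosition += n;
        return n;
    }

    /** Producer: publish all appended bytes with a single volatile write. */
    void commit() {
        writePosition = pendingWritePosition;
    }

    /** Consumer: return the committed bytes that were not handed out before. */
    byte[] build() {
        int committed = writePosition; // volatile read, pairs with commit()
        byte[] slice = Arrays.copyOfRange(segment, readPosition, committed);
        readPosition = committed;
        return slice;
    }
}
```

The only shared mutable field is the volatile write position, so the consumer sees every byte written before the matching commit() without taking a lock; that is one way to read "lightweight synchronization".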

LATENCY VS. THROUGHPUT
▪ low latency via buffer timeout
▪ high throughput through buffers
[Chart: latency vs. throughput; footnote: 100 nodes x 8 slots.]

CONNECTION TYPES

LOCAL VS. REMOTE CONNECTIONS
• Every (unchained) connection:
‒ Requires serialization
‒ Assembles serialized records into buffers
‒ Forwards a buffer when it is full or the buffer timeout hits
• Remote connection:
‒ Sent via multiplexed Netty TCP connections (one per pair of tasks and task managers)
‒ As soon as a buffer is on the wire, it can be re-used
➢ Allows credit-based flow control to control amount of buffered data
• Local connection:
‒ Direct connection between sender and receiver: buffers are shared
➢ No need for further flow control (buffered data = sender buffers)
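
The rule "forward a buffer when it is full or the buffer timeout hits" can be sketched as below. This is an illustrative model only: `BufferTimeoutSketch` and its methods are invented names, the capacity counts records where a real buffer counts bytes, and the timeout is represented by an explicit `onBufferTimeout()` call that a periodic flusher would make.

```java
import java.util.ArrayList;
import java.util.List;

class BufferTimeoutSketch {
    private final int capacity;                        // records per buffer (stand-in for bytes)
    private List<String> current = new ArrayList<>();  // buffer currently being filled
    final List<List<String>> forwarded = new ArrayList<>();

    BufferTimeoutSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Writer path: append a serialized record, forward the buffer once it is full. */
    void write(String record) {
        current.add(record);
        if (current.size() == capacity) {
            forward();
        }
    }

    /** Flusher path: called once per buffer timeout, forwards a partially filled buffer. */
    void onBufferTimeout() {
        if (!current.isEmpty()) {
            forward();
        }
    }

    private void forward() {
        forwarded.add(current);
        current = new ArrayList<>();
    }
}
```

On a busy channel the "full" path dominates and buffers leave well filled; on a quiet channel the timeout path bounds how long a record can wait.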

TUNING OPTIONS

• taskmanager.network.credit-model: true/false
• taskmanager.network.memory.buffers-per-channel: 2
• taskmanager.network.memory.floating-buffers-per-gate: 8
• Number of exclusive buffers should be enough to saturate the network for a full round-trip time (2 x network latency)
➢ #exclBuffers * segmentSize = round-trip-time * throughput
[Diagram: Subtask 2 sends buffers to Subtask 4, which announces credit. Counters shown: channel credit 2, unannounced credit 0, backlog size 0.]
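
The sizing rule #exclBuffers * segmentSize = round-trip-time * throughput can be turned into a small calculation. The sketch below only rearranges that formula; the input values (1 ms round trip, 100 MB/s per channel, 32 KiB segments) are assumptions picked for illustration, not measurements or recommendations.

```java
class ExclusiveBufferSizing {
    /** Smallest buffer count whose total size covers one round trip at the given rate. */
    static int exclusiveBuffers(double roundTripSeconds, double bytesPerSecond, int segmentSizeBytes) {
        double bytesInFlight = roundTripSeconds * bytesPerSecond;
        return (int) Math.ceil(bytesInFlight / segmentSizeBytes);
    }

    public static void main(String[] args) {
        // Assumed inputs: 1 ms round trip, 100 MB/s on the channel, 32 KiB segments.
        System.out.println(exclusiveBuffers(0.001, 100_000_000, 32 * 1024)); // prints 4
    }
}
```

With these assumed numbers about 100 KB are in flight per round trip, which needs 4 segments of 32 KiB. A slower or lower-latency link needs fewer.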

• Number of exclusive buffers too high
➢ higher number of required network buffers
➢ buffering more during checkpoint alignment
➢ BUT: faster ramp-up (before floating buffers kick in)
• Number of exclusive buffers too low
➢ times of inactivity during ramp-up
[Diagram: same credit view as on the previous slide (channel credit 2, unannounced credit 0, backlog size 0).]

BUFFER TIMEOUT
• StreamExecutionEnvironment#setBufferTimeout()
• Affects every unchained connection: remote or local
➢ Upper bound on latency for low throughput channels(!)
➢ Trade-off throughput vs. latency (see earlier)

NETWORK THREADS
• taskmanager.network.netty.client.numThreads (default: number of slots)
• taskmanager.network.netty.server.numThreads (default: number of slots)
• May become a bottleneck if thread(s) are overloaded
• BUT: may also become an overhead if too many
➢ Do your own benchmarks and verify for your job!

USE LINUX-NATIVE EPOLL (FLINK 1.6+)
• taskmanager.network.netty.transport: AUTO | NIO | EPOLL
• EPOLL may reduce the channel polling overhead between user space and kernel/system space
• There should be no downside in activating this or at least AUTO.
➢ Do your own benchmarks for your job!
• Please give feedback in FLINK-10177 so that we can decide whether to use AUTO by default.

METRICS

NETWORK STACK METRICS
• Backpressure monitor
‒ Web UI / REST API (/jobs/:jobid/vertices/:vertexid/backpressure)
• [input, output]QueueLength
• numRecords[In, Out]
• numBytesOut, numBytesIn[Local, Remote]
• numBuffersOut, numBuffersIn[Local, Remote] (Flink 1.5.3+, 1.6.1+)

LATENCY MARKERS
• ExecutionConfig#setLatencyTrackingInterval() (default: every 2s)
• Sources periodically emit a LatencyMarker with a timestamp
• These flow with the stream and properly queue behind records
• Latency markers bypass operators, e.g. windows
• Once received, they will be re-emitted onto a random output channel
• We create one histogram per source ↔ operator pair (window size: 128)
• source_id.<sourceId>.source_subtask_index.<subtaskIdx>.operator_id.<operatorId>.operator_subtask_index.<subtaskIdx>
➢ 10 operators, parallelism 100 = 9 * 100 * 100 = 90,000 histograms!
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#latency-tracking
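
The histogram count on this slide follows from one histogram per (source subtask, operator subtask) pair. The small sketch below reproduces the arithmetic; it assumes, as the slide's numbers imply, that one of the 10 operators is the source and that all operators run with the same parallelism. The class and method names are invented.

```java
class LatencyHistogramCount {
    /** One histogram per source subtask and downstream operator subtask pair. */
    static long histograms(int sourceOperators, int downstreamOperators, int parallelism) {
        return (long) sourceOperators * downstreamOperators * parallelism * parallelism;
    }

    public static void main(String[] args) {
        // 10 operators = 1 source + 9 downstream operators, parallelism 100.
        System.out.println(histograms(1, 9, 100)); // prints 90000
    }
}
```

The count grows with the square of the parallelism, so doubling the parallelism to 200 in the same job would give 360,000 histograms.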

COMMON ANTIPATTERNS

REPEATED KEYBY’S ON THE SAME KEY
• KeyedStream is not retained
‒ UDF could have changed the key
• Additional keyBy() is necessary to gain access to keyed state, but:
‒ Prevents chaining
‒ Adds an additional shuffle
➢ DataStreamUtils#reinterpretAsKeyedStream
[Diagram: a pipeline that calls .keyBy("location") twice on the same key.]

[Diagram: Subtask 2 sends to Subtask 4 over a TCP connection; the receiver's buffer pool provides exclusive and floating buffers.]
Buffers: #channels * 2 + 8
LatencyMarker
➢ synchronization overhead for the output flusher!
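
The expression "#channels * 2 + 8" matches the tuning options shown earlier (2 exclusive buffers per channel, 8 floating buffers per gate). The sketch below evaluates it for one input gate and converts the result to bytes; the 32 KiB segment size and the channel counts are assumptions for illustration, and the names are invented.

```java
class GateBufferSketch {
    /** Buffers one input gate needs: exclusive buffers per channel plus floating buffers. */
    static long buffersPerGate(int channels, int buffersPerChannel, int floatingBuffersPerGate) {
        return (long) channels * buffersPerChannel + floatingBuffersPerGate;
    }

    /** The same requirement expressed in bytes for a given segment size. */
    static long bytesPerGate(int channels, int buffersPerChannel, int floatingBuffersPerGate,
                             int segmentSizeBytes) {
        return buffersPerGate(channels, buffersPerChannel, floatingBuffersPerGate) * segmentSizeBytes;
    }

    public static void main(String[] args) {
        int segment = 32 * 1024; // assumed segment size
        System.out.println(buffersPerGate(100, 2, 8));         // prints 208
        System.out.println(bytesPerGate(100, 2, 8, segment));  // prints 6815744
        System.out.println(bytesPerGate(1000, 2, 8, segment)); // prints 65798144
    }
}
```

Under these assumptions a gate with 100 channels needs 208 buffers (about 6.5 MiB), and one with 1000 channels about 63 MiB, per receiving subtask.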

WHAT‘S UP NEXT?

NETWORK SERIALIZATION STACK (FLINK 1.7?)
• Broadcast records are serialized once per record, not once per channel
• Only one intermediate serialization buffer (on heap)
➢ significantly reduces the memory footprint
• see FLINK-9913
[Diagram: RecordWriter of Subtask 2 on TaskManager 1.]

OPENSSL-BASED SSL ENGINE (FLINK 1.7?)
• Runs native code
• Uses advanced CPU instruction sets
➢ May reduce encryption/decryption overhead (needs verification)
• see FLINK-9816

MOVE OUTPUT FLUSHER TO NETTY
• Current implementation may have (GC) problems with many channels
➢ schedule the output flusher inside the Netty event loop
• see FLINK-8625

THANK YOU!
@dataArtisans
@ApacheFlink
WE ARE HIRING
data-artisans.com/careers
