Apache Flink™ deep-dive
Unified Batch and Stream Processing
Robert Metzger
@rmetzger_
Flink’s Recent History
[Timeline: April 2014 → Dec 2014 (Top Level Project graduation) → April 2015, with releases 0.5, 0.6, 0.7, 0.9-m1, and 0.9 along the way]
What is Flink 3
[Stack diagram: libraries and compatibility layers — Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow (WiP), MRQL, Cascading (WiP), Zeppelin — sit on top of the DataSet (Java/Scala) and DataStream APIs, which compile down to the streaming dataflow runtime; deployment modes are Local, Remote, YARN, Tez, and Embedded]
Program compilation 4

Example program (transitive closure):

case class Path (from: Long, to: Long)
val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
      .union(paths)
      .distinct()
    next
}

Pre-flight (client): the type extraction stack and the optimizer turn the program into a dataflow graph.
Master: task scheduling and dataflow metadata; deploys operators and tracks intermediate results.
Workers: execute the operators (in the example plan: DataSource orders.tbl → Filter → Map and DataSource lineitem.tbl, hash-partitioned [0] on both sides into a Hybrid Hash Join with build HT and probe sides, then a sorted GroupReduce fed via a forward ship strategy).
 The dataflow graph is independent of whether it is a batch or streaming job.
 The layered architecture allows plugging in components.
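The dataflow graph produced in pre-flight can be inspected from the client. A minimal sketch, assuming the Scala DataSet API's getExecutionPlan (the program and sink path are illustrative):

import org.apache.flink.api.scala._

object PlanDump {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.fromElements(1, 2, 3)
      .map(_ * 2)
      .writeAsText("/tmp/out") // illustrative sink path

    // JSON dump of the optimized dataflow graph, i.e. the result
    // of the pre-flight phase on the client.
    println(env.getExecutionPlan())
  }
}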
Native workload support 5
How can an engine natively support all of these workloads? And what does "native" mean?
 Streaming topologies → low latency
 Long batch pipelines → resource utilization
 Machine learning at scale → iterative algorithms
 Graph analysis → mutable state
E.g.: Non-native iterations 6
The client drives the loop, submitting one MapReduce job per step (Step → Step → Step → ...):

for (int i = 0; i < maxIterations; i++) {
  // Execute MapReduce job
}

 "Teaching an old elephant new tricks"
 The system is treated as a black box
E.g.: Non-native streaming 7
A stream discretizer chops the data stream into small batches and issues one job per batch (Job → Job → Job → ...):

while (true) {
  // get next few records
  // issue batch job
}

 Simulates a stream processor with a batch system
Native workload support 8
(Recap of slide 5: the same four workloads and their requirements.)
Ingredients for “native” support
1. Execute everything as streams
Pipelined execution, push model
2. Special code paths for batch
Automatic job optimization, fault tolerance
3. Allow some iterative (cyclic) dataflows
4. Allow some mutable state
5. Operate on managed memory
Make data processing on the JVM robust
9
Flink by Use Case
10
Stream data processing
streaming dataflows
11
Full talk tomorrow:
3:10PM, Grand Ballroom 220A
Stream processing with Flink
Pipelined stream processor
12
Streaming shuffle!
 Low latency
 Operators push data forward
Expressive APIs
13
DataStream API (streaming):

case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
Checkpointing / Recovery 14
Chandy-Lamport algorithm for consistent, asynchronous, distributed snapshots: checkpoint barriers are pushed through the dataflow.
 Records before the barrier are part of the snapshot; records after the barrier are not (they are backed up until the next snapshot).
 Guarantees exactly-once processing
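Enabling this in a program is a one-liner. A minimal sketch, assuming the enableCheckpointing method and the socketTextStream source of later DataStream releases (the deck's slides use fromSocketStream, and groupBy follows the 0.9-era API shown earlier; newer releases call it keyBy); interval and data are illustrative:

import org.apache.flink.streaming.api.scala._

object CheckpointedCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Inject a checkpoint barrier into the sources every 5 seconds;
    // operators snapshot their state as the barrier passes through.
    env.enableCheckpointing(5000)

    env.socketTextStream("localhost", 9999)
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)
      .print()

    env.execute("checkpointed count")
  }
}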
Batch processing
Batch on Streaming
15
Batch on a streaming engine
16
File in HDFS
Filter Map Result 1
Map Result 2
 Batch program, completely pipelined
 Data is never materialized anywhere (in this example)
Batch on a streaming engine 17
[Dataflow figure: a small Data Source streams through a Map operator into the build side of the Join operator, running in parallel; the large Data Source is mapped and probed against the join once the build side has finished; Map operators and Data Sinks run in parallel]
Batch processing requirements
 Get the data processed as fast as possible
• Automatic job optimizer
• Efficient memory management
 Robust processing
• provide fault-tolerance
• again, memory management
18
Optimizer
 Cost-based optimizer
 Select data shipping strategy (forward, partition, broadcast)
 Select local execution strategy (sort-merge join / hash join)
 Caching of loop invariant data (iterations)
19
(The optimizer runs in the pre-flight phase on the client, turning the program into a dataflow graph as shown on slide 4.)
Two execution plans 20
Plan A: broadcast the filtered/mapped orders.tbl to the Hybrid Hash Join (build side), forward lineitem.tbl (probe side), pre-aggregate with a Combine, then GroupReduce with sort.
Plan B: hash-partition both inputs on the join key [0], join (build HT / probe), then reuse the partitioning (hash-part [0,1]) for the sorted GroupReduce.
The best plan depends on the relative sizes of the input files.
Memory Management
21
Operators on managed memory
22
Smooth out-of-core performance
23
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
Machine Learning Algorithms
Iterative data flows
24
Iterate in the Dataflow 26
 API and runtime support
 Automatic caching of loop-invariant data

IterationState state = getInitialState();
while (!terminationCriterion()) {
  state = step(state);
}
setFinalState(state);
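The same loop expressed natively with the DataSet API's iterate: the whole loop runs inside one dataflow, so the client submits a single job instead of one job per step. A minimal sketch; the step function and data are placeholders:

import org.apache.flink.api.scala._

object NativeIteration {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val initial: DataSet[Double] = env.fromElements(1.0, 2.0, 3.0)

    // Ten supersteps run inside the engine; loop-invariant inputs
    // are cached automatically instead of being re-read per step.
    val result = initial.iterate(10) { state =>
      state.map(_ / 2) // placeholder step function
    }

    result.print()
  }
}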
Example: Matrix Factorization 27
Factorizing a matrix with 28 billion ratings for recommendations.
More at: http://data-artisans.com/computing-recommendations-with-flink.html
Setups:
• 40 medium instances ("n1-highmem-8": 8 cores, 52 GB)
• 40 large instances ("n1-highmem-16": 16 cores, 104 GB)
Flink ML – Machine Learning
 Provides a complete toolchain
• scikit-learn-style pipelining
• data pre-processing
 Various algorithms
• Recommendations: ALS
• Supervised learning: Support Vector Machines
• …
 ML on streams: SAMOA. We are planning to add streaming support to ML.
28
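As a concrete taste of the toolchain, a minimal ALS sketch, assuming the FlinkML 0.9 API (parameter values and the tiny inline dataset are illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // (userId, itemId, rating) triples
    val ratings: DataSet[(Int, Int, Double)] =
      env.fromElements((1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0))

    val als = ALS()
      .setIterations(10)
      .setNumFactors(40)

    als.fit(ratings)

    // Predict ratings for unseen (user, item) pairs.
    als.predict(env.fromElements((2, 11))).print()
  }
}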
Graph Analysis
Stateful Iterations
29
Graph processing characteristics
[Chart: number of elements updated per iteration, over ~60 iterations; y-axis up to 45,000,000]
Iterate natively with state/deltas 31
 Keep state in a controlled way via a partitioned hash map
 Relax the immutability assumption of batch processing
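A minimal sketch of such a stateful iteration with the DataSet API's delta iterations: connected components, where the solution set is the partitioned state and only changed vertices travel through the loop (the tiny graph is illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object DeltaIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // (vertexId, componentId): every vertex starts as its own component
    val vertices = env.fromElements((1L, 1L), (2L, 2L), (3L, 3L))
    val edges = env.fromElements((1L, 2L), (2L, 3L))

    val components = vertices.iterateDelta(vertices, 100, Array(0)) {
      (solution, workset) =>
        // Send each changed vertex's component id to its neighbours.
        val candidates = workset.join(edges).where(0).equalTo(0) {
          (vertex, edge) => (edge._2, vertex._2)
        }
        // A neighbour is updated only if its component id shrinks.
        val updates = candidates.groupBy(0).min(1)
          .join(solution).where(0).equalTo(0) {
            (candidate, current, out: Collector[(Long, Long)]) =>
              if (candidate._2 < current._2) out.collect(candidate)
          }
        // The updates are both the delta to the solution set (the state)
        // and the workset for the next superstep.
        (updates, updates)
    }

    components.print()
  }
}

Because only the delta flows through the loop, each superstep gets cheaper as fewer elements change, matching the update curve above.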
… fast graph analysis 32
More at: http://data-artisans.com/data-analysis-with-flink.html
Gelly – Graph Processing API
33
 Transformations: map, filter, subgraph, union, reverse,
undirected
 Mutations: add vertex/edge, remove …
 Pregel-style vertex-centric iterations
 Library of algorithms
 Utilities: Special data types, loading, graph properties
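A minimal sketch of building and transforming a graph, calling Gelly's Java API from Scala (types and data are illustrative, and exact signatures may differ across 0.9-era releases):

import java.lang.{Double => JDouble, Long => JLong}
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.graph.{Edge, Graph, Vertex}

object GellySketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val vertices = env.fromElements(
      new Vertex[JLong, JDouble](1L, 1.0),
      new Vertex[JLong, JDouble](2L, 1.0))
    val edges = env.fromElements(
      new Edge[JLong, JDouble](1L, 2L, 0.5))

    val graph = Graph.fromDataSet(vertices, edges, env)

    // A transformation: turn the directed graph into an undirected one.
    graph.getUndirected().getEdges().print()
  }
}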
Gelly and Flink ML:
 Available in Flink 0.9 (so far only beta release)
 Still under heavy development
 Seamlessly integrate with the DataSet abstraction: preprocess data and consume results as needed
 Easy entry point for new contributors
34
Closing
35
Flink Meetup Groups
 SF Spark and Friends
• June 16, San Francisco
 Bay Area Flink Meetup
• June 17, Redwood City
 Chicago Flink Meetup
• June 30
 Stockholm, Sweden
 Berlin, Germany
36
Flink Forward registration & call for abstracts is open now
flink.apache.org 37
• 12/13 October 2015
• Meet developers and users of Flink!
• With Flink Workshops / Trainings!
flink.apache.org
@ApacheFlink
39
Backup
Flink community
[Chart: # of unique contributor ids by git commits, Aug 2010 – Jul 2015, growing from 0 to ~120]
What is Apache Flink?
[Architecture figure: the Flink master consumes historic data (HDFS, JDBC, ...) as well as real-time data streams and event logs (Kafka, RabbitMQ, ...); workloads include ETL, graphs, machine learning, relational processing, and low-latency windowed aggregations]
Cornerpoints of Flink Design 42
Robust Algorithms on Managed Memory / Pipelined Execution of Batch Programs
 Better shuffle performance
 No OutOfMemory errors
 Scales to very large JVMs
 Efficient and robust processing
Flexible Data Streaming Engine
 Low-latency stream processing
 Highly flexible windows
Native Iterations
 Very fast graph processing
 Stateful iterations for ML
High-level APIs, beyond key/value pairs
 Java/Scala/Python (upcoming)
 Relational-style optimizer
 Graphs / machine learning
 Streaming ML (coming)
 Scales to very large groups
Active Library Development
Defining windows in Flink
 Trigger policy
• when to trigger the computation on the current window
 Eviction policy
• when data points should leave the window
• defines the window width/size
 E.g., a count-based policy
• evict when #elements > n
• start a new window every n-th element
 Built-in: Count, Time, Delta policies (see the sketch below)
43
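For example, count-based eviction combined with a count-based trigger, mirroring the Time helper used on the earlier API slide. A sketch assuming the 0.9-era Count windowing helper; the data is illustrative:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.helper.Count

case class Word(word: String, frequency: Int)

object CountWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[Word] =
      env.fromElements(Word("flink", 1), Word("streams", 1))

    words
      .window(Count.of(100)) // eviction: keep the last 100 elements
      .every(Count.of(10))   // trigger: fire every 10 new elements
      .groupBy("word").sum("frequency")
      .print()

    env.execute("count windows")
  }
}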
Streaming checkpoints 44–46
[Figure-only slides]
Program optimization
47
A simple program 48

val orders = …
val lineitems = …
val filteredOrders = orders
  .filter(o => dataFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)
val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice))
val priceSums = lineitemsOfOrders
  .groupBy("orderDate").sum("extdPrice")
Two execution plans 49
(Same comparison as slide 20: a broadcast-forward plan vs. a hash-partitioned plan; the best plan depends on the relative sizes of the input files.)
Examples of optimization
 Task chaining
• Coalesce map/filter/etc tasks
 Join optimizations
• Broadcast/partition, build/probe side, hash or sort-merge (see the sketch below)
 Interesting properties
• Re-use partitioning and sorting for later operations
 Automatic caching
• E.g., for iterations
50
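Beyond the automatic choices, join strategies can be hinted by hand when input sizes are known. A minimal sketch with the DataSet API's size-hint join (data is illustrative):

import org.apache.flink.api.scala._

object JoinHints {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val big  = env.fromElements((1, "a"), (2, "b"))
    val tiny = env.fromElements((1, "x"))

    // Declare the right-hand side as small: the optimizer broadcasts
    // `tiny` and uses it as the hash-table build side.
    big.joinWithTiny(tiny).where(0).equalTo(0).print()
  }
}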
Visualization 51
Visualization tools 52–54
[Figure-only slides: dataflow graph visualizations]
Batch processing 2: Batch on Streaming 55
Batch Pipelines 56
Operators Execution Overlaps 58
Memory Management 59–60
Smooth out-of-core performance 61 (same chart as slide 23)
Optimizer 62

Editor's Notes

  • #2 40 minutes
  • #4 Flink is an entire software stack. The heart: the streaming dataflow engine; think of programs as operators and data flows. Kappa architecture: run batch programs on a streaming system. Table API: logical representation, SQL-style. SAMOA: "on-line learners"
  • #5 Toy program: native transitive closure. Type extraction: the types that go in and out of each operator
  • #6 Flink is an analytical system. Streaming topology: real-time, low latency. "Native": built-in support in the system, no working around, no black box. Next slide: define native by some "non-native" examples
  • #7 Used for machine learning: run the same job over the data multiple times to come up with parameters for an ML model. This is how you do it when treating the engine as a black box
  • #8 If you only have a batch processor: do a lot of small batch jobs. LIMITATION: state across the small jobs (batches)
  • #9 Flink is an analytical system. Streaming topology: real-time, low latency. "Native": built-in support in the system, no working around, no black box. Next slide: define native by some "non-native" examples
  • #10 Corner points / requirements for Flink: keep data in motion, avoid materialization; even though it's a streaming runtime, have special paths for batch (OPTIMIZER, CHECKPOINTING); make the system aware of cyclic data flows, in a controlled way; allow operators to have some state, in a controlled way (DELTA-ITERATIONS), relaxing the "traditional" batch assumption; Flink runs in the JVM, but we want control over memory, not rely on GC
  • #11 Explain Flink by use case
  • #13 Pipelined execution: the logical way to go for low latency; no synchronization barriers, records keep flowing. Streaming shuffle (for example w/ hash code), push model; maintain state inside long-lived operators (!= mini batch)
  • #14 Nice, fluent APIs known from the batch world; window definitions
  • #15 Low-overhead snapshots using "batched" snapshots; exactly-once processing guarantees (without doing mini batches). How does it work: periodically push barriers through the streams; when a barrier reaches an operator, snapshot its state; when the barrier reaches the sinks, a checkpoint is completed (secured); multiple parallel checkpoints chop the stream into generations (pre-checkpointed, post-checkpointed)
  • #16 Structure, different title
  • #18 Measure the effect of pipeline parallelism; blocking happens in operators (join build side)
  • #22 In Flink, operators are running at the same time → need to control memory
  • #24 Example: hash join; robust in memory, graceful behavior
  • #26 Needed for machine learning
  • #27 Function that encapsulates the transformation ("Hadoop job"); deploy this once, keep it running across iterations (you can also keep state); allows feeding data back to the beginning
  • #28 1 TB of input data, many terabytes of intermediate data; 40-machine cluster @ Google Compute
  • #29 SVM = supervised learning; ALS = recommendation
  • #32 Keep mutable state in a controlled way by having a hash map locally on each machine; great documentation on the Flink website
  • #34 Weakly connected components, page rank, label propagation,
  • #41 Dev list: 300-400 messages/month. Record 1000 messages on
  • #56 Structure, different title
  • #57 Visualization of the dataflow graph in Flink. Explain: sources, maps, binary operators (join, …). This is an actual example of a Flink job we've seen from an industry user: candidate flights out of all available flights
  • #58 You need windows in stream programs: grouping, for example, on an infinite stream is impossible. Pipelined data, e.g., needed for low latency; in batch: pipelined or blocking, if we are optimizing (for example, we don't want all operators online at the same time)
  • #59 Overlapping operators: operators start as soon as data is available; join and co-group start early (co-group starts sorting incoming data)
  • #60 In Flink, operators are running at the same time → need to control memory
  • #62 Example: hash join; robust in memory, graceful behavior