Introduction to stream processing with Apache Flink
Seif Haridi
KTH/SICS
Stream processing
Why streaming
Data availability has shifted over time: from the data warehouse (around 2000), to batch processing (2008), to streaming (2015). The questions are: which data? when? who?
3 Parts of a Streaming Infrastructure
A streaming infrastructure has three parts: gathering (sensors, transaction logs, server logs, …), a broker, and analysis.
Example: Bouygues Telecom
• Network and subscriber data are gathered
• Added to the broker in raw format
• Transformed and analyzed by the streaming engine
• Stored back for further processing
http://data-artisans.com/flink-at-bouygues.html
What is Apache Flink?
1 year of Flink - code
(Chart: growth of the Flink code base from April 2014 to April 2015.)
What is Apache Flink
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Expressive and rich APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
(Example dataflow graph with Source, Map, Filter, Join, Reduce, and Iterate operators feeding a Sink.)
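For orientation, here is a minimal, self-contained program skeleton (not from the slides; the element values and job name are illustrative) showing how a Flink Scala job obtains an execution environment, builds a dataflow, and submits it:

import org.apache.flink.streaming.api.scala._

object MinimalJob {
  def main(args: Array[String]): Unit = {
    // Where the job runs (local sandbox or cluster) is decided at submission time
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Toy source -> transformation -> sink
    env.fromElements(1, 2, 3, 4, 5)
      .map(_ * 2)
      .print()

    // Nothing runs until the dataflow is submitted
    env.execute("minimal streaming job")
  }
}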
Flink Stack
The Flink stack: libraries such as Gelly, Table, ML, and SAMOA, plus compatibility layers (Hadoop M/R, Storm, Zeppelin), sit on top of the DataSet (Java/Scala) and DataStream (Java/Scala) APIs. Programs are translated into dataflows executed by the streaming dataflow runtime, which can run locally, on a cluster (YARN, Tez), or embedded.
Stream Processing with Flink
What is Flink Streaming
▪ Native, low-latency stream processor
▪ Expressive functional API
▪ Flexible operator state, iterations, windows
▪ Exactly-once processing semantics
Native vs non-native streaming
Non-native streaming: a stream discretizer cuts the stream into small batches and issues a job per batch.

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming: long-standing operators process each record as it arrives.

while (true) {
  // process next record
}
Stream processing in Flink
▪ Continuous streaming model
▪ Low processing latency
▪ O(1) state updates per operator
▪ Exactly-once semantics for operator state
DataStream API
Overview of the API
Windowing Semantics
• Trigger and eviction policies
• window(<eviction>).every(<trigger>) (see the sketch below)
• Built-in policies:
  – Time: Time.of(length, TimeUnit / custom timestamp)
    window(Time.of(20, SECONDS))
  – Count: Count.of(windowSize)
    window(Count.of(20)).every(Count.of(10))
  – Delta: Delta.of(threshold, distance function, start value)
    window(Delta.of(0.1, priceDistanceFun, initPrice))
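The word-count slide that follows uses a time window; as a complement, here is a minimal sketch (the input line is hypothetical) of the same pipeline expressed with the count-based policies listed above: keep the last 100 words per key and re-evaluate after every 10 new words.

case class Word(word: String, frequency: Int)

val words: DataStream[Word] = env.fromElements("to be or not to be")
  .flatMap(_.split(" "))
  .map(Word(_, 1))

words.keyBy("word")
  .window(Count.of(100))   // eviction policy: keep the last 100 words
  .every(Count.of(10))     // trigger policy: re-evaluate every 10 words
  .sum("frequency")
  .print()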
Word count in Batch and Streaming
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .keyBy("word").window(Time.of(5, SECONDS))
  .every(Time.of(1, SECONDS)).sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

▪ Stream of stocks
▪ Trigger a warning if a price fluctuates by 5%
▪ Count the number of warnings per stock in a 30-second (tumbling) window
▪ Do it continuously

(Pipeline: Stock Stream → keyBy symbol → Delta 5% of price → Warning stream → keyBy symbol → 30 sec window → Sum → Count.)
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

Use the delta policy to create change warnings (supporting definitions are sketched below):

case class Count(symbol: String, count: Int)
val defaultPrice = StockPrice("", 1000)

val priceWarnings = stockStream.keyBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

Count the number of warnings per stock every half a minute:

val warningPerStock = priceWarnings.flatten()
  .map(Count(_, 1))
  .keyBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
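The code above relies on a few definitions that the slide does not show. A hypothetical, minimal version of them (names, types, and signatures are assumptions, not the original code) could look like this:

import org.apache.flink.util.Collector

// Record type of the incoming stock stream (assumed shape)
case class StockPrice(symbol: String, price: Double)

// Distance function for the Delta policy: relative price change between two points
def priceChange(oldPoint: StockPrice, newPoint: StockPrice): Double =
  Math.abs(newPoint.price / oldPoint.price - 1)

// Window mapper: emit the affected symbol once per fired window
def sendWarning(prices: Iterable[StockPrice], out: Collector[String]): Unit =
  prices.headOption.foreach(p => out.collect(p.symbol))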
Iterative stream processing
Motivation
▪ Many applications require cyclic streams
▪ Machine learning applications (parallel model training, evaluation)

Iterations in Flink Streaming (see the sketch below)
▪ Native support for cyclic dataflows
▪ Integrated with the functional API
▪ High performance and expressivity

(Loop example: Input → Train → Evaluate, with feedback into the loop.)
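A minimal sketch of a cyclic stream (the logic is illustrative and assumes the step-function form of DataStream.iterate in the Scala API): one step transforms the records, part of the result is fed back into the loop, and the rest is forwarded downstream.

val input: DataStream[Long] = env.fromElements(5L, 10L, 15L)

val output = input.iterate { iteration =>
  val step     = iteration.map(_ - 1)    // one round of processing (e.g. a training step)
  val feedback = step.filter(_ > 0)      // records that need another round
  val done     = step.filter(_ <= 0)     // records that leave the loop
  (feedback, done)                       // (feedback edge, forward output)
}

output.print()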
Fault tolerance
Exactly-once processing for operator state

▪ Based on consistent global snapshots
▪ Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
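A minimal sketch of switching the mechanism on (the interval is illustrative): checkpointing is enabled on the streaming execution environment, and the runtime then periodically draws consistent global snapshots of operator state that recovery can roll back to.

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent snapshot of all operator state every 5 seconds
env.enableCheckpointing(5000)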
Fault tolerance
▪ Checkpointing and recovery of operator state is very fast
  • Data processing does not block
▪ Executions based on CPU/operator time are not idempotent
▪ Other execution modes are based on timestamps of the input streams (event/ingress time)
  • Allows idempotent executions
  • End-to-end exactly-once semantics
  • In Flink version 0.10 (see the sketch below)
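A minimal sketch of selecting the timestamp-based execution mode, assuming the time-characteristic setting that appears with the 0.10 line of the DataStream API: windows are then driven by record timestamps (event time) or by the time records enter the system (ingress/ingestion time) rather than by operator time.

import org.apache.flink.streaming.api.TimeCharacteristic

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Use timestamps carried by the records; TimeCharacteristic.IngestionTime is the ingress-time variant
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)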
Streaming in Apache Flink
▪ True streaming over a stateful distributed dataflow engine
▪ Expressive streaming API in Java/Scala
  • Flexible window semantics
  • Iterative computation
▪ Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery
Special Thanks to
Gyula Fora, SICS
Paris Carbone, KTH
Kostas Tzoumas, Data Artisans
Stephan Ewen, Data Artisans
Volker Markl, TU-Berlin
