4. 3 Parts of a Streaming Infrastructure
Gathering → Broker → Analysis
Data sources: sensors, transaction logs, server logs, …
Data Science Summit 2015 S. Haridi
5. Example: Bouygues Telecom
• Network and subscriber data gathered
• Added to Broker in raw format
• Transformed and analyzed by streaming engine
• Stored back for further processing
http://data-artisans.com/flink-at-bouygues.html
7. 1 year of Flink - code
[Figure: Flink codebase, April 2014 vs. April 2015]
8. What is Apache Flink
Distributed Data Flow Processing System
• Focused on large-scale data analytics
• Unified real-time stream and batch processing
• Expressive and rich APIs in Java / Scala (+ Python)
• Robust and fast execution backend
[Diagram: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate, and Sink operators]
9. Flink Stack
[Stack diagram]
Libraries and integrations: Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R, Storm, Zeppelin
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Cluster, Yarn, Tez, Embedded
11. What is Flink Streaming
Native, low-latency stream processor
Expressive functional API
Flexible operator state, iterations, windows
Exactly-once processing semantics
12. Native vs non-native streaming
Non-native streaming: a stream discretizer turns the stream into a sequence of batch jobs
while (true) {
  // get next few records
  // issue batch computation
}
Native streaming: long-standing operators
while (true) {
  // process next record
}
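The two loops above can be contrasted in plain Scala. This is a minimal sketch that uses only the standard library, not Flink; the names `StreamingModels`, `RunningSum`, `microBatch`, and `perRecord` are made up for illustration. Both variants compute the same running sum, but the micro-batch variant only exposes a result once per batch.

```scala
object StreamingModels {
  // Hypothetical operator logic: a running sum over all values seen so far.
  final class RunningSum {
    private var total: Long = 0L
    def add(value: Int): Long = { total += value; total }
  }

  // Non-native (micro-batch): gather a few records, then issue one batch computation.
  // Only the total at the end of each batch becomes visible.
  def microBatch(records: Iterator[Int], batchSize: Int): List[Long] = {
    val state = new RunningSum
    records.grouped(batchSize).map { batch =>
      batch.foldLeft(0L)((_, r) => state.add(r))
    }.toList
  }

  // Native: a long-standing operator processes each record as it arrives.
  def perRecord(records: Iterator[Int]): List[Long] = {
    val state = new RunningSum
    records.map(r => state.add(r)).toList
  }
}
```

With the input 1, 2, 3, 4, 5 and a batch size of 2, `perRecord` yields one result per record while `microBatch` yields one result per batch, which is where the extra latency of the discretized model comes from.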
13. Stream processing in Flink
Continuous streaming model
Low processing latency
O(1) state updates per operator
Exactly-once semantics for stateful operators
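As an illustration of constant-time state updates, the sketch below keeps one counter per key in a hash map, so each incoming record touches only its own entry. It is plain Scala, not Flink's state API; `KeyedCounter`, `process`, and `snapshot` are hypothetical names.

```scala
import scala.collection.mutable

// Hypothetical stateful operator: one counter per key.
final class KeyedCounter {
  private val counts = mutable.Map.empty[String, Long]

  // Each record updates only its own key: expected O(1) work,
  // independent of how many records were processed before.
  def process(key: String): Long = {
    val updated = counts.getOrElse(key, 0L) + 1L
    counts.update(key, updated)
    updated
  }

  // Immutable copy of the current state, e.g. for inspection.
  def snapshot: Map[String, Long] = counts.toMap
}
```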
17. Word count in Batch and Streaming
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .keyBy("word").window(Time.of(5,SECONDS))
  .every(Time.of(1,SECONDS)).sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
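To make the "5-second window, evaluated every second" idea from the streaming word count concrete, here is a simplified simulation in plain Scala. It does not use Flink and does not reproduce Flink's exact window policies; the timestamps, `SlidingWordCount`, `countsAt`, and `slide` are assumptions made for this sketch.

```scala
object SlidingWordCount {
  case class Word(word: String, frequency: Int)

  // events: (arrival time in seconds, line of text).
  // Counts the words whose timestamp falls in the window (now - windowSec, now].
  def countsAt(events: Seq[(Long, String)], now: Long, windowSec: Long): Map[String, Int] =
    events
      .filter { case (ts, _) => ts > now - windowSec && ts <= now }
      .flatMap { case (_, line) =>
        line.split(" ").filter(_.nonEmpty).map(w => Word(w, 1)).toSeq
      }
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Re-evaluates the window every `everySec` seconds up to `until`.
  def slide(events: Seq[(Long, String)], windowSec: Long, everySec: Long,
            until: Long): Seq[(Long, Map[String, Int])] =
    (everySec to until by everySec).map(t => t -> countsAt(events, t, windowSec))
}
```

Calling `SlidingWordCount.slide(events, 5L, 1L, until)` produces one count map per second, each covering the preceding five seconds, so a word contributes to up to five consecutive results.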
18. Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Stream of stocks
• Trigger warning if price fluctuates by 5%
• Count the number of warnings per stock in a 30-second (tumbling) window
• Do it continuously
[Diagram: Stock Stream → keyBy symbol → Delta 5% of price → Warning → Count → keyBy symbol → 30 sec window → Sum]
Stream types along the pipeline: Data Stream, Keyed Stream, Windowed Stream
19. Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Use delta policy to create price change warnings:
case class Count(symbol: String, count: Int)
val defaultPrice = StockPrice("", 1000)
val priceWarnings = stockStream.keyBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

Count the number of warnings per stock every half a minute:
val warningPerStock = priceWarnings.flatten()
  .map(Count(_, 1))
  .keyBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
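The behaviour of this pipeline can be approximated without Flink. The sketch below is plain Scala under stated assumptions: `priceChange` is taken to be the relative price difference, the reference price is the first price seen per symbol (the slide uses a `defaultPrice` instead), and the timestamps are invented test inputs. `PriceWarnings`, `warnings`, and `countPerWindow` are hypothetical names.

```scala
import scala.collection.mutable

object PriceWarnings {
  case class StockPrice(symbol: String, price: Double)

  // Assumed meaning of priceChange: relative difference between two prices.
  def priceChange(old: Double, current: Double): Double =
    math.abs(current - old) / old

  // Delta policy: emit (timestamp, symbol) whenever the price has moved more than
  // `threshold` relative to the reference price, then reset the reference.
  def warnings(prices: Seq[(Long, StockPrice)], threshold: Double): Seq[(Long, String)] = {
    val reference = mutable.Map.empty[String, Double]
    prices.flatMap { case (ts, p) =>
      reference.get(p.symbol) match {
        case Some(ref) if priceChange(ref, p.price) > threshold =>
          reference(p.symbol) = p.price
          Some(ts -> p.symbol)
        case Some(_) => None
        case None =>
          reference(p.symbol) = p.price // first price becomes the reference
          None
      }
    }
  }

  // Tumbling window: bucket warnings by (timestamp / windowSec) and count per symbol.
  def countPerWindow(ws: Seq[(Long, String)], windowSec: Long): Map[(Long, String), Int] =
    ws.groupBy { case (ts, sym) => (ts / windowSec, sym) }
      .map { case (k, v) => k -> v.size }
}
```

Feeding the output of `warnings(prices, 0.05)` into `countPerWindow(_, 30L)` mirrors the two stages on the slide: a data-driven (delta) window followed by a time-driven (tumbling) window.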
[Diagram: Stock Stream → keyBy symbol → Delta 5% of price → Warning → Count → keyBy symbol → 30 sec window → Sum]
20. Iterative stream processing
Motivation
• Many applications require cyclic streams
• Machine learning applications (parallel model training, evaluation)
Iterations in Flink Streaming
• Native support for cyclic dataflows
• Integrated with functional API
• High performance and expressivity
[Diagram: cyclic dataflow with Input, Train, and Evaluate]
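A cyclic stream can be pictured as a loop body whose output is either fed back to its own input or released downstream. The following is a sequential, single-machine sketch in plain Scala, not Flink's iteration API; `FeedbackLoop`, `iterate`, `step`, and `feedBack` are hypothetical names.

```scala
import scala.collection.mutable

object FeedbackLoop {
  // step: one pass through the loop body (hypothetical user function).
  // feedBack: decides whether a result re-enters the loop or leaves as output.
  def iterate[A](input: Seq[A])(step: A => A)(feedBack: A => Boolean): Seq[A] = {
    val queue = mutable.Queue.empty[A]
    queue ++= input
    val output = mutable.ListBuffer.empty[A]
    while (queue.nonEmpty) {
      val next = step(queue.dequeue())
      if (feedBack(next)) queue.enqueue(next) // cyclic edge back to the loop head
      else output += next                     // record leaves the iteration
    }
    output.toList
  }
}
```

For example, `FeedbackLoop.iterate(Seq(1, 5, 9))(_ + 3)(_ < 10)` keeps adding 3 to each record until it reaches at least 10; records leave the loop in the order they finish, not in input order.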
22. Exactly-once processing for operator state
Based on consistent global snapshots
Low runtime overhead, stateful exactly-once semantics
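The core idea behind snapshot-based recovery can be shown for a single operator: a snapshot pairs the source read position with the operator state at that position, so restoring the snapshot and replaying from that position counts every record exactly once in the state. This plain-Scala sketch is a deliberate simplification and does not implement Flink's distributed snapshot algorithm; `SnapshotRecovery`, `Snapshot`, `run`, and `failAt` are hypothetical names.

```scala
object SnapshotRecovery {
  // A snapshot pairs the source offset with the operator state at that offset.
  case class Snapshot(offset: Int, counts: Map[String, Int])

  // Processes records from start.offset onward and snapshots every `interval` records.
  // If failAt is set, a crash is simulated just before that offset: the in-flight
  // state is lost and only the last snapshot survives.
  // Returns (latest durable snapshot, whether the run completed).
  def run(records: Vector[String], start: Snapshot, interval: Int,
          failAt: Option[Int]): (Snapshot, Boolean) = {
    var counts = start.counts
    var last = start
    var offset = start.offset
    while (offset < records.length) {
      if (failAt.contains(offset)) return (last, false)
      val key = records(offset)
      counts = counts.updated(key, counts.getOrElse(key, 0) + 1)
      offset += 1
      if (offset % interval == 0) last = Snapshot(offset, counts)
    }
    (Snapshot(offset, counts), true)
  }
}
```

Running once with a simulated failure and then resuming from the returned snapshot ends in the same state as a failure-free run, because records after the snapshot offset are replayed and records before it are already reflected in the restored state.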
24. Fault tolerance
Checkpointing and recovery of operator state is very fast
• Data processing does not block
Executions based on CPU/operator time are not idempotent
Other execution modes are based on timestamps of input streams (Event/Ingress time)
• Allows idempotent executions
• End-to-end exactly-once semantics
• In Flink version 0.10
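Why timestamp-based execution is idempotent while operator-time execution is not can be seen by assigning windows in two ways. The sketch below is plain Scala with invented inputs, not Flink code; `EventTimeWindows`, `byEventTime`, and `byArrivalTime` are hypothetical names, and the arrival times stand in for the operator's wall clock.

```scala
object EventTimeWindows {
  case class Event(key: String, eventTime: Long)

  // Window chosen from the timestamp carried by the record itself:
  // replaying or reordering the input gives the same result.
  def byEventTime(events: Seq[Event], windowSec: Long): Map[(Long, String), Int] =
    events.groupBy(e => (e.eventTime / windowSec, e.key))
      .map { case (k, v) => k -> v.size }

  // Window chosen from the operator's clock at arrival:
  // a replay after recovery arrives later and lands in different windows.
  def byArrivalTime(events: Seq[(Event, Long)], windowSec: Long): Map[(Long, String), Int] =
    events.groupBy { case (e, arrival) => (arrival / windowSec, e.key) }
      .map { case (k, v) => k -> v.size }
}
```

Replaying the same events with later arrival times changes the output of `byArrivalTime` but not of `byEventTime`, which is the property that makes end-to-end exactly-once results possible.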
25. Streaming in Apache Flink
True streaming over stateful distributed dataflow engine
Expressive Streaming API in Java/Scala
• Flexible window semantics
• Iterative computation
Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery
26. Special Thanks to
Gyula Fora, SICS
Paris Carbone, KTH
Kostas Tzoumas, Data Artisans
Stephan Ewen, Data Artisans
Volker Markl, TU-Berlin