Introduction to stream processing with Apache Flink
Seif Haridi
KTH/SICS
Stream processing
Why streaming
Data availability has shifted over time: from the data warehouse (around 2000), to batch processing (2008), to streaming (2015). The questions are: which data? when? who?
3 Parts of a Streaming Infrastructure
A streaming infrastructure has three parts: gathering (sensors, transaction logs, server logs, …), a broker, and analysis.
Example: Bouygues Telecom
• Network and subscriber data are gathered
• Added to the broker in raw format
• Transformed and analyzed by the streaming engine
• Stored back for further processing
http://data-artisans.com/flink-at-bouygues.html
What is Apache Flink?
1 year of Flink - code
(Chart: growth of the Flink code base from April 2014 to April 2015.)
What is Apache Flink
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Expressive and rich APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
(Example dataflow graph with Source, Map, Filter, Join, Reduce, and Iterate operators feeding a Sink.)
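For orientation, here is a minimal, self-contained program skeleton (not from the slides; the element values and job name are illustrative) showing how a Flink Scala job obtains an execution environment, builds a dataflow, and submits it:

import org.apache.flink.streaming.api.scala._

object MinimalJob {
  def main(args: Array[String]): Unit = {
    // Where the job runs (local sandbox or cluster) is decided at submission time
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Toy source -> transformation -> sink
    env.fromElements(1, 2, 3, 4, 5)
      .map(_ * 2)
      .print()

    // Nothing runs until the dataflow is submitted
    env.execute("minimal streaming job")
  }
}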
Flink Stack
The Flink stack: libraries such as Gelly, Table, ML, and SAMOA, plus compatibility layers (Hadoop M/R, Storm, Zeppelin), sit on top of the DataSet (Java/Scala) and DataStream (Java/Scala) APIs. Programs are translated into dataflows executed by the streaming dataflow runtime, which can run locally, on a cluster (YARN, Tez), or embedded.
Stream Processing with Flink
What is Flink Streaming
▪ Native, low-latency stream processor
▪ Expressive functional API
▪ Flexible operator state, iterations, windows
▪ Exactly-once processing semantics
Native vs non-native streaming
Non-native streaming: a stream discretizer cuts the stream into small batches and issues a job per batch.

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming: long-standing operators process each record as it arrives.

while (true) {
  // process next record
}
Stream processing in Flink
▪ Continuous streaming model
▪ Low processing latency
▪ O(1) state updates per operator
▪ Exactly-once semantics for operator state
DataStream API
Overview of the API
Windowing Semantics
• Trigger and eviction policies
• window(<eviction>).every(<trigger>) (see the sketch below)
• Built-in policies:
  – Time: Time.of(length, TimeUnit / custom timestamp)
    window(Time.of(20, SECONDS))
  – Count: Count.of(windowSize)
    window(Count.of(20)).every(Count.of(10))
  – Delta: Delta.of(threshold, distance function, start value)
    window(Delta.of(0.1, priceDistanceFun, initPrice))
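The word-count slide that follows uses a time window; as a complement, here is a minimal sketch (the input line is hypothetical) of the same pipeline expressed with the count-based policies listed above: keep the last 100 words per key and re-evaluate after every 10 new words.

case class Word(word: String, frequency: Int)

val words: DataStream[Word] = env.fromElements("to be or not to be")
  .flatMap(_.split(" "))
  .map(Word(_, 1))

words.keyBy("word")
  .window(Count.of(100))   // eviction policy: keep the last 100 words
  .every(Count.of(10))     // trigger policy: re-evaluate every 10 words
  .sum("frequency")
  .print()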
Word count in Batch and Streaming
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .keyBy("word").window(Time.of(5, SECONDS))
  .every(Time.of(1, SECONDS)).sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

▪ Stream of stocks
▪ Trigger a warning if a price fluctuates by 5%
▪ Count the number of warnings per stock in a 30-second (tumbling) window
▪ Do it continuously

(Pipeline: Stock Stream → keyBy symbol → Delta 5% of price → Warning stream → keyBy symbol → 30 sec window → Sum → Count.)
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

Use the delta policy to create change warnings (supporting definitions are sketched below):

case class Count(symbol: String, count: Int)
val defaultPrice = StockPrice("", 1000)

val priceWarnings = stockStream.keyBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

Count the number of warnings per stock every half a minute:

val warningPerStock = priceWarnings.flatten()
  .map(Count(_, 1))
  .keyBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
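The code above relies on a few definitions that the slide does not show. A hypothetical, minimal version of them (names, types, and signatures are assumptions, not the original code) could look like this:

import org.apache.flink.util.Collector

// Record type of the incoming stock stream (assumed shape)
case class StockPrice(symbol: String, price: Double)

// Distance function for the Delta policy: relative price change between two points
def priceChange(oldPoint: StockPrice, newPoint: StockPrice): Double =
  Math.abs(newPoint.price / oldPoint.price - 1)

// Window mapper: emit the affected symbol once per fired window
def sendWarning(prices: Iterable[StockPrice], out: Collector[String]): Unit =
  prices.headOption.foreach(p => out.collect(p.symbol))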
Iterative stream processing
Motivation
▪ Many applications require cyclic streams
▪ Machine learning applications (parallel model training, evaluation)

Iterations in Flink Streaming (see the sketch below)
▪ Native support for cyclic dataflows
▪ Integrated with the functional API
▪ High performance and expressivity

(Loop example: Input → Train → Evaluate, with feedback into the loop.)
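A minimal sketch of a cyclic stream (the logic is illustrative and assumes the step-function form of DataStream.iterate in the Scala API): one step transforms the records, part of the result is fed back into the loop, and the rest is forwarded downstream.

val input: DataStream[Long] = env.fromElements(5L, 10L, 15L)

val output = input.iterate { iteration =>
  val step     = iteration.map(_ - 1)    // one round of processing (e.g. a training step)
  val feedback = step.filter(_ > 0)      // records that need another round
  val done     = step.filter(_ <= 0)     // records that leave the loop
  (feedback, done)                       // (feedback edge, forward output)
}

output.print()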
Fault tolerance
Exactly-once processing for operator state

▪ Based on consistent global snapshots
▪ Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
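A minimal sketch of switching the mechanism on (the interval is illustrative): checkpointing is enabled on the streaming execution environment, and the runtime then periodically draws consistent global snapshots of operator state that recovery can roll back to.

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent snapshot of all operator state every 5 seconds
env.enableCheckpointing(5000)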
Fault tolerance
▪ Checkpointing and recovery of operator state is very fast
  • Data processing does not block
▪ Executions based on CPU/operator time are not idempotent
▪ Other execution modes are based on timestamps of the input streams (event/ingress time)
  • Allows idempotent executions
  • End-to-end exactly-once semantics
  • In Flink version 0.10 (see the sketch below)
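A minimal sketch of selecting the timestamp-based execution mode, assuming the time-characteristic setting that appears with the 0.10 line of the DataStream API: windows are then driven by record timestamps (event time) or by the time records enter the system (ingress/ingestion time) rather than by operator time.

import org.apache.flink.streaming.api.TimeCharacteristic

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Use timestamps carried by the records; TimeCharacteristic.IngestionTime is the ingress-time variant
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)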
Streaming in Apache Flink
▪ True streaming over a stateful distributed dataflow engine
▪ Expressive streaming API in Java/Scala
  • Flexible window semantics
  • Iterative computation
▪ Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery
Special Thanks to
Gyula Fora, SICS
Paris Carbone, KTH
Kostas Tzoumas, Data Artisans
Stephan Ewen, Data Artisans
Volker Markl, TU-Berlin
