Introduction to Apache Flink
A native Streaming Data Flow Engine
Stefan Papp
21.10.2015 – The day of Marty McFly's arrival
#Streaming
Streaming is the biggest thing since Hadoop
#Stream vs Batch
Batch Processing vs. Streaming

Streaming: process data immediately, at event time.
Consequences:
• Data is processed immediately
• Processing runs much more often, on small units of data

Batch Processing: process collected data at a scheduled time, or once a sufficient amount of data has accumulated.
Consequences:
• More transactions are processed at once, in a single run
• Higher processing time per run
#Stream
Streaming – The challenges in the past

Past:
• Insufficient technologies for streaming; the focus was on batch
• Some technologies were not real streaming, only micro-batches
• Either batch or streaming, but no engine that could do both

Now:
• Technologies have matured
• Streaming is in high demand in business
Streaming Solutions
#Streaming
The focus moves from storage to processing
Technology Stack

• Storage Layer
• General-Purpose Processing Engine
• Specialized layers on top of the engine: SQL Engine, Abstraction Engine, ML, Graph, Streaming
Technology Stack with Technologies

• Storage Layer: Hadoop, S3, ...
• General-Purpose Processing Engine: Flink, Spark
• SQL Engine: Hive, SparkSQL
• Abstraction Engine: Cascading, Pig
• ML: FlinkML, MLLib
• Graph: Gelly, GraphX
• Streaming: Flink, Spark Streaming
Old Style Batch Processing: MapReduce

[Diagram: a client drives a chain of MapReduce steps, one job per step.]

for (int i = 0; i < maxIterations; i++) {
    // Execute one MapReduce job per iteration;
    // each job reads from and writes to stable storage
}
Optimized Execution

// Native transitive closure as a single iterative Flink program
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
[Diagram: Program → Dataflow Graph → execution. Pre-flight on the client performs optimization, type extraction, and collects dataflow metadata; the master handles task scheduling, deploys operators, and tracks intermediate results on the workers. The example plan scans orders.tbl (DataSource → Filter → Map) and lineitem.tbl (DataSource), hash-partitions both inputs on field [0], joins them with a hybrid hash join (build hash table / probe), and finishes with a sort-based GroupReduce. The dataflow graph is independent of whether it is a batch or streaming job.]
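For reference, a minimal self-contained version of this program might look as follows, reusing the Path case class from the slide; the ExecutionEnvironment setup and the sample edge data are assumptions added for illustration:

import org.apache.flink.api.scala._

object TransitiveClosure {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Illustrative edge set: 1 -> 2 -> 3 -> 4 (assumed sample data)
    val edges: DataSet[Path] = env.fromElements(
      Path(1L, 2L), Path(2L, 3L), Path(3L, 4L))

    // Extend known paths by one hop per iteration, at most 10 times
    val tc = edges.iterate(10) { paths =>
      paths
        .join(edges).where("to").equalTo("from") {
          (path, edge) => Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
    }

    tc.print()  // prints all discovered paths
  }
}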
Old Style Streaming (Micro Batches)

[Diagram: a stream discretizer chops the stream into a sequence of small jobs.]

while (true) {
    // get the next few records
    // issue a batch job over this micro-batch
}
#Streaming
Streaming Topology

[Diagram: two data sources feed spouts, which emit into a graph of bolts that finally writes to a target; the whole graph forms the topology.]
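For comparison, the same shape (sources feeding a graph of transformations that ends in a sink) can be sketched with Flink's DataStream API. This is a minimal sketch; the hostnames, ports, and job name are placeholder assumptions:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Two data sources (the role of the spouts)
val source1: DataStream[String] = env.socketTextStream("host1", 9999)
val source2: DataStream[String] = env.socketTextStream("host2", 9999)

// Transformations (the role of the bolts)
val processed = source1.union(source2)
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// Target (the sink)
processed.print()

env.execute("Topology sketch")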
#Streaming
Apache Flink is a native streaming GPPE (general-purpose processing engine)
The Flink Ecosystem in a Nutshell

• APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
• Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, MRQL, Cascading (WiP), Dataflow, Dataflow (WiP)
• Core: streaming dataflow runtime
• Deployment modes: Local, Remote, YARN, Tez, Embedded
#workload
Native workload support

Flink natively supports:
• Streaming topologies
• Long batch pipelines
• Machine learning at scale
• Graph analysis
Flink Engine – Core Features

1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows (see the iteration sketch below)
3. Allow some mutable state
4. Operate on managed memory
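To illustrate point 2, here is a sketch of a cyclic DataStream iteration in the Scala API, modeled on the classic decrement-until-zero toy example; all names and the value range are illustrative:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Numbers enter the loop and are decremented until they reach zero
val numbers = env.generateSequence(0, 1000)

val iterated = numbers.iterate { iteration =>
  val minusOne = iteration.map(_ - 1)
  val feedback = minusOne.filter(_ > 0)   // fed back into the loop
  val output   = minusOne.filter(_ <= 0)  // leaves the loop
  (feedback, output)
}

iterated.print()
env.execute("Iteration sketch")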
#Streaming
3 Parts of a Streaming Infrastructure

Gathering → Broker → Analysis

Typical inputs on the gathering side: sensors, transaction logs, server logs, …
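A sketch of the broker-to-analysis hand-off, assuming Kafka as the broker; the connector class (FlinkKafkaConsumer082, matching the Kafka 0.8.2 connector of early Flink releases) and the topic and connection settings are assumptions:

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

val props = new Properties()
props.setProperty("bootstrap.servers", "broker:9092")
props.setProperty("zookeeper.connect", "zookeeper:2181")
props.setProperty("group.id", "analysis-demo")

// Analysis side: consume events that the gathering side pushed to the broker
val events: DataStream[String] = env.addSource(
  new FlinkKafkaConsumer082[String]("events", new SimpleStringSchema(), props))

events.print()
env.execute("Broker to analysis")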
Batch on Streaming

• Batch programs are a special kind of streaming program.

                   Streaming Programs    Batch Programs
  Stream type      Infinite streams      Finite streams
  Scope            Stream windows        Global view
  Data exchange    Pipelined             Pipelined or blocking
Expressive APIs

case class Word(word: String, frequency: Int)

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
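Note that, apart from the source, the batch program is the streaming program without the window clause: the same transformations run over a finite stream with a global view.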
Table API

val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = 'AUTOMOBILE'")

val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as('orderId, 'custId, 'orderDate, 'shipPrio)

val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")

val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
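The early Table API accepted relational expressions both as strings ("custId = id") and as Scala symbols ('id); both styles appear in the snippet above.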
Data Source – Processing – Data Sink
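A minimal end-to-end sketch of this pattern in the DataSet API; the file paths and job name are placeholder assumptions:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Data source
val lines = env.readTextFile("hdfs:///input/data.txt")

// Processing
val upperCased = lines.map(_.toUpperCase)

// Data sink
upperCased.writeAsText("hdfs:///output/result")
env.execute("Source-Processing-Sink")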

Editor's Notes

• #11: the toy program computes native transitive closure; type extraction determines the types that go in and out of each operator.