Stream Processing as a Foundational Paradigm and Apache Flink's Approach to It
Stephan Ewen, Apache Flink PMC, CTO @ data Artisans
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Hint: you already have streaming data.
4
Streaming Subsumes Batch
5
[Figure: an event log split across two partitions into hourly files, from 2016-3-1 12:00 am through 2016-3-12 3:00 am]
Streaming Subsumes Batch
6
[Figure: the same partitioned log of hourly files, annotated as "Stream (low latency)" and "Stream (high latency)"]
Streaming Subsumes Batch
7
[Figure: the same partitioned log of hourly files, annotated as "Stream (low latency)", "Batch (bounded stream)", and "Stream (high latency)"]
Stream Processing Decouples
8
[Figure: Apps a, b, and c sharing a central database ("state managed centralized") versus Apps a, b, and c each building their own state]
Time Travel
9
[Figure: a partitioned log read in three ways: process a period of historic data; process the latest data with low latency (tail of the log); reprocess the stream (historic data first, catching up with realtime data)]
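A minimal sketch of this idea (an assumption, not from the slides): the same job can tail the log with low latency or replay history, just by choosing the source's start position. The topic name, broker address, and use of the flink-connector-kafka-0.10 connector are illustrative.

import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object TimeTravel {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "time-travel-demo")

    val consumer = new FlinkKafkaConsumer010[String]("events", new SimpleStringSchema(), props)
    consumer.setStartFromLatest()      // tail of the log: latest data, low latency
    // consumer.setStartFromEarliest() // reprocess: historic data first, then catch up

    env.addSource(consumer).print()
    env.execute("time travel example")
  }
}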
10
Stream Processing is taking off (just look at this year's talks).
But why has it taken off only so recently?
11
Latency, Volume/Throughput, State & Accuracy
The combination is what makes streaming powerful
Only recently available together
12
Latency: down to the milliseconds
Volume/Throughput: 10s of millions of events/sec for stateful applications
State & Accuracy: exactly-once semantics, event time processing
Apache Flink was the first open-source system to eliminate these tradeoffs
Flink's Approach
13
Building block: Stateful Stream Processing
Core API: Fluent API, Windows, Event Time
Declarative DSL: Table API
High-level language: Stream SQL
Stateful Stream Processing
14
[Figure: a pipeline of source → filter/transform → sink operators, where the transform reads and writes state]
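A minimal sketch of the pattern above (an assumption, not from the slides), using Flink's Scala DataStream API: a keyed operator keeps a running count per key in managed state that is read and written for every event. Class and job names are illustrative.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// counts events per key, keeping the running count in Flink-managed keyed state
class CountPerKey extends RichFlatMapFunction[(String, Long), (String, Long)] {
  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
    val previous = Option(count.value()).map(_.toLong).getOrElse(0L)  // read state
    val updated = previous + 1
    count.update(updated)                                             // write state
    out.collect((in._1, updated))
  }
}

object StatefulPipeline {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(("a", 1L), ("b", 1L), ("a", 1L))  // source
      .keyBy(_._1)                                     // state is scoped per key
      .flatMap(new CountPerKey)                        // transform with state read/write
      .print()                                         // sink
    env.execute("stateful stream processing")
  }
}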
Stateful Stream Processing
15
Scalable embedded state: accessed at memory speed, and scales with the parallel operators.
Stateful Stream Processing
16
Re-load state and reset positions in the input streams: rolling back the computation for re-processing.
Stateful Stream Processing
17
Restore state to different programs: bugfixes, upgrades, A/B testing, etc.
Versioning the state of applications
18
[Figure: savepoints taken along an application's timeline (App. A), with App. B and App. C restored from those savepoints]
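Savepoints map state back to operators by stable identifiers, so a modified or upgraded program can still be restored from an old savepoint. A minimal sketch (names illustrative) of pinning those identifiers with uid():

import org.apache.flink.streaming.api.scala._

object VersionedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(("sensor-1", 21.5), ("sensor-2", 19.0))
      .keyBy(_._1)
      .sum(1)
      .uid("running-sum")  // stable id: savepointed state is matched by this id on restore
      .print()
    env.execute("versioned job")
  }
}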
Flink's Approach
19
Building block: Stateful Stream Processing
Core API: Fluent API, Windows, Event Time
Declarative DSL: Table API
High-level language: Stream SQL
Event Time / Out-of-Order
20
[Figure: the Star Wars saga ordered by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015: Episodes IV, V, VI, I, II, III, VII) versus ordered by event time (episode number)]
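Like the episodes above, events often arrive in processing-time order but should be analyzed in event-time order. A minimal sketch (an assumption, not from the slides): timestamps taken from the events, watermarks tolerating bounded out-of-orderness, and a tumbling event-time window. The Reading type and values are illustrative.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(location: String, timestamp: Long, tempF: Double)

object EventTimeJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    env.fromElements(
        Reading("room1", 1000L, 70.0),
        Reading("room1", 3000L, 72.0),
        Reading("room1", 2000L, 71.0))  // arrives out of order
      // timestamps come from the events; watermarks allow 5 seconds of lateness
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[Reading](Time.seconds(5)) {
          override def extractTimestamp(r: Reading): Long = r.timestamp
        })
      .keyBy(_.location)
      .timeWindow(Time.minutes(1))      // tumbling event-time window
      .reduce((a, b) => if (a.tempF > b.tempF) a else b)  // max temperature per window
      .print()

    env.execute("event time example")
  }
}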
(Stream) SQL & Table API
21
Table API
// convert stream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'time, 'tempF)

// define query on Table
val avgTempCTable: Table = sensorTable
  .groupBy('location)
  .window(Tumble over 1.days on 'rowtime as 'w)
  .select('w.start as 'day, 'location,
          (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")

SQL
sensorTable.sql("""
  SELECT day, location,
         avg((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY day, location
""")
What can you do with that?
22
10 billion events (2 TB) processed daily across multiple Flink jobs for a telco network control center.
Ad-hoc realtime queries: > 30 operators, processing 30 billion events daily, maintaining 100s of GB of state inside Flink with exactly-once guarantees.
Jobs with > 20 operators, running on > 5,000 vCores in a 1,000-node cluster, processing millions of events per second.
Flink's Streams playing at Batch
23
TeraSort and relational joins (classic batch jobs), graph processing, linear algebra
24
Streaming technology is already awesome, but what are the next steps?
A.k.a., what can we expect in the "next gen"?
A lot of things are "next gen" when looking at the program, so here is my take on it…
"Next Gen"
25
Queryable State
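With queryable state, the state kept inside the stream processor can be read directly by external point queries instead of being mirrored into an external key/value store. A minimal sketch (an assumption, not from the slides; names illustrative), exposing the latest count per key:

import org.apache.flink.streaming.api.scala._

object QueryableCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(("page-a", 1L), ("page-b", 1L), ("page-a", 1L))
      .keyBy(_._1)
      .sum(1)
      .keyBy(_._1)
      .asQueryableState("counts")  // latest value per key, readable by a queryable state client
    env.execute("queryable state example")
  }
}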
"Next Gen"
26
Elastic Parallelism
Maintaining exactly-once state consistency
No extra effort for the user
No need to carefully plan partitions
"Next Gen"
27
Terabytes of state inside the stream processor
Maintaining fast checkpoints and recovery
E.g., long histories of windows, large join tables
State at local memory speed
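A minimal sketch of how very large state can be configured today (an assumption, not from the slides): the embedded RocksDB state backend keeps keyed state on local disk, and incremental checkpoints upload only the changes. Requires the flink-statebackend-rocksdb dependency; the checkpoint URI and interval are illustrative.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object LargeStateConfig {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // keyed state lives in local RocksDB instances; checkpoints go to durable storage,
    // and the `true` flag enables incremental checkpoints
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))
    env.enableCheckpointing(60000)  // checkpoint every 60 seconds

    env.fromElements(("k", 1L)).keyBy(_._1).sum(1).print()
    env.execute("large state example")
  }
}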
"Next Gen"
28
Full SQL on Streams
Continuous queries, incremental results
Windows, event time, processing time
Consistent with SQL on bounded data
29
Thank you!
Appendix
30
31
We are hiring!
data-artisans.com/careers
