Stateful stream processing made easy with Apache Flink

Stateful stream processing made
easy with Apache Flink
A. Mancini - F. Tosi
ROME - APRIL 13/14 2018
1

STATEFUL STREAM PROCESSING
MADE EASY WITH
APACHE FLINK
Alberto Mancini
alberto@k-teq.com
Francesca Tosi
francesca@k-teq.com
2

Alberto Mancini
Software Architect
K-TEQ Srls
alberto@k-teq.com
#Flink #Java #GWT #Web
3
WHO WE ARE
Francesca Tosi
Software Architect
K-TEQ Srls
francesca@k-teq.com
#Java #Flink #Web #GWT

Apache Flink is an open source stream
processing framework developed by the
Apache Software Foundation.
The core of Apache Flink is a distributed
streaming dataflow engine written in
Java and Scala.
5
TL;DR

Flink's pipelined runtime system enables
the execution of bulk/batch and stream
processing programs.
6
TL;DR
- Native Stream
- Low Latency
- High Throughput
- Stateful
- Exactly-one guarantees
- Distributed
- Expressive Apis
- …
Main Features

7
TL;DR
https://github.com/apache/flink
#contributors +377
#commits +13.565
#committers over 25
#linesOdCode +1.257.949
Flink conferences (at the
beginning of this week
Flink Forward in San Francisco,
5th Int’l Flink conference).

8
TL;DR 2017 - Year in Review
#contributors #stars #forks

9
TL;DR 2017 - Year in Review
#LinesOfCode

10
TL;DR
https://github.com/apache/flink
#contributors +25
#commits +13.565
#committers over 25
#linesOdCode +1.257.949
Flink conferences (at the
beginning of this week
Flink Forward in San Francisco,
5th Int’l Flink conference).

A bit of History
Actually flink was born as a sort of spin-off
from the project Stratosphere.
“Stratosphere is a research project whose
goal is to develop the next generation Big
Data Analytics platform.
aimed at Next Generation Big Data
Analytics Platform
The project includes universities from the
area of Berlin, namely, TU Berlin, Humboldt
University and the Hasso Plattner Institute.” 12
TL;DR

13
TL;DR (Flink use cases)
uses Flink for real-time
process monitoring and ETL.
Telefónica NEXT's TÜV-certified
Data Anonymization Platform
is powered by Flink.
uses a fork of Flink called Blink
to optimize search rankings in
real time.

14
TL;DR (Flink use cases)
Ericsson used Flink to build a
real-time anomaly detector
over large infrastructures.
MediaMath uses Flink to power
its real-time reporting
infrastructure.
uses Flink to surface near
real-time intelligence from SaaS
application activity.
https://flink.apache.org/poweredby.html

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//create a kafka source
FlinkKafkaConsumer010<Event> kafkaConsumer
= new FlinkKafkaConsumer010<Event>("topic", new MessageDeserializer(), properties);
DataStreamSource<Event> kafkaStream = env.addSource(kafkaConsumer);
ZookeeperSource zookeeperSource = new ZookeeperSource("Quorum", "/path");
DataStreamSource<Map<String, String>> zookeeperStream = env.addSource(zookeeperSource);
ConnectedStreams<Event, Map<String, String>> connectedStream
= kafkaStream.connect(zookeeperStream);
CoFlatMapFunction<Event, Map<String, String>, Event> coFlatMap = new DoSomethingFlatMap();
SingleOutputStreamOperator<Event> flatmappedStream = connectedStream.flatMap(coFlatMap);
KeyedStream<Event, String> keyedStream = flatmappedStream.keyBy(new MyKeySelector());
SingleOutputStreamOperator<OutputEvent> transformed
= keyedStream.map(new AnotherMapThatIsExecutedInKeyedContext());
transformed.writeUsingOutputFormat( new MyHbaseOutputFormat() );
env.execute("Codemotion Sample");
16
TL;DR (code snippet)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//create a kafka source
= new FlinkKafkaConsumer010<Event>("topic", new MessageDeserializer(), properties);
DataStreamSource<Event> kafkaStream = env.addSource(kafkaConsumer);
DataStreamSource<Map<String, String>> zookeeperStream = env.addSource(zookeeperSource);
ConnectedStreams<Event, Map<String, String>> connectedStream
= kafkaStream.connect(zookeeperStream);
CoFlatMapFunction<Event, Map<String, String>, Event> coFlatMap = new DoSomethingFlatMap();
SingleOutputStreamOperator<Event> flatmappedStream = connectedStream.flatMap(coFlatMap);
KeyedStream<Event, String> keyedStream = flatmappedStream.keyBy(new MyKeySelector());
SingleOutputStreamOperator<OutputEvent> transformed
= keyedStream.map(new AnotherMapThatIsExecutedInKeyedContext());
transformed.writeUsingOutputFormat( new MyHbaseOutputFormat() );
17

StreamExecutionEnvironment env
= StreamExecutionEnvironment.getExecutionEnvironment();
= new FlinkKafkaConsumer010<>("topic", new MessageDeserializer(), properties);
//start from latest message
kafkaConsumer.setStartFromLatest();
env.addSource(kafkaConsumer)
.connect( env.addSource(zookeeperSource) )
.flatMap( new DoSomethingFlatMap() )
.keyBy( new MyKeySelector() )
.map( new AnotherMapThatIsExecutedInKeyedContext() )
.writeUsingOutputFormat( new MyHbaseOutputFormat() );
18

That's a lot of fun but
let’s start
from the very beginning
…
21

BigData is
…
well, big!
IBM (2014): every day about 2.5 trillion (1018
) of data
bytes are created and 90% of the data has been
created only in the last two years
Each year about EXABYTE (10^18, 2^60) of data.
22

BigData
still grows
From:
https://dvmobile.io/dvmobile-blog/feeling-overwhelmed-by-a-deluge-
of-iot-data-iot-data-analytics-dashboards-can-help 23

Processing
Model
From: https://data-artisans.com/what-is-stream-processing
24

From:
http://dme.rwth-aachen.de/en/research/projects/mapreduce
25
Computing Model (google docet)

From:
http://sqlknowledgebank.blogspot.it/2018/01/azure-data-lake-introductory.html
26
… that eventually can evolve as

From:
https://dvmobile.io/dvmobile-blog/feeling-overwhelmed-by-a-deluge-of-iot-data-iot-
data-analytics-dashboards-can-help
27
BigData still grows

1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,1505661693,0,
1,1505661693,0,1,1505661693,0,1,1505661693,0,1,1505661693,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0,1,60,0,60,0,0,0
1,354,0,1,354,0,1,354,0,1,354,0,1,354,0,1.000000032,353.9999996,4.59E-06,1.000031776,353.9996187,0.004575383,1.031757072,353.6306448,4.295
83941,2.597515222,346.6197997,34.09504708,5.319895236,344.2626949,22.18829857,1.000000032,353.9999996,0.002142519,353.9999996,4.5
9E-06,0,0,1.000031776,353.9996187,0.067641574,353.9996187,0.004575383,0,0,1.031757072,353.6306448,2.072640685,353.6306448,4.295839
41,0,0,2.597515222,346.6197997,5.839096427,346.6197997,34.09504708,0,0,5.319895236,344.2626949,4.710445687,344.2626949,22.18829857,
0,0,1.000000032,4.980575201,4.23E-07,1.000031776,4.980690887,0.000422029,1.031757072,5.092511423,0.418370441,2.597515222,31.24484876,6
976.104978,5.319895236,58.01651105,12740.63613,1.000000032,353.9999996,0.002142519,353.9999996,4.59E-06,0,0,1.000031776,353.9996187,0.
067641574,353.9996187,0.004575383,0,0,1.031757072,353.6306448,2.072640685,353.6306448,4.29583941,0,0,2.597515222,346.6197997,5.839
096427,346.6197997,34.09504708,0,0,5.319895236,344.2626949,4.710445687,344.2626949,22.18829857,0,0
1.857878541,360.4589798,35.78933753,1.91212736,360.2757326,35.92397154,1.969806657,360.0919684,35.9915418,1.99693884,360.0091976,35.
9999154,1.999693461,360.0009198,35.99999915,1.857878569,360.4589795,35.78934202,1.912156343,360.2754556,35.92848955,2.000604877,
359.8134525,40.39880279,3.589563813,352.0188402,100.0815133,6.318264483,347.7030867,81.6250773,1.857878569,360.4589795,5.982419412,
360.4589795,35.78934202,0,0,1.912156343,360.2754556,5.994037834,360.2754556,35.92848955,0,0,2.000604877,359.8134525,6.356005254,
359.8134525,40.39880279,0,0,3.589563813,352.0188402,10.00407484,352.0188402,100.0815133,0,0,6.318264483,347.7030867,9.034659778,347.
7030867,81.6250773,0,0,1.857878569,2.323596243,6.056225848,1.912156343,2.399071467,6.079503362,2.000604877,2.569134347,6.58053185,3
.589563813,22.55281278,5228.309522,6.318264483,48.84116231,11171.88781,1.857878569,360.4589795,5.982419412,360.4589795,35.78934202,0,
0,1.912156343,360.2754556,5.994037834,360.2754556,35.92848955,0,0,2.000604877,359.8134525,6.356005254,359.8134525,40.39880279,0,
0,3.589563813,352.0188402,10.00407484,352.0188402,100.0815133,0,0,6.318264483,347.7030867,9.034659778,347.7030867,81.6250773,0,0
1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0,1,337,0,337,
0,0,0,1,1505661697,0,1,1505661697,0,1,1505661697,0,1,1505661697,0,1,1505661697,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0,1,337,0,337,0,0,0
,1,337,0,337,0,0,0
… x 40k lines
https://archive.ics.uci.edu/ml/machine-learning-databases/00442/Danmini_Doorbell/
28
not just big: big and often raw, often senseless

Government
- smarter surveillance: analyze data from vehicles and cameras to alert
law enforcement of potential issues
HealthCare
- proactive treatment: continuously improve care based on
personalized data streams
Finance
- manage risk: continuously monitor trades and calculate derivative
values in real-time
Automotive
- improved quality and functionalities: detect problems sooner and
predict breakdowns
Telco
- processing call data: predictive spam and fraud detection
29
areas of application
cf. use-cases-streaming-analytics

Government
- smarter surveillance: analyze data from vehicles and cameras to alert
law enforcement of potential issues
HealthCare
- proactive treatment: continuously improve care based on
personalized data streams
Finance
- manage risk: continuously monitor trades and calculate derivative
values in real-time
Automotive
- improved quality and functionalities: detect problems sooner and
predict breakdowns
Telco
- processing call data: predictive spam and fraud detection
30
areas of application
cf. use-cases-streaming-analytics

From:
https://www.slideshare.net/ConfluentInc/kafka-and-stream-processing-taking-analytics-realt
ime-mike-spicer
32
process the data as asap

From:
https://www.flickr.com/photos/sheila_sund/15221366308
33
long running (distributed) application

- state management
- fault tolerance and recovery
- performance and scalability
- programming model
- ecosystem
34
Challenges

only a few problems can be solved without keeping some sort of
application state
state is needed for any kind of aggregation or counting
35
State Management

application state
● on a single node (and neglecting
threading) keeping state seems easy,
just keep it in the local memory but
with threads in the picture, even
on a single node, keeping state
consistent requires careful
synchronization;
● on a multi node/multi thread/long running application it may
will end up in a mess.
36
State Management

application state
Classical solution: keep state in an external database (KV-stores are
frequently the best fit).
● yet another system to manage
● yet another bottleneck to avoid
● yet another syncronization point to
care about
37
State Management

Only a few problems can be solved without keeping some sort of
application state
Flink keeps state for your application: synchronize, distribute and even
rescale.
38
State Management

Flink state comes in two flavors.
https://www.slideshare.net/dataArtisans/apache-flink-training-working-with-state
39
State Management

Flink state backends are threefold:
- MemoryStateBackend
holds data internally as objects on the Java heap, then
collected in the JobManager (master)
- FsStateBackend
holds in-flight data in the TaskManager’s memory then on
filesystem (hdfs or s3 for instance)
- RocksDBStateBackend
holds in-flight data in a RocksDB data base per task then the
whole RocksDB data base is stored on disk
40
State Management

From:
https://www.flickr.com/photos/sheila_sund/15221366308
41
Long running distributed stateful application

From:
https://www.wikihow.com/Untangle-a-Newton%27s-Cradle 42

Checkpoints
43
fault tolerance and recovery

Savepoints
“Savepoints are externally stored self-contained checkpoints that you
can use to stop-and-resume or update your Flink programs. They use
Flink’s checkpointing mechanism to create a (non-incremental)
snapshot of the state of your streaming program and write the
checkpoint data and meta data out to an external file system.”
caveat: you must consider serializers evolution for objects stored in the
state
44
fault tolerance and recovery

- event time semantics
- flexible windowing
- metrics and counters
- queryable state
- DataSet API
- Event Processing (CEP)
- Graphs: Gelly
- Machine Learning,
- TABLE API,
- SteamSQL
45
- Map: DataStream → DataStream
- FlatMap: DataStream → DataStream
- Filter: DataStream → DataStream
- KeyBy: DataStream → KeyedStream
- Reduce: KeyedStream (DataStream)
- Fold: KeyedStream (DataStream)
- Union: DataStream* → DataStream
- Connect: DataStream, DataStream →
ConnectedStreams
- CoMap: C’dStreams → DataStream
- CoFlatMap: C’dStreams →
DataStream
- Split: DataStream → SplitStream
- Select: SplitStream → DataStream
- Iterate: DataStream →
IterativeStream →
DataStream
- …
and there’s a lot more to tell

Q. Isn’t easier to use just Kafka ?
A. Well, No.
Kafka, or exactly Kafka Streams API
is a library that any standard Java application can embed and
hence does not attempt to dictate a deployment method;
whereas
Flink is a cluster framework, which means that the framework
takes care of deploying the application
https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/
48

Q. Did You write “Real Time” ?
A. Well, yes … actually, i meant …
Rigorously
“Real-time programs must guarantee response
within specified time constraints”
Here Real-time means (as in Cambridge
Dictionary)
“communicated, shown, presented, etc. at the
same time as events actually happen”
49

Q. What about Apache Storm ?
A. There is actually a compatiblility suite that let’s
you
● Run unmodified Storm topologies
● Embed Storm code (spouts and bolts) as
operators inside Flink DataStream programs.
50

Q. any kind of comparison chart ?
https://www.gmv.com/blog_gmv/future-streaming-technologies-apache-flink/
51

Stateful stream processing made easy with Apache Flink

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Stateful stream processing made easy with Apache Flink

Similar to Stateful stream processing made easy with Apache Flink (20)

Recently uploaded

Recently uploaded (20)

Stateful stream processing made easy with Apache Flink