5. Apache Flink is an open source stream
processing framework developed by the
Apache Software Foundation.
The core of Apache Flink is a distributed
streaming dataflow engine written in
Java and Scala.
TL;DR
6. Flink's pipelined runtime system enables
the execution of bulk/batch and stream
processing programs.
TL;DR
- Native Stream
- Low Latency
- High Throughput
- Stateful
- Exactly-once guarantees
- Distributed
- Expressive APIs
- …
Main Features
12. A bit of History
Actually, Flink was born as a sort of spin-off
from the project Stratosphere.
“Stratosphere is a research project whose
goal is to develop the next generation Big
Data Analytics platform.
The project includes universities from the
area of Berlin, namely, TU Berlin, Humboldt
University and the Hasso Plattner Institute.”
TL;DR
13.
TL;DR (Flink use cases)
Zalando uses Flink for real-time
process monitoring and ETL.
Telefónica NEXT's TÜV-certified
Data Anonymization Platform
is powered by Flink.
Alibaba uses a fork of Flink called Blink
to optimize search rankings in
real time.
14.
TL;DR (Flink use cases)
Ericsson used Flink to build a
real-time anomaly detector
over large infrastructures.
MediaMath uses Flink to power
its real-time reporting
infrastructure.
BetterCloud uses Flink to surface near
real-time intelligence from SaaS
application activity.
https://flink.apache.org/poweredby.html
21. That's a lot of fun but
let’s start
from the very beginning
…
22. Big Data is
…
well, big!
IBM (2014): about 2.5 quintillion (2.5 × 10^18) bytes of data
are created every day, and 90% of the data has been
created in the last two years alone
Each year about EXABYTE (10^18, 2^60) of data.
29. Government
- smarter surveillance: analyze data from vehicles and cameras to alert
law enforcement of potential issues
HealthCare
- proactive treatment: continuously improve care based on
personalized data streams
Finance
- manage risk: continuously monitor trades and calculate derivative
values in real-time
Automotive
- improved quality and functionalities: detect problems sooner and
predict breakdowns
Telco
- processing call data: predictive spam and fraud detection
areas of application
cf. use-cases-streaming-analytics
34. - state management
- fault tolerance and recovery
- performance and scalability
- programming model
- ecosystem
Challenges
35. only a few problems can be solved without keeping some sort of
application state
state is needed for any kind of aggregation or counting
State Management
36. only a few problems can be solved without keeping some sort of
application state
● on a single node (and neglecting
threading) keeping state seems easy:
just keep it in local memory. But
with threads in the picture, even
on a single node, keeping state
consistent requires careful
synchronization;
● in a multi-node, multi-threaded, long-running application it may
well end up in a mess.
State Management
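To make the single-node point concrete, here is a minimal sketch in plain Java (not Flink code). The class name `LocalStateSketch`, the key `"clicks"` and the helper `runDemo` are hypothetical names chosen for illustration. It relies on `ConcurrentHashMap.merge`, which is atomic per key, so concurrent increments are not lost; with a plain `HashMap` the same loop could silently drop updates.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LocalStateSketch {
    // Per-key counters shared by all worker threads on a single node.
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // merge() is atomic per key on ConcurrentHashMap, so no increment is lost.
    public void record(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    public long countOf(String key) {
        return counts.getOrDefault(key, 0L);
    }

    // Start `threads` workers that each record `perThread` events for one key,
    // wait for them to finish, and return the final count.
    public static long runDemo(int threads, int perThread) {
        LocalStateSketch state = new LocalStateSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    state.record("clicks");
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return state.countOf("clicks");
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10_000));
    }
}
```

Even this small case needs a concurrent data structure and explicit shutdown handling, and it still loses everything when the process dies.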
37. only a few problems can be solved without keeping some sort of
application state
Classical solution: keep state in an external database (KV-stores are
frequently the best fit).
● yet another system to manage
● yet another bottleneck to avoid
● yet another synchronization point to
care about
State Management
38. Only a few problems can be solved without keeping some sort of
application state
Flink keeps state for your application: it synchronizes, distributes and even
rescales it.
State Management
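One way to picture "distribute and even rescale" is the sketch below. It is plain Java, not Flink's API, and it is a simplification: keys are hashed into a fixed number of key groups, and contiguous ranges of key groups are assigned to workers, so changing the parallelism only moves whole groups. The names `KeyedStateSketch`, `MAX_PARALLELISM`, `keyGroupFor`, `workerFor` and `distribute` are hypothetical, and the use of `String.hashCode` as the hash is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Map;

public class KeyedStateSketch {
    // Hypothetical fixed number of key groups (an upper bound on parallelism).
    public static final int MAX_PARALLELISM = 128;

    // Map a key to a stable key group; plain hashCode is an assumption here.
    public static int keyGroupFor(String key) {
        return Math.floorMod(key.hashCode(), MAX_PARALLELISM);
    }

    // Assign contiguous ranges of key groups to `parallelism` workers.
    public static int workerFor(int keyGroup, int parallelism) {
        return keyGroup * parallelism / MAX_PARALLELISM;
    }

    // Split per-key state across workers for the given parallelism.
    public static Map<Integer, Map<String, Long>> distribute(Map<String, Long> state,
                                                             int parallelism) {
        Map<Integer, Map<String, Long>> byWorker = new HashMap<>();
        for (Map.Entry<String, Long> e : state.entrySet()) {
            int worker = workerFor(keyGroupFor(e.getKey()), parallelism);
            byWorker.computeIfAbsent(worker, w -> new HashMap<>())
                    .put(e.getKey(), e.getValue());
        }
        return byWorker;
    }

    public static void main(String[] args) {
        // Word counts stand in for application state.
        Map<String, Long> counts = new HashMap<>();
        for (String word : "to be or not to be".split(" ")) {
            counts.merge(word, 1L, Long::sum);
        }
        // "Rescaling" is just redistributing the same state with a new parallelism.
        System.out.println("parallelism 2: " + distribute(counts, 2));
        System.out.println("parallelism 4: " + distribute(counts, 4));
    }
}
```

The point of the indirection through key groups is that a key's group never changes, so rescaling does not require rehashing individual keys.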
39. Flink state comes in two flavors.
https://www.slideshare.net/dataArtisans/apache-flink-training-working-with-state
State Management
40. Flink state backends are threefold:
- MemoryStateBackend
holds data internally as objects on the Java heap; on checkpoint, the
snapshot is collected in the JobManager (master)
- FsStateBackend
holds in-flight data in the TaskManager’s memory; on checkpoint, writes
it to a filesystem (HDFS or S3, for instance)
- RocksDBStateBackend
holds in-flight data in a RocksDB database per task; on checkpoint, the
whole RocksDB database is stored on disk
State Management
44. Savepoints
“Savepoints are externally stored self-contained checkpoints that you
can use to stop-and-resume or update your Flink programs. They use
Flink’s checkpointing mechanism to create a (non-incremental)
snapshot of the state of your streaming program and write the
checkpoint data and meta data out to an external file system.”
caveat: you must plan for serializer evolution for the objects stored in the
state
fault tolerance and recovery
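The stop-and-resume idea can be sketched in miniature with plain Java serialization. This is not how Flink writes savepoints; it only illustrates "full snapshot to an external file, then restore". The names `SavepointSketch`, `save`, `restore` and `stopAndResume` are hypothetical. The caveat about serializers shows up here too: `restore` only works while the stored classes remain compatible with what was written.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

public class SavepointSketch {
    // Write a full (non-incremental) snapshot of the state to an external file.
    public static void save(HashMap<String, Long> state, Path target) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(target))) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the snapshot back; fails if the stored classes are no longer compatible.
    @SuppressWarnings("unchecked")
    public static HashMap<String, Long> restore(Path source) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(source))) {
            return (HashMap<String, Long>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("stored class no longer exists", e);
        }
    }

    // Snapshot the state, drop the in-memory copy, restore it and keep counting.
    public static long stopAndResume() {
        HashMap<String, Long> state = new HashMap<>();
        state.put("events", 41L);
        Path file;
        try {
            file = Files.createTempFile("savepoint-sketch", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        file.toFile().deleteOnExit();
        save(state, file);
        state = null; // the "stopped" program no longer holds its state in memory
        HashMap<String, Long> resumed = restore(file);
        resumed.merge("events", 1L, Long::sum);
        return resumed.get("events");
    }

    public static void main(String[] args) {
        System.out.println(stopAndResume());
    }
}
```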
48. Q. Isn’t it easier to just use Kafka?
A. Well, no.
Kafka, or more precisely the Kafka Streams API,
is a library that any standard Java application can embed and
hence does not attempt to dictate a deployment method;
whereas
Flink is a cluster framework, which means that the framework
takes care of deploying the application
https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/
49. Q. Did you write “Real Time”?
A. Well, yes … actually, I meant …
Rigorously
“Real-time programs must guarantee response
within specified time constraints”
Here, real-time means (as in the Cambridge
Dictionary)
“communicated, shown, presented, etc. at the
same time as events actually happen”
50. Q. What about Apache Storm?
A. There is actually a compatibility suite that lets
you
● Run unmodified Storm topologies
● Embed Storm code (spouts and bolts) as
operators inside Flink DataStream programs.
51. Q. Any kind of comparison chart?
https://www.gmv.com/blog_gmv/future-streaming-technologies-apache-flink/