Overview of Apache Flink:
the 4G of Big Data Analytics Frameworks
Hadoop Summit Europe,
Dublin, Ireland.
April 13th, 2016
Slim Baltagi
Director, Enterprise Architecture
Capital One Financial Corporation
Agenda
1. How Apache Flink is a multi-purpose Big
Data Analytics Framework?
2. Why streaming analytics are emerging?
3. Why Flink is suitable for real-world
streaming analytics?
4. What are some novel use cases enabled by
Flink?
5. Who is using Flink?
6. Where do you go from here?
1. How Apache Flink is a multi-purpose Big Data
Analytics Framework?
1.1. What is Apache Flink Stack?
1.2. Why Apache Flink is the 4G of Big Data
Analytics?
1.3. What are Apache Flink Innovations?
1.1. What is Apache Flink Stack?
APIs & Libraries
• DataSet (Java/Scala/Python): Batch Processing
• DataStream (Java/Scala): Stream Processing
• Libraries and integrations: Gelly, Gelly-Stream, Table, FlinkML, FlinkCEP, Hadoop M/R, Storm, Apache Beam, Cascading, MRQL, SAMOA, Zeppelin
System
• Distributed Streaming Dataflow Engine
• Batch Optimizer, Stream Builder
Deploy
• Local: Single JVM, Embedded, Docker
• Cluster: Standalone, YARN, Mesos (WIP)
• Cloud: Google’s GCE, Amazon’s EC2, IBM Docker Cloud, …
Storage
• Files: Local, HDFS, S3, Azure, Alluxio
• Databases: MongoDB, HBase, SQL, …
• Streams: Flume, Kafka, MapR Streams, RabbitMQ, …
1.2. Why Apache Flink is the 4G of Big Data Analytics?
1G: MapReduce
• Batch
2G: Directed Acyclic Graphs (DAG) Dataflows
• Batch
• Interactive
3G: RDD: Resilient Distributed Datasets
• Batch
• Interactive
• Near-Real Time Streaming (micro-batches)
• Iterative processing
4G: Cyclic Dataflows
• Hybrid
• Interactive
• Real-Time Streaming + Real-World Streaming (out of order streams, windowing, backpressure, CEP, …)
• Native Iterative processing
1.3. What are Apache Flink Innovations?
Apache Flink came with many innovations.
Some of these innovations are influencing features in other frameworks:
1. Custom memory management and binary processing, in Flink from day one, inspired Apache Spark to do so for its project Tungsten since version 1.6
• https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
• https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. The DataSet API has been in Flink since its early days and inspired Apache Spark to introduce its Dataset API in version 1.6
• https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html
• https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
3. Flink’s rich windowing semantics for streaming:
Flink supports windows over time, count, or sessions.
Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns.
Flink inspired both Apache Storm (1.0.0 was released on April 12th, 2016) and Spark Streaming (version 2.0 is expected in May 2016) to start supporting rich windowing.
• https://storm.apache.org/2016/04/12/storm100-released.html
• http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia/15
Some of Flink’s innovations are not available in other open source tools:
1. The only hybrid (Real-Time Streaming + Batch) distributed data processing engine natively supporting many use cases: Batch, Real-Time streaming, Machine learning, Graph processing and Relational queries
2. Native iterations (Iterate and DeltaIterate) dramatically boost the performance of Machine learning and Graph analytics requiring iterations.
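The idea behind a delta iteration can be sketched in plain Scala. This is not Flink's DeltaIterate API; it only illustrates the assumed principle of keeping a solution set plus a workset, so that each step touches only the elements that changed in the previous one. The object name, the sample graph and the min-label connected-components algorithm are chosen here purely for illustration.

```scala
object DeltaIterationSketch {
  // Undirected graph as adjacency lists (hypothetical sample representation).
  type Graph = Map[Int, Set[Int]]

  /** Connected components by min-label propagation, driven by a workset. */
  def connectedComponents(graph: Graph): Map[Int, Int] = {
    // Solution set: every vertex starts in its own component.
    var solution: Map[Int, Int] = graph.keys.map(v => v -> v).toMap
    // Workset: vertices whose label changed in the previous step.
    var workset: Set[Int] = graph.keySet

    while (workset.nonEmpty) {
      // Only changed vertices send their label to their neighbours.
      val candidates: Seq[(Int, Int)] =
        workset.toSeq.flatMap(v => graph(v).toSeq.map(n => n -> solution(v)))
      // Keep only the updates that actually lower a neighbour's label.
      val updates: Map[Int, Int] = candidates
        .groupBy(_._1)
        .map { case (n, cs) => n -> cs.map(_._2).min }
        .filter { case (n, label) => label < solution(n) }
      solution = solution ++ updates
      workset = updates.keySet
    }
    solution
  }

  def main(args: Array[String]): Unit = {
    val graph: Graph = Map(
      1 -> Set(2), 2 -> Set(1, 3), 3 -> Set(2),
      4 -> Set(5), 5 -> Set(4))
    println(connectedComponents(graph))
  }
}
```

On the sample graph the workset shrinks from five vertices to three, then one, then none, which is where the savings over recomputing every vertex in every step come from.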
The only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases:
• Real-Time stream processing
• Machine Learning at scale
• Graph Analysis
• Batch Processing
3. Simplicity of configuration: Flink requires no memory thresholds to configure, no complicated network configurations, no serializers to be configured, …
4. Little tuning required: Flink’s optimizer can choose execution strategies automatically in any environment.
• According to Mike Olson, Chief Strategy Officer of
Cloudera Inc. “Spark is too knobby — it has too
many tuning parameters, and they need constant
adjustment as workloads, data volumes, user
counts change.”
Reference: http://vision.cloudera.com/one-platform/
5. Full support of Apache Beam (for combination of Batch and Stream): event time, sessions, …
References:
• The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, 2015 http://research.google.com/pubs/pub43864.html
• Dataflow/Beam & Spark: A Programming Model Comparison, February 3rd, 2016 https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
6. Innovations in stream processing: event time, rich streaming window operations, savepoints, …
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
7. FlinkCEP is the Complex Event Processing library for Flink. It allows you to easily detect complex event patterns in an unbounded stream of data to support better insight and decision making.
• Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016 http://flink.apache.org/news/2016/04/06/cep-monitoring.html
• FlinkCEP - Complex event processing for Flink https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
8. Run Legacy Big Data applications on Flink: Preserve your investment in your legacy Big Data applications by running your legacy code on Flink’s engine using the Hadoop and Storm compatibility layers, the Cascading adapter and probably a Spark adapter in the future.
Run your legacy Big Data applications on Flink
• Flink’s MapReduce compatibility layer allows you to run legacy Hadoop MapReduce jobs, reuse Hadoop input and output formats, and reuse functions like Map and Reduce. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/hadoop_compatibility.html
• Cascading on Flink allows you to port existing Cascading-MapReduce applications to Apache Flink with virtually no code changes. Expected advantages are a performance boost and lower resource consumption. https://github.com/dataArtisans/cascading-flink/tree/release-0.2
• Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm: execute existing Storm topologies using Flink as the underlying engine, and reuse legacy application code (bolts and spouts) inside Flink programs. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/storm_compatibility.html
2. Why streaming analytics are emerging?
Stonebraker et al. predicted in 2005 that stream processing would become increasingly important and attributed this to the ‘sensorization of the real world’: everything of material significance on the planet gets ‘sensor-tagged’ and reports its state or location in real time.
Reference: http://cs.brown.edu/~ugur/8rulesSigRec.pdf
I think stream processing is becoming important not only
because of this sensorization of the real world but also
because of the following factors:
1. Data streams
2. Technology
3. Business
4. Customers
[Figure: four factors behind the emergence of streaming analytics: 1. Data Streams, 2. Technology, 3. Business, 4. Customers]
1. Data Streams
Real-world data is available as a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. Examples:
• Sensor networks data
• Web logs
• Database transactions
• System logs
• Tweets and social media data in general
• Click streams
• Mobile apps data
2. Technology
• Simplified data architecture with Apache Kafka as a major innovation and backbone of streaming architectures.
• Rapidly maturing open source streaming analytics tools: Apache Flink, Apache Spark’s Streaming module, Kafka Streams, Apache Samza, Apache Storm, Apache NiFi, …
• Cloud services for stream processing: Google Cloud Dataflow, Azure Stream Analytics, Amazon Kinesis Streams, IBM InfoSphere Streams, …
• Vendors innovating in this space: Data Artisans, DataTorrent, Striim, Databricks, MapR, Hortonworks, Confluent, StreamSets, …
• More mobile devices than human beings!
3. Business
Challenges:
• Lag between data creation and actionable insights.
• Web and mobile application growth, new types/sources of data.
• Need for organizations to shift from a reactive approach to a more proactive approach to interactions with customers, suppliers and employees.
Opportunities:
• Embracing streaming analytics helps organizations with faster time to insight, competitive advantages and operational efficiency in a wide range of verticals.
• With streaming analytics, new startups are/will be challenging established companies. Example: Pay-As-You-Go insurance or Usage-Based Auto Insurance.
• Speed is said to have become the new currency of business.
4. Customers
• Customers increasingly demand instant responses, in the way they are used to on social networks: Twitter, Facebook, LinkedIn, …
• The younger generation, who grew up with video gaming and are accustomed to real-time interaction, are now themselves a growing class of customers.
3. Why Flink is suitable for real-world streaming
analytics?
3.1. Flink’s streaming analytics features
3.2. What are some streaming analytics use
cases suitable for Flink?
3.1. Flink’s streaming analytics features
Apache Flink 1.0, which was released on March 8th
2016, comes with a competitive set of streaming
analytics features, some of which are unique in the
open source domain.
Apache Flink 1.0.1 was released on April 6th 2016.
The combination of these features makes Apache
Flink a unique choice for real-world streaming
analytics.
Let’s discuss some of Apache Flink’s features for real-world streaming analytics.
1. Pipelined processing engine
2. Stream abstraction: DataStream as in the real-world
3. Performance: Low latency and high throughput
4. Support for rich windowing semantics
5. Support for different notions of time
6. Stateful stream processing
7. Fault tolerance and correctness
8. High Availability
9. Backpressure handling
10. Expressive and easy-to-use APIs in Scala and Java
11. Support for batch
12. Integration with the Hadoop ecosystem
1. Pipelined processing engine
• Flink is a pipelined (streaming) engine akin to parallel database systems, rather than a batch engine like Spark.
• ‘Flink’s runtime is not designed around the idea that operators wait for their predecessors to finish before they start, but they can already consume partially generated results.’
• ‘This is called pipeline parallelism and means that several transformations in a Flink program are actually executed concurrently with data being passed between them through memory and network channels.’ http://data-artisans.com/apache-flink-new-kid-on-the-block/
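The difference between pipelined and staged execution can be illustrated with a small single-threaded analogy in plain Scala. This is only a sketch under simplifying assumptions: Flink runs its operators concurrently across memory and network channels, whereas here a lazy Iterator stands in for the pipelined case and a strict List for the staged case. The names parse and sink are hypothetical stages.

```scala
import scala.collection.mutable.ArrayBuffer

object PipeliningSketch {
  /** Runs a two-stage job and returns the order in which the stages touched records. */
  def run(pipelined: Boolean): List[String] = {
    val log = ArrayBuffer.empty[String]
    val source = List(1, 2, 3)
    def parse(x: Int): Int = { log += s"parse($x)"; x * 10 }
    def sink(x: Int): Unit = { log += s"sink($x)"; () }

    if (pipelined) {
      // Iterator is lazy: each record flows through both stages before the next one is read.
      source.iterator.map(parse).foreach(sink)
    } else {
      // List is strict: stage one materialises its full result before stage two starts.
      source.map(parse).foreach(sink)
    }
    log.toList
  }

  def main(args: Array[String]): Unit = {
    println(run(pipelined = true))
    println(run(pipelined = false))
  }
}
```

In the pipelined run the log interleaves parse and sink per record, so the first result is available after one record; in the staged run all three parse entries precede the first sink entry.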
2. Stream abstraction: DataStream as in the real world
• Real-world data is a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise.
• Flink, as a stream processing system, models streams as what they are in the real world, a series of events, and uses DataStream as an abstraction.
• Spark, as a batch processing system, approximates these streams as micro-batches and uses DStream as an abstraction. This adds an artificial latency!
3. Performance: Low latency and high throughput
• The pipelined processing engine enables true low latency streaming applications with fast results in milliseconds.
• High throughput: efficiently handles high volumes of streams (millions of events per second).
• Tunable latency/throughput tradeoff: a tuning knob lets you navigate the latency-throughput trade-off.
• Yahoo! benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions grouped by campaign).
• Full Yahoo! article (the benchmark stops at low write throughput and the programs are not fault tolerant): https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• Full Data Artisans article, which extends the Yahoo! benchmark to high volumes and uses Flink’s built-in state: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
• Flink outperformed both Spark Streaming and Storm in this benchmark modeled after a real-world application:
– Flink achieves throughput of 15 million messages/second on a 10-machine cluster. This is 35x higher throughput compared to Storm (80x compared to Yahoo’s runs).
– Flink ran with exactly once guarantees, Storm with at least once.
• Ultimately, you need to test the performance of your own streaming analytics application, as it depends on your own logic and the version of your preferred stream processing tool!
4. Support for rich windowing semantics
Flink provides rich windowing semantics. A window is a grouping of events based on some function of time (all records of the last 5 minutes), count (the last 10 events) or session (all the events of a particular web user).
Window types in Flink:
• Tumbling windows (no overlap)
• Sliding windows (with overlap)
• Session windows (gap of activity)
• Custom windows (with assigners, triggers and evictors)
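The three built-in window types listed above can be sketched as plain functions over timestamps. This is an illustrative sketch, not Flink's window assigner code: the object and function names are hypothetical, timestamps are assumed to be non-negative Longs, and windows are half-open intervals [start, start + size).

```scala
object WindowSketch {
  /** Start of the tumbling window of length `size` that contains `ts`. */
  def tumblingStart(ts: Long, size: Long): Long = ts - (ts % size)

  /** Starts of all sliding windows (length `size`, step `slide`) that contain `ts`, newest first. */
  def slidingStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size)
      .toList
  }

  /** Splits sorted timestamps into sessions separated by more than `gap` of inactivity. */
  def sessions(sortedTs: List[Long], gap: Long): List[List[Long]] =
    sortedTs.foldLeft(List.empty[List[Long]]) {
      // Extend the current session while the gap to its latest event is small enough.
      case (current :: done, ts) if ts - current.head <= gap => (ts :: current) :: done
      // Otherwise open a new session.
      case (acc, ts) => List(ts) :: acc
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit = {
    println(tumblingStart(7L, 5L))
    println(slidingStarts(7L, 10L, 5L))
    println(sessions(List(1L, 2L, 3L, 10L, 11L, 30L), 5L))
  }
}
```

An event belongs to exactly one tumbling window but to size / slide sliding windows, which is why sliding windows overlap; session boundaries depend only on the data, not on a fixed grid.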
In many systems, these windows are hard-coded and
connected with the system’s internal checkpointing
mechanism. Flink is the first open source streaming
engine that completely decouples windowing from
fault tolerance, allowing for richer forms of windows,
such as sessions.
Further reading:
• http://flink.apache.org/news/2015/12/04/Introducing-windows.html
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
5. Support for different notions of time
In a streaming program with Flink, for example to define windows with respect to time, one can refer to different notions of time:
• Event time: when an event happened in the real world.
• Ingestion time: when data is loaded into Flink, from Kafka for example.
• Processing time: when data is processed by Flink.
In the real world, streams of events rarely arrive in the order in which they are produced, due to distributed sources, non-synced clocks, network delays, … They are said to be “out of order” streams.
Flink is the first open source streaming engine that supports out of order streams and is able to consistently process events according to their event time.
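A minimal sketch of event-time processing over an out-of-order stream, in plain Scala rather than the Flink API. It assumes a simple watermark equal to the highest event time seen minus a fixed maxDelay, tumbling windows, and a policy of dropping events that arrive after their window has fired; all names (Event, WindowResult, process, maxDelay) are hypothetical.

```scala
import scala.collection.mutable

object EventTimeSketch {
  final case class Event(eventTime: Long, value: Int)
  final case class WindowResult(start: Long, sum: Int)

  /**
   * Sums values per tumbling event-time window [start, start + size).
   * A window fires once the watermark reaches its end.
   */
  def process(arrivals: Seq[Event], size: Long, maxDelay: Long): List[WindowResult] = {
    val open = mutable.SortedMap.empty[Long, Int] // window start -> running sum
    val fired = mutable.ListBuffer.empty[WindowResult]
    var watermark = Long.MinValue

    // Emit, in start order, every open window that ends at or before `upTo`.
    def fire(upTo: Long): Unit = {
      val ready = open.keysIterator.takeWhile(_ + size <= upTo).toList
      ready.foreach { start =>
        fired += WindowResult(start, open(start))
        open -= start
      }
    }

    for (e <- arrivals) {
      val start = e.eventTime - (e.eventTime % size)
      // Events whose window has already fired are too late and are dropped in this sketch.
      if (start + size > watermark) open(start) = open.getOrElse(start, 0) + e.value
      watermark = math.max(watermark, e.eventTime - maxDelay)
      fire(watermark)
    }
    fire(Long.MaxValue) // end of input: flush the remaining windows
    fired.toList
  }

  def main(args: Array[String]): Unit = {
    val outOfOrder = Seq(1L, 12L, 4L, 8L, 17L, 25L).map(t => Event(t, 1))
    println(process(outOfOrder, size = 10L, maxDelay = 5L))
  }
}
```

Because windows are keyed by the event's own timestamp, the sample stream produces the same per-window sums whether its events arrive shuffled or sorted, as long as no event is later than maxDelay.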
5. Support for different notions of time
http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#time
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/event_time.html
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
6. Stateful stream processing
Many operations in a dataflow simply look at one individual event at a time, for example an event parser.
Other operations, called stateful operations, need to store data at the end of a window for computations occurring in later windows.
Where is the state of these stateful operations maintained?
• The state can be stored in memory, in the file system, or in RocksDB, which is an embedded key/value data store and not an external database.
• Flink also supports state versioning through savepoints, which are checkpoints of the state of a running streaming job that can be manually triggered by the user while the job is running.
• Savepoints enable:
– Code upgrades: both application and framework
– Cluster maintenance and migration
– A/B testing and what-if scenarios
– Testing and debugging
– Restarting a job with adjusted parallelism
Further reading:
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
• https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
7. Fault tolerance and correctness
How to ensure that the state is correct after failures?
Apache Flink offers a fault tolerance mechanism to
consistently recover the state of data streaming
applications.
This ensures that even in the presence of failures, the
operators do not perform duplicate updates to their
state (exactly once guarantees). This basically means
that the computed results are the same whether there
are failures along the way or not.
There is a switch to downgrade the guarantees to at
least once if the use case tolerates duplicate updates.
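The exactly-once property described above can be illustrated with a toy restore-and-replay loop in plain Scala. This is a deliberately simplified, single-process sketch and not Flink's distributed snapshot mechanism: it assumes a replayable source addressed by offset and a snapshot that stores the source position together with the operator state. The names Snapshot, run, interval and failAt are hypothetical.

```scala
object CheckpointSketch {
  /** A consistent snapshot: the source position together with the operator state at that position. */
  final case class Snapshot(offset: Int, counts: Map[String, Int])

  /**
   * Counts words from a replayable source, taking a snapshot every `interval` records.
   * If `failAt` is set, the job "crashes" once just before processing that offset,
   * then restarts from the latest snapshot and replays the records after it.
   */
  def run(source: Vector[String], interval: Int, failAt: Option[Int]): Map[String, Int] = {
    var snapshot = Snapshot(0, Map.empty)
    var offset = 0
    var counts = Map.empty[String, Int]
    var pendingFailure = failAt

    while (offset < source.length) {
      if (pendingFailure.contains(offset)) {
        // Failure: in-flight state is lost; restore state and source position together.
        pendingFailure = None
        offset = snapshot.offset
        counts = snapshot.counts
      } else {
        val word = source(offset)
        counts = counts.updated(word, counts.getOrElse(word, 0) + 1)
        offset += 1
        if (offset % interval == 0) snapshot = Snapshot(offset, counts)
      }
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val words = Vector("a", "b", "a", "c", "b")
    println(run(words, interval = 2, failAt = None))
    println(run(words, interval = 2, failAt = Some(3)))
  }
}
```

Because the state and the source offset are rolled back together, the record processed between the last snapshot and the crash is counted once, not twice, and both runs end with the same counts.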
Further reading:
• High-throughput, low-latency, and exactly-once stream processing with Apache Flink http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
• Data Streaming Fault Tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
• ‘Lightweight Asynchronous Snapshots for Distributed Dataflows’, June 28, 2015 http://arxiv.org/pdf/1506.08603v1.pdf
• ‘Distributed Snapshots: Determining Global States of Distributed Systems’, February 1985 (the Chandy-Lamport algorithm) http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
8. High Availability
In the real world, streaming analytics applications need to be reliable, capable of running jobs for months, and resilient in the event of failures.
The JobManager (Master) is responsible for scheduling and resource management. If it crashes, no new programs can be submitted and running programs will fail.
Flink provides a High Availability (HA) mode to recover from a JobManager crash and eliminate the Single Point Of Failure (SPOF).
Further reading: JobManager High Availability https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html
9. Backpressure handling
In the real world, there are situations where a system is
receiving data at a higher rate than it can normally
process. This is called backpressure.
Flink handles backpressure implicitly through its
architecture without user interaction while
backpressure handling in Spark is through manual
configuration: spark.streaming.backpressure.enabled.
Flink provides backpressure monitoring to allow users
to understand bottlenecks in streaming applications.
Further reading:
• How Flink handles backpressure? by Ufuk Celebi, Kostas Tzoumas and Stephan Ewen, August 31, 2015. http://data-artisans.com/how-flink-handles-backpressure/
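Implicit backpressure can be approximated with a bounded buffer between a fast producer and a slow consumer. The sketch below uses a java.util.concurrent.ArrayBlockingQueue as a stand-in for a bounded transfer buffer; that substitution, and the names simulate, n and capacity, are assumptions made for illustration and not a description of Flink's internals.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackpressureSketch {
  /**
   * Pushes `n` records through a bounded buffer to a slow consumer.
   * Returns (records consumed, largest buffer size observed by the consumer).
   */
  def simulate(n: Int, capacity: Int): (Int, Int) = {
    val buffer = new ArrayBlockingQueue[Int](capacity)
    var maxSeen = 0
    var consumed = 0

    // Fast producer: put() blocks while the buffer is full, which slows the producer down.
    val producer = new Thread(new Runnable {
      override def run(): Unit = for (i <- 1 to n) buffer.put(i)
    })
    producer.start()

    // Slow consumer on the calling thread.
    for (_ <- 1 to n) {
      maxSeen = math.max(maxSeen, buffer.size())
      buffer.take()
      consumed += 1
      Thread.sleep(1)
    }
    producer.join()
    (consumed, maxSeen)
  }

  def main(args: Array[String]): Unit = {
    println(simulate(n = 50, capacity = 4))
  }
}
```

No record is dropped and the buffer never grows past its capacity: the producer simply runs at the consumer's pace, with no configuration flag involved.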
10. Expressive and easy-to-use APIs in Scala and Java
• The high-level, expressive and easy-to-use DataStream API with flexible window semantics results in significantly less custom application logic compared to other open source stream processing solutions.
• Flink's DataStream API ports many operators from its DataSet batch processing API, such as map, reduce, and join, to the streaming world.
• In addition, it provides stream-specific operations such as window, split, connect, …
• Its support for user-defined functions eases the implementation of custom application behavior.
• The DataStream API is available in Scala and Java.
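DataStream API (streaming): Window WordCount

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Word(word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Host and port are elided on the slide.
val lines: DataStream[String] = env.socketTextStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word")
  // 5-second windows that slide every second (Flink 1.0 API)
  .timeWindow(Time.seconds(5), Time.seconds(1))
  .sum("frequency")
  .print()
env.execute()

DataSet API (batch): WordCount

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// Input path is elided on the slide.
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word")
  .sum("frequency")
  // print() triggers execution in the DataSet API, so no env.execute() is needed
  .print()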
11. Support for batch
• In Flink, batch processing is a special case of stream processing, as finite data sources are just streams that happen to end.
• Flink offers a full toolset for batch processing with a dedicated DataSet API and libraries for machine learning and graph processing.
• In addition, Flink contains several batch-specific optimizations, such as for scheduling, memory management, and query optimization.
• Flink outperforms dedicated batch processing engines such as Spark and Hadoop MapReduce in batch use cases.
12. Integration with the Hadoop ecosystem
[Figure: integration points with the Hadoop ecosystem, including POSIX file systems and Java/Scala collections]
3.2 What are some streaming analytics use cases
suitable for Flink?
1. Financial services
2. Telecommunications
3. Online gaming systems
4. Security & Intelligence
5. Advertisement serving
6. Sensor Networks
7. Social Media
8. Healthcare
9. Oil & Gas
10. Retail & eCommerce
11. Transportation and logistics
4. What are some novel use cases enabled by
Flink?
4.1. Flink as an embedded key/value data store
4.2. Flink as a distributed CEP engine
4.1. Flink as an embedded key/value data store
• The stream processor as a database: a new design pattern for data streaming applications, using Apache Flink and Apache Kafka: building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams.
• The stateful operator features in Flink allow a streaming application to query state in the stream processor instead of in a key/value store, which is often a bottleneck. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
“State querying” feature is expected in upcoming Flink 1.1
http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed/38
4.2. Flink as a distributed CEP engine
Flink stream processor as a CEP (Complex Event Processing) engine. Example: an application that ingests network monitoring events, identifies access patterns such as intrusion attempts using FlinkCEP, and analyzes and aggregates the identified access patterns.
Upcoming talk: ‘Streaming analytics and CEP - Two sides of the same coin’ by Till Rohrmann and Fabian Hueske at Berlin Buzzwords, June 05-07, 2016. http://berlinbuzzwords.de/session/streaming-analytics-and-cep-two-sides-same-coin
Further reading:
– Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016 http://flink.apache.org/news/2016/04/06/cep-monitoring.html
– FlinkCEP - Complex event processing for Flink https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
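The kind of pattern a CEP engine looks for in the intrusion-detection example can be sketched in plain Scala over a finite batch of events. This does not use the FlinkCEP API; the pattern (n consecutive failed attempts from one source within a time span), the Event fields and the function names are all assumptions chosen for illustration.

```scala
object CepSketch {
  final case class Event(source: String, kind: String, timestamp: Long)

  /** True if `evs` contains `n` consecutive "failed" events spanning at most `withinMs`. */
  def matches(evs: Seq[Event], n: Int, withinMs: Long): Boolean =
    evs.sortBy(_.timestamp).sliding(n).exists { run =>
      run.size == n &&
      run.forall(_.kind == "failed") &&
      run.last.timestamp - run.head.timestamp <= withinMs
    }

  /** Sources whose events match the pattern. */
  def suspiciousSources(events: Seq[Event], n: Int, withinMs: Long): Set[String] =
    events.groupBy(_.source)
      .filter { case (_, evs) => matches(evs, n, withinMs) }
      .keys.toSet

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("10.0.0.1", "failed", 1000L), Event("10.0.0.1", "failed", 2000L),
      Event("10.0.0.1", "failed", 3000L),
      Event("10.0.0.2", "failed", 1000L), Event("10.0.0.2", "ok", 2000L),
      Event("10.0.0.2", "failed", 3000L))
    println(suspiciousSources(events, n = 3, withinMs = 10000L))
  }
}
```

Only the first source is flagged, since the second one's failures are interrupted by a successful attempt. A streaming CEP engine evaluates such a pattern incrementally per key as events arrive instead of over a finished batch.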
5. Who is using Flink?
Some companies using Flink for streaming analytics, by industry: Telecommunications, Retail, Financial Services, Gaming, Security
Powered by Flink page: https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink
• Twitter held its hack week and the winner, announced on December 18th 2015, was a Flink-based streaming project! Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink, posted on February 2, 2016 by Jamie Grier:
– http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
– http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed
• Yahoo! ran benchmarks to compare the performance of one of its use cases, originally implemented on Apache Storm, against Spark Streaming and Flink. Results posted on December 18, 2015:
– http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
– http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
– https://github.com/dataArtisans/yahoo-streaming-benchmark
– http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark
Generic Streaming Analytics Architectural pattern:
• Event Producers: Apps, Devices, Sensors
• Collector: Flume, SpringXD, Logstash, NiFi, Fluentd
• Broker: Kafka, RabbitMQ, JMS, Amazon Kinesis, Google Cloud Pub/Sub, MapR Streams
• Processor: Flink, Spark, Storm, Samza, Kafka Streams
• Indexer: ElasticSearch, Solr, Cassandra, HBase, MapR DB, MongoDB, Apache Geode
• Visualizer/Search: Kibana, Custom GUI
This is changing with Flink’s alerts, StreamSQL, state querying, FlinkCEP, …
6. Where do you go from here?
A few resources for you:
• Flink Knowledge Base: One-Stop for everything related to Apache Flink. By Slim Baltagi http://sparkbigdata.com/component/tags/tag/27-flink
• Flink at the Apache Software Foundation: flink.apache.org/
• Free Apache Flink training from data Artisans http://dataartisans.github.io/flink-training
• Flink Forward Conference, 12-14 September 2016, Berlin, Germany http://flink-forward.org/ (call for submissions announced today, April 13th, 2016!)
• Free ebook from MapR: Streaming Architecture: New Designs Using Apache Kafka and MapR Streams https://www.mapr.com/streaming-architecture-using-apache-kafka-mapr-streams
A few takeaways:
• Apache Flink’s unique capabilities enable new and sophisticated use cases, especially for real-world streaming analytics.
• Customer demand will push major Hadoop distributors to package Flink and support it.
• What would be the 5G of Big Data Analytics platforms? Guiding principles would be Unification, Simplification and Ease of use:
– GUI to build batch and streaming applications
– Unified API for batch and streaming
– Single engine for batch and streaming
– Unified storage layer (files, streams, NoSQL)
– Unified query engine for SQL, NoSQL and structured streams
Thanks!
To all of you for attending!
Let’s keep in touch!
• sbaltagi@gmail.com
• @SlimBaltagi
• https://www.linkedin.com/in/slimbaltagi
Any questions?

Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
20120907 microbiome-intro
20120907 microbiome-intro20120907 microbiome-intro
20120907 microbiome-intro
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 

Viewers also liked

Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Slim Baltagi
 
PosterDigital SpinetiX Tutorial: How to connect your SpinetiX
PosterDigital SpinetiX Tutorial: How to connect your SpinetiXPosterDigital SpinetiX Tutorial: How to connect your SpinetiX
PosterDigital SpinetiX Tutorial: How to connect your SpinetiX
PosterDigital
 
PosterDigital: Starter guide English
PosterDigital: Starter guide EnglishPosterDigital: Starter guide English
PosterDigital: Starter guide English
PosterDigital
 
Latte art ok
Latte art okLatte art ok
Latte art okMokasirs
 
Pestaña inicio
Pestaña inicioPestaña inicio
Pestaña inicio
Edu Tec
 
HIV Powerprint Presentation final review
HIV Powerprint Presentation final reviewHIV Powerprint Presentation final review
HIV Powerprint Presentation final reviewShirlgandy Saint Jean
 
Ville de pekin
Ville de pekinVille de pekin
Ville de pekin
F ztm
 
PosterDigital AMX Tutorial: How to connect your AMX player
PosterDigital AMX Tutorial: How to connect your AMX playerPosterDigital AMX Tutorial: How to connect your AMX player
PosterDigital AMX Tutorial: How to connect your AMX player
PosterDigital
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
confluent
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Slim Baltagi
 
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
Nawanan Theera-Ampornpunt
 
Tutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduceTutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduce
mudassar mulla
 

Viewers also liked (16)

Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
PosterDigital SpinetiX Tutorial: How to connect your SpinetiX
PosterDigital SpinetiX Tutorial: How to connect your SpinetiXPosterDigital SpinetiX Tutorial: How to connect your SpinetiX
PosterDigital SpinetiX Tutorial: How to connect your SpinetiX
 
PPT-1
PPT-1PPT-1
PPT-1
 
PosterDigital: Starter guide English
PosterDigital: Starter guide EnglishPosterDigital: Starter guide English
PosterDigital: Starter guide English
 
Latte art ok
Latte art okLatte art ok
Latte art ok
 
presentation brand
presentation brandpresentation brand
presentation brand
 
Pestaña inicio
Pestaña inicioPestaña inicio
Pestaña inicio
 
HIV Powerprint Presentation final review
HIV Powerprint Presentation final reviewHIV Powerprint Presentation final review
HIV Powerprint Presentation final review
 
Ronnie Mathews1
Ronnie Mathews1Ronnie Mathews1
Ronnie Mathews1
 
Ville de pekin
Ville de pekinVille de pekin
Ville de pekin
 
PosterDigital AMX Tutorial: How to connect your AMX player
PosterDigital AMX Tutorial: How to connect your AMX playerPosterDigital AMX Tutorial: How to connect your AMX player
PosterDigital AMX Tutorial: How to connect your AMX player
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
มาตรฐานการป้องกันความลับของข้อมูลผู้ป่วย (23 มี.ค. 2559)
 
Tutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduceTutorial hadoop hdfs_map_reduce
Tutorial hadoop hdfs_map_reduce
 

Similar to Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
DataWorks Summit/Hadoop Summit
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
DataWorks Summit/Hadoop Summit
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 
Data streaming
Data streamingData streaming
Data streaming
Alberto Paro
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
Micron Technology
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
sureshraj43
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
Timothy Spann
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
Timothy Spann
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Robert Metzger
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
Timothy Spann
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
confluent
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
Henry Saputra
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Robert Metzger
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Timothy Spann
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
Timothy Spann
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
Robert Metzger
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 

Similar to Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks (20)

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
 
Data streaming
Data streamingData streaming
Data streaming
 
Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?Connect K of SMACK:pykafka, kafka-python or?
Connect K of SMACK:pykafka, kafka-python or?
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 

Recently uploaded

一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 

Recently uploaded (20)

一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks

  • 1. Overview of Apache Flink: the 4G of Big Data Analytics Frameworks. Hadoop Summit Europe, Dublin, Ireland, April 13th, 2016. Slim Baltagi, Director, Enterprise Architecture, Capital One Financial Corporation.
  • 2. Agenda: 1. How Apache Flink is a multi-purpose Big Data Analytics Framework? 2. Why streaming analytics are emerging? 3. Why Flink is suitable for real-world streaming analytics? 4. What are some novel use cases enabled by Flink? 5. Who is using Flink? 6. Where do you go from here?
  • 3. 1. How Apache Flink is a multi-purpose Big Data Analytics Framework? 1.1. What is Apache Flink Stack? 1.2. Why Apache Flink is the 4G of Big Data Analytics? 1.3. What are Apache Flink Innovations?
  • 4. 1.1. What is Apache Flink Stack? (layered diagram)
      APIs & LIBRARIES: DataSet (Java/Scala/Python) for Batch Processing; DataStream (Java/Scala) for Stream Processing. Libraries and integrations on top of the two APIs: Gelly, Gelly-Stream, Table, FlinkML, FlinkCEP, Hadoop M/R, Storm, Cascading, MRQL, SAMOA, Apache Beam, Zeppelin.
      SYSTEM: Distributed Streaming Dataflow Engine, with Batch Optimizer and Stream Builder.
      DEPLOY: Local (Single JVM, Embedded, Docker); Cluster (Standalone, YARN, Mesos (WIP)); Cloud (Google’s GCE, Amazon’s EC2, IBM Docker Cloud, …).
      STORAGE: Files (Local, HDFS, S3, Azure, Alluxio); Databases (MongoDB, HBase, SQL, …); Streams (Flume, Kafka, MapR Streams, RabbitMQ, …).
  • 5. 1.2. Why Apache Flink is the 4G of Big Data Analytics?
      1G, MapReduce: Batch.
      2G, Directed Acyclic Graph (DAG) Dataflows: Batch, Interactive.
      3G, RDD (Resilient Distributed Datasets): Batch, Interactive, Near-Real Time Streaming (micro-batches), Iterative processing.
      4G, Cyclic Dataflows: Hybrid, Interactive, Real-Time Streaming + Real-World Streaming (out of order streams, windowing, backpressure, CEP, …), Native Iterative processing.
  • 6. 1.3. What are Apache Flink Innovations? Apache Flink came with many innovations. Some of these innovations are influencing features in other frameworks, such as: 1. Custom memory management and binary processing, in Flink from day one, inspired Apache Spark to do so for its Project Tungsten since version 1.6 • https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html • https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html 2. The DataSet API has been in Flink since its early days and inspired Apache Spark to introduce its Dataset API in version 1.6 • https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html • https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
  • 7. 1.3. What are Apache Flink Innovations? 3. Flink’s rich windowing semantics for streaming: Flink supports windows over time, count, or sessions. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink inspired both Apache Storm (1.0.0 was released on April 12th, 2016) and Spark Streaming (version 2.0 is expected in May 2016) to start supporting rich windowing • https://storm.apache.org/2016/04/12/storm100-released.html • http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia/15
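The three window kinds named in slide 7 (time, count, session) can be illustrated with a minimal plain-Python sketch. This is not Flink's API: the function names `tumbling_windows`, `count_windows` and `session_windows`, the `(timestamp, value)` event shape, and the choice to drop a trailing partial count window are all assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping time windows."""
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size  # window that contains this timestamp
        windows[(start, start + size)].append(value)
    return dict(windows)


def count_windows(values, size):
    """Emit one window per `size` consecutive elements; a trailing partial window is dropped."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def session_windows(events, gap):
    """Group events into sessions; a new session starts after `gap` or more of inactivity."""
    sessions = []
    for ts, value in sorted(events, key=lambda e: e[0]):
        if sessions and ts - sessions[-1][-1][0] < gap:
            sessions[-1].append((ts, value))  # still inside the current session
        else:
            sessions.append([(ts, value)])    # inactivity gap reached: open a new session
    return sessions


events = [(1, "a"), (4, "b"), (11, "c"), (12, "d"), (30, "e")]
print(tumbling_windows(events, 10))
print(count_windows([v for _, v in events], 2))
print(session_windows(events, 5))
```

Time and count windows have boundaries fixed in advance, while session boundaries depend on the data itself, which is why session support is usually the harder feature for an engine to provide.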
  • 8. 1.3. What are Apache Flink Innovations? Some of Flink's innovations are not available in other open source tools, such as: 1. The only hybrid (Real-Time Streaming + Batch) distributed data processing engine natively supporting many use cases: Batch, Real-Time streaming, Machine learning, Graph processing and Relational queries 2. Native iterations (Iterate and DeltaIterate) dramatically boost the performance of Machine learning and Graph analytics workloads that require iterations.
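The idea behind a delta iteration can be sketched in plain Python: instead of recomputing every element on each pass, only the elements that changed in the previous pass (the workset) are revisited. The example below computes connected components by propagating the smallest vertex id. It is a single-process illustration under assumed names (`connected_components`, `solution`, `workset`) and does not use Flink's Iterate or DeltaIterate operators.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its connected component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    solution = {v: v for v in vertices}  # solution set: current label per vertex
    workset = set(vertices)              # vertices whose label may still spread
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]  # a smaller label reached neighbor n
                    next_workset.add(n)        # only changed vertices are revisited
        workset = next_workset
    return solution


print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

As labels converge the workset shrinks, so later passes touch only a small part of the graph; this is the source of the speed-up that slide 8 attributes to native iterations.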
  • 9. The only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases: Real-Time stream processing, Machine Learning at scale, Graph Analysis, Batch Processing
  • 10. 1.3. What are Apache Flink Innovations? 3. Simplicity of configuration: Flink requires no memory thresholds to configure, no complicated network configurations, no serializers to be configured, … 4. Little tuning required: Flink’s optimizer can choose execution strategies automatically in any environment. According to Mike Olson, Chief Strategy Officer of Cloudera Inc.: “Spark is too knobby — it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change.” Reference: http://vision.cloudera.com/one-platform/
  • 11. 1.3. What are Apache Flink Innovations? 5. Full support of Apache Beam (for combination of Batch and Stream): event time, sessions, … References: • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, 2015 http://research.google.com/pubs/pub43864.html • Dataflow/Beam & Spark: A Programming Model Comparison, February 3rd, 2016 https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison 6. Innovations in stream processing: event time, rich streaming window operations, savepoints, … • http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ • http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
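Event-time processing, mentioned in slide 11, means windows are assigned by the timestamp carried in each event rather than by arrival time, with a watermark deciding when a window can be closed. The plain-Python sketch below is a simplification under stated assumptions: the watermark is taken as the largest event time seen minus a fixed `max_delay`, late events are simply collected rather than handled, and the function name `event_time_windows` is hypothetical. It is not Flink's implementation.

```python
def event_time_windows(stream, size, max_delay):
    """Assign out-of-order (event_time, value) pairs to tumbling event-time windows.

    Returns (fired, late): windows in the order they were closed by the
    watermark, and events that arrived after their window had closed.
    """
    open_windows = {}
    fired, late = [], []
    watermark = float("-inf")
    for event_time, value in stream:
        start = (event_time // size) * size
        if start + size <= watermark:
            late.append((event_time, value))  # its window was already closed
        else:
            open_windows.setdefault(start, []).append(value)
        # Assumed watermark rule: largest event time seen so far minus max_delay.
        watermark = max(watermark, event_time - max_delay)
        for s in sorted(open_windows):
            if s + size <= watermark:
                fired.append((s, open_windows.pop(s)))
    for s in sorted(open_windows):            # end of input: flush what is left
        fired.append((s, open_windows.pop(s)))
    return fired, late


stream = [(1, "a"), (3, "b"), (12, "c"), (8, "d"), (25, "e"), (9, "f")]
print(event_time_windows(stream, size=10, max_delay=5))
```

In this run the event at time 8 arrives after the event at time 12 but still lands in the window starting at 0, because the watermark has not yet passed 10. The event at time 9 arrives after the watermark has reached 20 and is reported as late.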
  • 12. 1.3. What are Apache Flink Innovations? 7. FlinkCEP is the Complex Event Processing library for Flink. It allows you to easily detect complex event patterns in an endless stream of data to support better insight and decision making. • Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016 http://flink.apache.org/news/2016/04/06/cep-monitoring.html • FlinkCEP - Complex event processing for Flink https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html 8. Run Legacy Big Data applications on Flink: Preserve your investment in your legacy Big Data applications by running your legacy code on Flink’s powerful engine today, using the Hadoop and Storm compatibility layers and the Cascading adapter, and probably a Spark adapter in the future.
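To make "detecting a complex event pattern" concrete, here is a hand-rolled plain-Python matcher for one pattern: two consecutive over-threshold temperature readings from the same sensor within a time bound. The event shape `(timestamp, sensor, temperature)`, the function name `detect_warnings`, and the threshold values are assumptions for illustration. FlinkCEP expresses such patterns declaratively; this sketch does not use its API.

```python
def detect_warnings(events, threshold, within):
    """Return (sensor, first_ts, second_ts) for each pair of consecutive
    over-threshold readings from one sensor at most `within` time units apart."""
    last_hot = {}  # sensor -> timestamp of its previous reading, if that one was over threshold
    matches = []
    for ts, sensor, temp in events:
        if temp > threshold:
            prev = last_hot.get(sensor)
            if prev is not None and ts - prev <= within:
                matches.append((sensor, prev, ts))
            last_hot[sensor] = ts
        else:
            last_hot.pop(sensor, None)  # a normal reading breaks the sequence
    return matches


readings = [
    (1, "rack1", 95), (2, "rack2", 101), (3, "rack1", 102), (4, "rack1", 104),
    (5, "rack2", 99), (6, "rack2", 103), (20, "rack1", 110),
]
print(detect_warnings(readings, threshold=100, within=10))
```

The per-sensor state kept in `last_hot` is the kind of bookkeeping a CEP library manages on the user's behalf, along with time bounds and partial matches.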
  • 13. 13 Run your legacy Big Data applications on Flink Flink’s MapReduce compatibility layer allows to run legacy Hadoop MapReduce jobs, reuse Hadoop input and output formats and reuse functions like Map and Reduce. https://ci.apache.org/projects/flink/flink-docs- master/apis/batch/hadoop_compatibility.html Cascading on Flink allows to port existing Cascading-MapReduce applications to Apache Flink with virtually no code changes. Expected advantages are performance boost and less resources consumption. https://github.com/dataArtisans/cascading-flink/tree/release-0.2 Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm: Execute existing Storm topologies using Flink as the underlying engine. Reuse legacy application code (bolts and spouts) inside Flink programs. https://ci.apache.org/projects/flink/flink-docs- master/apis/streaming/storm_compatibility.html
  • 15. 2. Why streaming analytics are emerging? Stonebraker et al. predicted in 2005 that stream processing would become increasingly important and attributed this to the ‘sensorization of the real world: everything of material significance on the planet get ‘sensor-tagged’ and report its state or location in real time’. Reference: http://cs.brown.edu/~ugur/8rulesSigRec.pdf I think stream processing is becoming important not only because of this sensorization of the real world but also because of the following factors: 1. Data streams 2. Technology 3. Business 4. Customers
  • 16. 2. Why streaming analytics are emerging? [Diagram] Emergence of Streaming Analytics, driven by: 1. Data Streams 2. Technology 3. Business 4. Customers
  • 17. 2. Why streaming analytics are emerging? 1. Data Streams: Real-world data is available as a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. Examples: • Sensor network data • Web logs • Database transactions • System logs • Tweets and social media data in general • Click streams • Mobile app data
  • 18. 2. Why streaming analytics are emerging? 2. Technology: • Simplified data architecture with Apache Kafka as a major innovation and backbone of streaming architectures. • Rapidly maturing open source streaming analytics tools: Apache Flink, Apache Spark’s Streaming module, Kafka Streams, Apache Samza, Apache Storm, Apache NiFi, … • Cloud services for stream processing: Google Cloud Dataflow, Azure Stream Analytics, Amazon Kinesis Streams, IBM InfoSphere Streams, … • Vendors innovating in this space: Data Artisans, DataTorrent, Striim, Databricks, MapR, Hortonworks, Confluent, StreamSets, … • More mobile devices than human beings!
  • 19. 2. Why streaming analytics are emerging? 3. Business. Challenges: • Lag between data creation and actionable insights. • Web and mobile application growth, new types and sources of data. • Organizations need to shift from a reactive approach to a more proactive approach to interactions with customers, suppliers and employees. Opportunities: • Embracing streaming analytics helps organizations with faster time to insight, competitive advantages and operational efficiency in a wide range of verticals. • With streaming analytics, new startups are challenging, or will challenge, established companies. Example: Pay-As-You-Go insurance or Usage-Based Auto Insurance. • Speed is said to have become the new currency of business.
  • 20. 2. Why streaming analytics are emerging? 4. Customers: • Customers increasingly demand instant responses, as they are used to on social networks: Twitter, Facebook, LinkedIn, … • The younger generation, who grew up with video gaming and are accustomed to real-time interaction, are now themselves a growing class of customers.
  • 22. 3. Why Flink is suitable for real-world streaming analytics? 3.1. Flink’s streaming analytics features 3.2. What are some streaming analytics use cases suitable for Flink?
  • 23. 3.1. Flink’s streaming analytics features: Apache Flink 1.0, which was released on March 8th 2016, comes with a competitive set of streaming analytics features, some of which are unique in the open source domain. Apache Flink 1.0.1 was released on April 6th 2016. The combination of these features makes Apache Flink a unique choice for real-world streaming analytics. Let’s discuss some of Apache Flink’s features for real-world streaming analytics.
  • 24. 3.1. Flink’s streaming analytics features 1. Pipelined processing engine 2. Stream abstraction: DataStream as in the real-world 3. Performance: Low latency and high throughput 4. Support for rich windowing semantics 5. Support for different notions of time 6. Stateful stream processing 7. Fault tolerance and correctness 8. High Availability 9. Backpressure handling 10. Expressive and easy-to-use APIs in Scala and Java 11. Support for batch 12. Integration with the Hadoop ecosystem
  • 25. 1. Pipelined processing engine • Flink is a pipelined (streaming) engine akin to parallel database systems, rather than a batch engine such as Spark. • ‘Flink’s runtime is not designed around the idea that operators wait for their predecessors to finish before they start, but they can already consume partially generated results.’ • ‘This is called pipeline parallelism and means that several transformations in a Flink program are actually executed concurrently with data being passed between them through memory and network channels.’ http://data-artisans.com/apache-flink-new-kid-on-the-block/
  • 26. 2. Stream abstraction: DataStream as in the real-world • Real-world data is a series of events that are continuously produced by a variety of applications and disparate systems inside and outside the enterprise. • Flink, as a stream processing system, models streams as what they are in the real world, a series of events, and uses DataStream as an abstraction. • Spark, as a batch processing system, approximates these streams as micro-batches and uses DStream as an abstraction. This adds an artificial latency!
  • 27. 3. Performance: Low latency and high throughput • The pipelined processing engine enables true low-latency streaming applications with results in milliseconds. • High throughput: efficiently handles high volumes of streams (millions of events per second). • Tunable latency/throughput tradeoff: a tuning knob lets you navigate the latency-throughput trade-off. • Yahoo! benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions grouped by campaign). • Full Yahoo! article (the benchmark stops at low write throughput and the programs are not fault tolerant): https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  • 28. 3. Performance: Low latency and high throughput • Full Data Artisans article, which extends the Yahoo! benchmark to high volumes and uses Flink’s built-in state: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ • Flink outperformed both Spark Streaming and Storm in this benchmark modeled after a real-world application: • Flink achieves a throughput of 15 million messages/second on a 10-machine cluster. This is 35x higher throughput compared to Storm (80x compared to Yahoo’s runs). • Flink ran with exactly-once guarantees, Storm with at-least-once. • Ultimately, you need to test the performance of your own streaming analytics application, as it depends on your own logic and the version of your preferred stream processing tool!
  • 29. 4. Support for rich windowing semantics • Flink provides rich windowing semantics. A window is a grouping of events based on some function of time (all records of the last 5 minutes), count (the last 10 events) or session (all the events of a particular web user). • Window types in Flink: • Tumbling windows (no overlap) • Sliding windows (with overlap) • Session windows (gap of activity) • Custom windows (with assigners, triggers and evictors)
  • 30. 4. Support for rich windowing semantics • In many systems, these windows are hard-coded and connected with the system’s internal checkpointing mechanism. Flink is the first open source streaming engine that completely decouples windowing from fault tolerance, allowing for richer forms of windows, such as sessions. Further reading: • http://flink.apache.org/news/2015/12/04/Introducing-windows.html • http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
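The window types listed above can be made concrete with a small sketch. It is plain Java and does not use Flink's window assigners; the class and method names are hypothetical, timestamps are assumed to be non-negative milliseconds, and the session rule (a new session starts when the gap between consecutive events exceeds the configured gap) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helper, not part of the Flink API.
public class WindowAssignment {

    /** Start of the tumbling window of length sizeMs that contains the timestamp (timestamp >= 0). */
    public static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Starts of all sliding windows [start, start + sizeMs) that contain the timestamp, latest first. */
    public static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = timestampMs - (timestampMs % slideMs);
        for (long start = latestStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    /** Number of sessions in sorted timestamps; a gap larger than gapMs starts a new session. */
    public static int countSessions(long[] sortedTimestampsMs, long gapMs) {
        if (sortedTimestampsMs.length == 0) {
            return 0;
        }
        int sessions = 1;
        for (int i = 1; i < sortedTimestampsMs.length; i++) {
            if (sortedTimestampsMs[i] - sortedTimestampsMs[i - 1] > gapMs) {
                sessions++;
            }
        }
        return sessions;
    }

    public static void main(String[] args) {
        System.out.println(tumblingWindowStart(7_000L, 5_000L));
        System.out.println(slidingWindowStarts(7_000L, 5_000L, 1_000L));
        System.out.println(countSessions(new long[] {0L, 1_000L, 10_000L}, 5_000L));
    }
}
```

An event belongs to exactly one tumbling window but to several sliding windows, which is why the sliding variant returns a list.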
  • 31. 5. Support for different notions of time • In a streaming program with Flink, for example to define windows with respect to time, one can refer to different notions of time: • Event time: when an event actually happened in the real world. • Ingestion time: when data is loaded into Flink, from Kafka for example. • Processing time: when data is processed by Flink. • In the real world, streams of events rarely arrive in the order in which they are produced, due to distributed sources, non-synced clocks, network delays… They are said to be ‘out of order’ streams. • Flink is the first open source streaming engine that supports out-of-order streams and is able to consistently process events according to their event time.
  • 32. 5. Support for different notions of time. Further reading: • http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html • https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#time • https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/event_time.html • http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
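One way to picture event-time processing of an out-of-order stream is a buffer that releases events only once a watermark has passed them. The sketch below is plain Java, not Flink's watermark API; the class name, the fixed out-of-orderness bound and the rule "watermark = highest timestamp seen minus the bound" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: hypothetical class, not part of the Flink API.
public class EventTimeReorder {
    private final long maxOutOfOrdernessMs;
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();
    private long maxSeenTimestampMs = Long.MIN_VALUE;

    public EventTimeReorder(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /**
     * Accepts one event timestamp in arrival order and returns the timestamps
     * that are at or below the current watermark, in event-time order.
     * An event that arrives later than the bound allows is released immediately;
     * a real system would need an explicit policy for such late events.
     */
    public List<Long> onEvent(long eventTimeMs) {
        maxSeenTimestampMs = Math.max(maxSeenTimestampMs, eventTimeMs);
        long watermark = maxSeenTimestampMs - maxOutOfOrdernessMs;
        buffer.add(eventTimeMs);
        List<Long> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek() <= watermark) {
            ready.add(buffer.poll());
        }
        return ready;
    }

    public static void main(String[] args) {
        EventTimeReorder reorder = new EventTimeReorder(2_000L);
        long[] arrivals = {1_000L, 3_000L, 2_000L, 6_000L};
        for (long t : arrivals) {
            System.out.println("arrived " + t + ", released " + reorder.onEvent(t));
        }
    }
}
```

The bound trades latency for completeness: a larger bound tolerates more disorder but delays results by the same amount.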
  • 33. 6. Stateful stream processing • Many operations in a dataflow simply look at one individual event at a time, for example an event parser. • Other operations, called stateful operations, need to store data at the end of a window for computations occurring in later windows. • So where is the state of these stateful operations maintained?
  • 34. 6. Stateful stream processing • The state can be stored in memory, in the file system, or in RocksDB, which is an embedded key/value data store and not an external database. • Flink also supports state versioning through savepoints, which are checkpoints of the state of a running streaming job that can be manually triggered by the user while the job is running. • Savepoints enable: • Code upgrades: both application and framework • Cluster maintenance and migration • A/B testing and what-if scenarios • Testing and debugging • Restarting a job with adjusted parallelism. Further reading: • http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/ • https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
  • 35. 7. Fault tolerance and correctness • How do we ensure that the state is correct after failures? • Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications. • This ensures that even in the presence of failures, the operators do not perform duplicate updates to their state (exactly-once guarantees). This basically means that the computed results are the same whether there are failures along the way or not. • There is a switch to downgrade the guarantees to at-least-once if the use case tolerates duplicate updates.
  • 36. 7. Fault tolerance and correctness. Further reading: • High-throughput, low-latency, and exactly-once stream processing with Apache Flink http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ • Data Streaming Fault Tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html • ‘Lightweight Asynchronous Snapshots for Distributed Dataflows’, June 28, 2015 http://arxiv.org/pdf/1506.08603v1.pdf • Distributed Snapshots: Determining Global States of Distributed Systems, February 1985 (the Chandy-Lamport algorithm) http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
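The claim that results are the same with or without failures can be illustrated on a single operator. The sketch below is plain Java and deliberately much simpler than the distributed snapshot mechanism described in the references: it assumes a replayable source, one summing operator, and a checkpoint that records the source position and the operator state together. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a single-operator simulation, not Flink's checkpointing mechanism.
public class CheckpointedSum {

    /** Snapshot that captures the source position and the operator state together. */
    static final class Checkpoint {
        final int offset;
        final long sum;

        Checkpoint(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    /**
     * Sums a replayable input, taking a checkpoint every `interval` records (interval > 0).
     * Simulates one crash right after the record at index failAt was processed
     * (pass -1 for a failure-free run), then restores the last checkpoint and replays.
     */
    public static long run(List<Integer> input, int interval, int failAt) {
        Checkpoint last = new Checkpoint(0, 0L);
        int offset = 0;
        long sum = 0L;
        boolean failed = false;
        while (offset < input.size()) {
            sum += input.get(offset);
            offset++;
            if (!failed && offset - 1 == failAt) {
                // Crash: discard in-flight state, roll back state and source position together.
                failed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            if (offset % interval == 0) {
                last = new Checkpoint(offset, sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(run(input, 2, -1)); // failure-free run
        System.out.println(run(input, 2, 2));  // run with one simulated crash
    }
}
```

Because the offset and the sum are restored together, records replayed after the crash are counted exactly once in the final state. Restoring only the offset, or only the sum, would double-count or lose records.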
  • 37. 8. High Availability • In the real world, streaming analytics applications need to be reliable, capable of running jobs for months, and resilient in the event of failures. • The JobManager (master) is responsible for scheduling and resource management. If it crashes, no new programs can be submitted and running programs will fail. • Flink provides a High Availability (HA) mode to recover from a JobManager crash and eliminate the Single Point Of Failure (SPOF). • Further reading: JobManager High Availability https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html
  • 38. 9. Backpressure handling • In the real world, there are situations where a system is receiving data at a higher rate than it can normally process. This is called backpressure. • Flink handles backpressure implicitly through its architecture without user interaction, while backpressure handling in Spark is through manual configuration: spark.streaming.backpressure.enabled. • Flink provides backpressure monitoring to allow users to understand bottlenecks in streaming applications. Further reading: • How Flink handles backpressure? by Ufuk Celebi, Kostas Tzoumas and Stephan Ewen, August 31, 2015. http://data-artisans.com/how-flink-handles-backpressure/
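A common way to get implicit backpressure is to connect producer and consumer through a bounded buffer, so that a full buffer slows the producer down without any explicit rate configuration. The sketch below shows that general idea in plain Java with a blocking queue. It is an assumption-laden illustration, not Flink's network stack, and the class name and parameters are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: bounded-buffer backpressure between two threads.
public class BoundedBufferBackpressure {

    /**
     * Runs a fast producer against a slow consumer through a buffer of the given
     * capacity (capacity >= 1) and returns how many records the consumer received.
     */
    public static int run(int records, int capacity) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < records; i++) {
                    buffer.put(i); // blocks while the buffer is full, which throttles the producer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int received = 0;
        try {
            for (int i = 0; i < records; i++) {
                buffer.take();
                Thread.sleep(1); // simulate a consumer that is slower than the producer
                received++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println("received " + run(20, 4) + " records");
    }
}
```

No record is dropped and the buffer never holds more than its capacity: the producer simply runs at the consumer's pace once the buffer is full.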
  • 39. 10. Expressive and easy-to-use APIs in Scala and Java • The high-level, expressive and easy-to-use DataStream API with flexible window semantics results in significantly less custom application logic compared to other open source stream processing solutions. • Flink's DataStream API ports many operators from its DataSet batch processing API, such as map, reduce, and join, to the streaming world. • In addition, it provides stream-specific operations such as window, split, connect, … • Its support for user-defined functions eases the implementation of custom application behavior. • The DataStream API is available in Scala and Java.
  • 40. 10. Expressive and easy-to-use APIs in Scala and Java
      case class Word (word: String, frequency: Int)
    DataStream API (streaming): Window WordCount
      val env = StreamExecutionEnvironment.getExecutionEnvironment()
      val lines: DataStream[String] = env.fromSocketStream(...)
      lines.flatMap {line => line.split(" ")
          .map(word => Word(word,1))}
        .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
        .keyBy("word").sum("frequency")
        .print()
      env.execute()
    DataSet API (batch): WordCount
      val env = ExecutionEnvironment.getExecutionEnvironment()
      val lines: DataSet[String] = env.readTextFile(...)
      lines.flatMap {line => line.split(" ")
          .map(word => Word(word,1))}
        .groupBy("word").sum("frequency")
        .print()
      env.execute()
  • 41. 11. Support for batch • In Flink, batch processing is a special case of stream processing, as finite data sources are just streams that happen to end. • Flink offers a full toolset for batch processing with a dedicated DataSet API and libraries for machine learning and graph processing. • In addition, Flink contains several batch-specific optimizations, such as for scheduling, memory management, and query optimization. • Flink outperforms dedicated batch processing engines such as Spark and Hadoop MapReduce in batch use cases.
  • 42. 12. Integration with the Hadoop ecosystem [Diagram; surviving labels: POSIX, Java/Scala Collections]
  • 43. 3.2. What are some streaming analytics use cases suitable for Flink? 1. Financial services 2. Telecommunications 3. Online gaming systems 4. Security & Intelligence 5. Advertisement serving 6. Sensor Networks 7. Social Media 8. Healthcare 9. Oil & Gas 10. Retail & eCommerce 11. Transportation and logistics
  • 45. 4. What are some novel use cases enabled by Flink? 4.1. Flink as an embedded key/value data store 4.2. Flink as a distributed CEP engine
  • 46. 4.1. Flink as an embedded key/value data store • The stream processor as a database: a new design pattern for data streaming applications, using Apache Flink and Apache Kafka: building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams. • The stateful operator features in Flink allow a streaming application to query state in the stream processor instead of in a key/value store, which is often a bottleneck. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  • 47. The “state querying” feature is expected in the upcoming Flink 1.1 http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed/38
  • 48. 4.2. Flink as a distributed CEP engine • Flink stream processor as a CEP (Complex Event Processing) engine. Example: an application that ingests network monitoring events, identifies access patterns such as intrusion attempts using FlinkCEP, and analyzes and aggregates the identified access patterns. • Upcoming talk: ‘Streaming analytics and CEP - Two sides of the same coin’ by Till Rohrmann and Fabian Hueske at Berlin Buzzwords, June 05-07 2016. http://berlinbuzzwords.de/session/streaming-analytics-and-cep-two-sides-same-coin Further reading: • Introducing Complex Event Processing (CEP) with Apache Flink, Till Rohrmann, April 6, 2016 http://flink.apache.org/news/2016/04/06/cep-monitoring.html • FlinkCEP - Complex event processing for Flink https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
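To give a feel for what detecting an access pattern means, here is a minimal sketch of one hypothetical intrusion pattern: a successful login that directly follows a run of failed logins. It is plain Java over an in-memory list and does not use the FlinkCEP pattern API; the event encoding ("FAIL" / "SUCCESS"), the threshold and the class name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hand-written pattern matcher, not the FlinkCEP API.
public class FailedLoginPattern {

    /**
     * Returns the indexes of SUCCESS events that directly follow at least
     * `threshold` consecutive FAIL events. Any other event resets the run.
     */
    public static List<Integer> detect(List<String> events, int threshold) {
        List<Integer> matches = new ArrayList<>();
        int consecutiveFailures = 0;
        for (int i = 0; i < events.size(); i++) {
            String event = events.get(i);
            if ("FAIL".equals(event)) {
                consecutiveFailures++;
            } else {
                if ("SUCCESS".equals(event) && consecutiveFailures >= threshold) {
                    matches.add(i);
                }
                consecutiveFailures = 0;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> events =
                Arrays.asList("FAIL", "FAIL", "FAIL", "SUCCESS", "FAIL", "SUCCESS");
        System.out.println(detect(events, 3));
    }
}
```

A CEP library generalizes this kind of hand-written state machine: the pattern is declared once and the engine keeps the partial-match state per key across an unbounded stream.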
  • 50. 5. Who is using Flink? Some companies using Flink for streaming analytics (logos grouped by sector): Telecommunications, Retail, Financial Services, Gaming, Security. Powered by Flink page: https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink
  • 51. 5. Who is using Flink? • Twitter held its hack week, and the winner, announced on December 18th 2015, was a Flink-based streaming project! Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink, posted on February 2, 2016 by Jamie Grier: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ http://www.slideshare.net/JamieGrier/stateful-stream-processing-at-inmemory-speed • Yahoo! ran benchmarks comparing the performance of one of their use cases, originally implemented on Apache Storm, against Spark Streaming and Flink. Results posted on December 18, 2015: • http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at • http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ • https://github.com/dataArtisans/yahoo-streaming-benchmark • http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark
  • 52. Generic Streaming Analytics Architectural pattern:
    Event Producers: Apps, Devices, Sensors
    Collector: Flume, SpringXD, Logstash, Nifi, Fluentd
    Broker: Kafka, RabbitMQ, JMS, Amazon Kinesis, Google Cloud Pub/Sub, MapR Streams
    Processor: Flink, Spark, Storm, Samza, Kafka Streams
    Indexer: ElasticSearch, Solr, Cassandra, HBase, MapR DB, MongoDB, Apache Geode
    Visualizer/Search: Kibana, Custom GUI
    This is changing with Flink’s alerts, StreamSQL, state querying, FlinkCEP, …
  • 54. 6. Where do you go from here? A few resources for you: • Flink Knowledge Base: One-Stop for everything related to Apache Flink. By Slim Baltagi http://sparkbigdata.com/component/tags/tag/27-flink • Flink at the Apache Software Foundation: flink.apache.org/ • Free Apache Flink training from data Artisans http://dataartisans.github.io/flink-training • Flink Forward Conference, 12-14 September 2016, Berlin, Germany http://flink-forward.org/ (call for submissions announced today, April 13th, 2016!) • Free ebook from MapR: Streaming Architecture: New Designs Using Apache Kafka and MapR Streams https://www.mapr.com/streaming-architecture-using-apache-kafka-mapr-streams
  • 55. 6. Where do you go from here? A few takeaways: • Apache Flink’s unique capabilities enable new and sophisticated use cases, especially for real-world streaming analytics. • Customer demand will push major Hadoop distributors to package Flink and support it. • What would be the 5G of Big Data Analytics platforms? Guiding principles would be Unification, Simplification and Ease of use: • GUI to build batch and streaming applications • Unified API for batch and streaming • Single engine for batch and streaming • Unified storage layer (files, streams, NoSQL) • Unified query engine for SQL, NoSQL and structured streams
  • 56. Thanks! To all of you for attending! Let’s keep in touch! • sbaltagi@gmail.com • @SlimBaltagi • https://www.linkedin.com/in/slimbaltagi Any questions?