Overview of Apache Flink: Next-Gen Big Data Analytics Framework
By Slim Baltagi @SlimBaltagi
Chicago Apache Flink Meetup
June 30th, 2015
Agenda
I. What is Apache Flink stack and how it fits
into the Big Data ecosystem?
II. How Apache Flink integrates with Apache
Hadoop and other open source tools?
III. Why Apache Flink is an alternative to
Apache Hadoop MapReduce, Apache Storm
and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
VI. What are some Key Takeaways?
I. What is Apache Flink stack and how it
fits into the Big Data ecosystem?
1. What is Big Data?
2. What is a typical Big Data Analytics Stack?
3. What is Apache Flink?
4. What is Flink Execution Engine?
5. What are Flink APIs?
6. What are Flink Domain Specific Libraries?
7. How Flink offers Interactive Data Analysis?
8. What is Flink Architecture?
9. What is Flink Programming Model?
1. What is Big Data?
“Big Data refers to data sets large enough and data
streams fast enough, from heterogeneous data
sources, that has outpaced our capability to store,
process, analyze, and understand.”
2. What is a typical Big Data Analytics
Stack: Hadoop, Spark, Flink, …?
3. What is Apache Flink?
• Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data analytics. The Flink engine exploits data streaming, in-memory processing and iteration operators to improve performance.
• Apache Flink has its origins in a research project called Stratosphere, whose idea was conceived in 2008 by professor Volker Markl of the Technische Universität Berlin in Germany.
• In German, "flink" means agile or swift. Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top-Level Project (TLP) in December 2014.
3. What is Apache Flink?
The Apache Flink framework, written in Java, provides:
1. Big data processing engine: distributed and
scalable streaming dataflow engine
2. Several APIs in Java/Scala/Python:
• DataSet API – Batch processing
• DataStream API – Real-Time streaming analytics
• Table API - Relational Queries
3. Domain-Specific Libraries:
• FlinkML: Machine Learning Library for Flink
• Gelly: Graph Library for Flink
4. Shell for interactive data analysis
Key Vision of Apache Flink
• Draws on concepts from MPP database technology: declarativity, query optimization, efficient parallel in-memory and out-of-core algorithms
• Draws on concepts from Hadoop MapReduce technology: massive scale-out, user-defined functions, complex data types, schema on read
• Adds: streaming, iterations, advanced dataflows, general APIs
What is Apache Flink stack?
• APIs & Libraries: the DataSet API (Java/Scala/Python) for batch processing and the DataStream API (Java/Scala) for stream processing, with libraries and adapters on top: Gelly, Table, ML, SAMOA, Hadoop M/R compatibility, Google Dataflow (streaming support WiP), MRQL, Cascading (WiP), and the Zeppelin notebook
• System: a runtime built around a distributed streaming dataflow, with a batch optimizer and a stream builder on top
• Deploy: local (single JVM, embedded), cluster (standalone, YARN, Tez) or cloud (Google's GCE, Amazon's EC2)
• Storage: files (local, HDFS, S3, Tachyon), databases (MongoDB, HBase, SQL, …), streams (Flume, Kafka, RabbitMQ, …)
4. What is Flink Execution Engine?
The core of Flink is a distributed and scalable streaming
dataflow engine with some unique features:
1. True streaming capabilities: Execute everything as
streams
2. Native iterative execution: Allow some cyclic
dataflows
3. Handling of mutable state
4. Custom memory manager: Operate on managed
memory
5. Cost-Based Optimizer: for both batch and stream
processing
The only hybrid (Real-Time Streaming + Batch)
open source distributed data processing engine
supporting many use cases:
• Real-time stream processing
• Machine learning at scale
• Graph analysis
• Batch processing
Why Apache Flink is Next-Gen?
• 1st Generation (1G): MapReduce. Batch only.
• 2nd Generation (2G): Direct Acyclic Graph (DAG) dataflows. Batch, interactive.
• 3rd Generation (3G): RDDs (Resilient Distributed Datasets). Batch, interactive, near-real-time streaming, iterative processing.
• 4th Generation (4G): cyclic dataflows. Batch, interactive, real-time streaming, native iterative processing.
5. Flink APIs
5.1 DataSet API for static data - Java, Scala,
and Python
5.2 DataStream API for unbounded real-time
streams - Java and Scala
5.3 Table API for relational queries - Java and
Scala
5.1 DataSet API – Batch processing
case class Word (word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()
5.2 DataStream API – Real-Time Streaming
Analytics
• Many time-critical applications need to process large streams of live data and provide results in real time. For example:
  • Fraud detection
  • Financial stock monitoring
  • Anomaly detection
  • Traffic management applications
  • Online recommenders
• Flink Streaming provides a high-throughput, low-latency stateful stream processing system with rich windowing semantics. It has built-in connectors to many data sources like Flume, Kafka, Twitter and RabbitMQ.
5.2 DataStream API – Real-Time Streaming
Analytics
• Still in beta as of June 24th, 2015 (Flink 0.9 release).
• Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API.
• Flink Streaming provides native support for iterative stream processing.
• Streaming fault tolerance, added in Flink 0.9 (released on June 24th, 2015), allows exactly-once processing guarantees for Flink streaming programs that analyze streaming sources persisted by Apache Kafka.
• See the paper 'Lightweight Asynchronous Snapshots for Distributed Dataflows' (June 28, 2015): http://arxiv.org/pdf/1506.08603v1.pdf
5.2 DataStream API – Real-Time Streaming
Analytics
• Data streaming fault tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
• Because Flink is based on a pipelined execution engine akin to parallel database systems, it can:
  • integrate streaming operations with rich windowing semantics seamlessly
  • process streaming operations in a pipelined way, with lower latency than micro-batch architectures and without the complexity of lambda architectures
• Flink Streaming web resources at the Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/49-flink-streaming
5.3 Table API – Relational Queries
Table API (queries):

val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = AUTOMOBILE")

val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as("orderId, custId, orderDate, shipPrio")

val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")

val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
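For comparison with the string-expression style above, the Scala Table API also accepts symbol expressions on a case-class DataSet. A minimal sketch, assuming the Flink 0.9-era Table API conversions (the Order case class, field names and sample rows are made up for illustration):

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._

case class Order(orderId: Int, custId: Int, revenue: Double)

val env = ExecutionEnvironment.getExecutionEnvironment

val orders: DataSet[Order] = env.fromElements(
  Order(1, 10, 120.0), Order(2, 10, 35.5), Order(3, 11, 99.0))

// Sum the revenue per customer using SQL-like expressions instead of manual grouping code
val revenuePerCustomer = orders.toTable
  .groupBy('custId)
  .select('custId, 'revenue.sum as 'totalRevenue)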
5.3 Table API – Relational Queries
• The Table API was added in February 2015. Still in beta as of June 24th, 2015 (Flink 0.9 release).
• Flink provides a Table API that allows specifying operations using SQL-like expressions instead of manipulating DataSet or DataStream directly.
• The Table API can be used in both batch programs (on structured data sets) and streaming programs (on structured data streams): http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html
• Flink Table web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/52-flink-table
6. Flink Domain Specific Libraries
6.1 FlinkML – Machine Learning Library
6.2 Gelly – Graph Analytics for Flink
6.1 FlinkML - Machine Learning Library
• FlinkML is the Machine Learning (ML) library for Flink, added in March 2015. Still in beta as of June 24th, 2015 (Flink 0.9 release).
• FlinkML aims to provide:
  • an intuitive API
  • scalable ML algorithms
  • tools that help minimize glue code in end-to-end ML applications
• FlinkML will allow data scientists to:
  • test their models locally using subsets of data
  • use the same code to run their algorithms at a much larger scale in a cluster setting
6.1 FlinkML
• FlinkML is inspired by other open source efforts, in particular:
  • scikit-learn, for cleanly specifying ML pipelines
  • Spark's MLlib, for providing ML algorithms that scale with cluster size
• FlinkML's unique features are:
  1. Exploiting the in-memory data streaming nature of Flink.
  2. Natively executing iterative processing algorithms, which are common in Machine Learning.
  3. Streaming ML designed specifically for data streams.
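As a rough illustration of that API, here is a hedged sketch of training and applying a linear regression model with FlinkML. The MultipleLinearRegression learner and its fit/predict methods follow the FlinkML documentation of that era; the data sets, feature values and parameter settings are made up for illustration.

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Tiny made-up training set: a label plus a feature vector per example
val training: DataSet[LabeledVector] = env.fromElements(
  LabeledVector(1.0, DenseVector(0.2, 0.8)),
  LabeledVector(2.0, DenseVector(0.4, 1.6)),
  LabeledVector(3.0, DenseVector(0.6, 2.4)))

val test: DataSet[DenseVector] = env.fromElements(DenseVector(0.5, 2.0))

// Configure the learner, fit it on the training data, then score the test data
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)

mlr.fit(training)
val predictions = mlr.predict(test)

The same code runs unchanged on a local subset of the data or at full scale on a cluster, which is the point of the bullet above.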
6.1 FlinkML
• Learn more about FlinkML at http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
• You can find more details about FlinkML's goals and where it is headed in the vision and roadmap: FlinkML: Vision and Roadmap https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap
• Check more FlinkML web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/51-flinkml
• Interested in helping out the Apache Flink project? Please check: How to contribute? http://flink.apache.org/how-to-contribute.html
6.2 Gelly – Graph Analytics for Flink
• Gelly is the Graph API for Flink. The Gelly Java API was added in February 2015; the Gelly Scala API started in May 2015 and is a work in progress.
• Gelly is still in beta as of June 24th, 2015 (Flink 0.9 release).
• Gelly provides:
  • a set of methods and utilities to create, transform and modify graphs
  • a library of graph algorithms which aims to simplify the development of graph analysis applications
  • iterative graph algorithms, executed by leveraging mutable state
6.2 Gelly – Graph Analytics for Flink
• Large-scale graph processing with Apache Flink - Vasia Kalavri, February 1st, 2015: http://www.slideshare.net/vkalavri/largescale-graph-processing-with-apache-flink-graphdevroom-fosdem15
• A graph streaming model and API on top of Flink streaming, providing interfaces similar to Gelly - Janos Daniel Balo, June 30, 2015: http://kth.diva-portal.org/smash/get/diva2:830662/FULLTEXT01.pdf
• Check out more Gelly web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/50-gelly
• Interested in helping out the Apache Flink project? Please check: How to contribute? http://flink.apache.org/how-to-contribute.html
7. How Flink offers Interactive Data
Analysis?
Interactive Shell - Scala REPL (Read-Eval-Print Loop):
• ./bin/start-scala-shell.sh
• Complete Scala API available
• Syntax completion
• Build your jobs incrementally
• Upcoming: caching of intermediate results (WIP)
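A short example session, as a hedged sketch: the shell pre-binds an ExecutionEnvironment (called env here; the exact binding name can differ between versions), and the job is built up statement by statement.

// Inside ./bin/start-scala-shell.sh; `env` is the pre-bound ExecutionEnvironment
val text = env.fromElements(
  "To be or not to be",
  "that is the question")

val counts = text
  .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  .map((_, 1))
  .groupBy(0)
  .sum(1)

counts.print()  // depending on the version, an explicit env.execute() may be needed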
7. How Flink offers Interactive Data
Analysis?
• Apache Zeppelin: a collaborative data analytics and visualization tool
8. What is Flink Architecture?
• Flink implements the Kappa architecture: run batch programs on a streaming system.
• References about the Kappa architecture:
  • Questioning the Lambda Architecture - Jay Kreps, July 2nd, 2014: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
  • Turning the database inside out with Apache Samza - Martin Kleppmann, March 4th, 2015:
    o http://www.youtube.com/watch?v=fU9hR3kiOK0 (VIDEO)
    o http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html (TRANSCRIPT)
    o http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/ (BLOG)
8. What is Flink Architecture?
8.1 Client
8.2 Master (Job Manager)
8.3 Worker (Task Manager)
8.1 Client
• Type extraction
• Optimize: in all APIs, not just SQL queries as in Spark
• Construct the job dataflow graph
• Pass the job dataflow graph to the Job Manager
• Retrieve the job results

Example program (transitive closure) handed by the Client to the Job Manager:

case class Path (from: Long, to: Long)
val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) => Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}

[Figure: the Client runs type extraction and the optimizer on the program, producing a dataflow graph (data sources orders.tbl and lineitem.tbl, Filter, Map, a hybrid-hash Join with build/probe sides and hash partitioning, and a sorted GroupReduce) that is passed to the Job Manager.]
8.2 Job Manager (JM)
• Parallelization: create the Execution Graph
• Scheduling: assign tasks to Task Managers
• State tracking: supervise the execution
• JM High Availability is still a work in progress as of Flink 0.9: https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability

[Figure: the Job Manager expands the dataflow graph (data sources orders.tbl and lineitem.tbl, Filter, Map, hybrid-hash Join, sorted GroupReduce) into a parallel execution graph and assigns its tasks to the available Task Managers.]
8.3 Task Manager (TM)
• Operations are split up into tasks depending on the specified parallelism
• Each parallel instance of an operation runs in a separate task slot
• The scheduler may run several tasks from different operators in one task slot

[Figure: several Task Managers, each with a number of task slots in which parallel task instances run.]
9. What is Flink Programming Model?
• DataSet and DataStream, as programming abstractions, are the foundation for user programs and higher layers.
• Flink extends the MapReduce model with new operators that represent many common data analysis tasks more naturally and efficiently.
• All operators will start working in memory and gracefully go out of core under memory pressure.
9.1 DataSet
• Central notion of the programming API
• Files and other data sources are read into DataSets
– DataSet<String> text = env.readTextFile(…)
• Transformations on DataSets produce DataSets
– DataSet<String> first = text.map(…)
• DataSets are written to files or printed to stdout
– first.writeAsCsv(…)
• Execution is triggered with env.execute()
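Putting these pieces together, a minimal self-contained batch program might look like the following sketch (the input/output paths and the job name are placeholders):

import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Files and other data sources are read into DataSets
    val text: DataSet[String] = env.readTextFile("hdfs:///path/to/input")

    // Transformations on DataSets produce DataSets
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    // DataSets are written to files (or printed to stdout)
    counts.writeAsCsv("hdfs:///path/to/output")

    // Execution is triggered with env.execute()
    env.execute("WordCount")
  }
}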
9.1 DataSet
Used for batch processing.

[Figure: data is read from a Source into a DataSet, flows through operations such as Map and Reduce into a result DataSet, and is written to a Sink.]
9.2 DataStream
Used for real-time event streams.

[Figure: events from a Source (e.g., a live stock feed with prices for Microsoft, Google and Apple) flow through operations such as "alert if Microsoft > 120", "sum every 10 seconds" and "alert if sum > 10000" into Sinks such as writing the event to a database.]

Example: stream from a live stock feed.
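A hedged sketch of that stock-feed example against the Scala DataStream API: the socket source, the "symbol,price" line format and the thresholds are made up, and the windowing calls follow the pattern of the word-count example shown earlier (exact import paths and signatures may differ across 0.9-era releases).

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.helper.Time
import java.util.concurrent.TimeUnit.SECONDS

case class StockPrice(symbol: String, price: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Assume quotes arrive as "symbol,price" lines on a socket
val quotes: DataStream[StockPrice] = env
  .socketTextStream("localhost", 9999)
  .map { line =>
    val Array(symbol, price) = line.split(",")
    StockPrice(symbol, price.toDouble)
  }

// Alert if a Microsoft quote exceeds 120
quotes
  .filter(q => q.symbol == "Microsoft" && q.price > 120)
  .print()

// Sum prices per symbol over 10-second windows (same windowing pattern as the
// earlier word-count example); an alert filter on the sums would follow here
quotes
  .window(Time.of(10, SECONDS))
  .groupBy("symbol")
  .sum("price")
  .print()

env.execute("Stock alerts")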
Agenda
I. What is Apache Flink stack and how it fits
into the Big Data ecosystem?
II. How Apache Flink integrates with Apache
Hadoop and other open source tools?
III. Why Apache Flink is an alternative to
Apache Hadoop MapReduce, Apache
Storm and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
VI. What are some key takeaways?
II. How Apache Flink integrates with
Hadoop and other open source tools?
• Flink integrates well with other open source tools for data input and output as well as deployment.
• Hadoop integration out of the box:
  • HDFS to read and write, with secure HDFS support
  • Deploy inside of Hadoop via YARN
  • Reuse data types (that implement the Writable interface)
• YARN setup: http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html
• YARN configuration: http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn
II. How Apache Flink integrates with
Hadoop and other open source tools?
Hadoop Compatibility in Flink, by Fabian Hüske - November 18, 2014: http://flink.apache.org/news/2014/11/18/hadoop-compatibility.html
A thin wrapper (the Hadoop Compatibility layer) lets Flink run legacy Hadoop MapReduce jobs, reuse Hadoop input and output formats, and reuse functions like Map and Reduce.
II. How Apache Flink integrates with
Hadoop and other open source tools?
[Table: the services Flink integrates with and the corresponding open source tools (shown as logos on the original slide): storage/serving layer, data formats, data ingestion services, and resource management.]
II. How Apache Flink integrates with
Hadoop and other open source tools?
• Apache Flink + Apache SAMOA for Machine Learning on streams: http://samoa.incubator.apache.org/
• Flink integrates with Zeppelin: http://zeppelin.incubator.apache.org/
• Flink on Apache Tez: http://tez.apache.org/
• Flink + Apache MRQL: http://mrql.incubator.apache.org
• Flink + Tachyon: http://tachyon-project.org/ (Running Apache Flink on Tachyon: http://tachyon-project.org/Running-Flink-on-Tachyon.html)
• XtreemFS: http://www.xtreemfs.org/
Agenda
I. What is Apache Flink stack and how it fits
into the Big Data ecosystem?
II. How Apache Flink integrates with Apache
Hadoop and other open source tools?
III. Why Apache Flink is an alternative to
Apache Hadoop MapReduce, Apache Storm
and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
VI. What are some key takeaways?
III. Why Apache Flink is an alternative to
Apache Hadoop MapReduce, Apache
Storm and Apache Spark?
1. Why Apache Flink is an alternative to
Apache Hadoop MapReduce?
2. Why Apache Flink is an alternative to
Apache Storm?
3. Why Apache Flink is an alternative to
Apache Spark?
1. Why Apache Flink is an alternative to
Apache Hadoop MapReduce?
• Flink offers cyclic dataflows, compared to the two-stage, disk-based MapReduce paradigm.
• The application programming interface (API) for Flink is easier to use than programming for Hadoop's MapReduce.
• Flink is easier to test than MapReduce.
• Flink can leverage in-memory processing, data streaming and iteration operators for faster data processing.
• Flink can work on file systems other than HDFS.
1. Why Apache Flink is an alternative to
Apache Hadoop MapReduce?
• Flink lets users work in a unified framework, allowing them to build a single data workflow that combines, for example, streaming, batch, SQL and machine learning.
• Flink can analyze real-time streaming data.
• Flink can process graphs using its own Gelly library.
• Flink can use machine learning algorithms from its own FlinkML library.
• Flink supports interactive queries and iterative algorithms, which are not well served by Hadoop MapReduce.
1. Why Apache Flink is an alternative to Apache
Hadoop MapReduce?
• Flink extends the MapReduce model with new operators: join, cross, union, iterate, iterate delta, coGroup, … (see the sketch of a few of them below).

[Figure: instead of a fixed Input - Map - Reduce - Output pipeline, DataSets can flow through arbitrary combinations of Map, Reduce, Join and other operators before reaching the output.]
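A small sketch of a few of these operators in the Scala DataSet API; the data sets and field positions are made up for illustration.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val a: DataSet[(Int, String)] = env.fromElements((1, "a"), (2, "b"))
val b: DataSet[(Int, Double)] = env.fromElements((1, 0.5), (3, 0.7))

// Relational join on the first tuple field
val joined = a.join(b).where(0).equalTo(0)

// Cartesian product of the two data sets
val crossed = a.cross(b)

// Union with another data set of the same type
val merged = a.union(env.fromElements((3, "c")))

// Co-group: for each key, receive all matching elements from both sides at once
val coGrouped = a.coGroup(b).where(0).equalTo(0) {
  (left, right) => (left.size, right.size)
}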
2. Why Apache Flink is an alternative to
Apache Storm?
• Higher-level and easier-to-use API
• More lightweight fault tolerance strategy
• Exactly-once processing guarantees
• Stateful operators
• Flink also supports batch processing
• 'Twitter Heron: Stream Processing at Scale' by Twitter, or "why Storm sucks, by Twitter themselves"! http://dl.acm.org/citation.cfm?id=2742788
• Recap of the paper 'Twitter Heron: Stream Processing at Scale' - June 15th, 2015: http://blog.acolyer.org/2015/06/15/twitter-heron-stream-processing-at-scale/
3. Why Apache Flink is an alternative to
Apache Spark?
3.1. A true low-latency streaming engine (not micro-batches!) that unifies batch and streaming in a single Big Data processing framework
3.2. Native closed-loop iteration operators, making graph and machine learning applications run much faster
3.3. A custom memory manager (no more frequent Out Of Memory errors!)
3.4. An automatic cost-based optimizer (little re-configuration and little maintenance when the cluster characteristics change and the data evolves over time)
3.5. Little tuning or configuration required
3.1 True low latency streaming engine
• "I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture, Flink is a perfect match for big data stream processing in the Apache stack." - Volker Markl
  Ref.: On Apache Flink. Interview with Volker Markl, June 24th, 2015: http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
• Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch. Batch is just treated as a finite set of streamed data.
• This makes Flink the most sophisticated distributed open source Big Data processing engine (though not yet the most mature one!).
3.2 Iteration Operators
Why Iterations? Many Machine Learning and Graph
processing algorithms need iterations! For example:
• Machine Learning algorithms
  • Clustering (K-Means, …)
  • Logistic regression
  • Gradient descent
• Graph processing algorithms
  • PageRank
  • Path algorithms on graphs (shortest paths, centralities, …)
  • Graph communities / dense sub-components
  • Inference (belief propagation)
3.2 Iteration Operators
• Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.
• Flink executes programs with iterations as cyclic dataflows: a dataflow program (and all its operators) is scheduled just once, and the data is fed back from the tail of an iteration to its head.
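For example, a bulk iteration in the Scala DataSet API, adapted from the pi-estimation example in the Flink iterations documentation (the iteration count is arbitrary): the whole loop lives inside the dataflow, so the plan is scheduled once and the data is fed back from tail to head.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Monte Carlo estimation of pi: in each superstep, sample one random point
// and add 1 to the running count if it falls inside the unit circle
val initial = env.fromElements(0)

val count = initial.iterate(10000) { iterationInput =>
  iterationInput.map { i =>
    val x = Math.random()
    val y = Math.random()
    i + (if (x * x + y * y < 1) 1 else 0)
  }
}

val pi = count.map(_ * 4.0 / 10000)
pi.print()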
3.2 Iteration Operators
• Delta iterations run only on the parts of the data that are changing, and can significantly speed up many machine learning and graph algorithms, because the work in each iteration decreases as the number of iterations goes on.
• Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
3.2 Iteration Operators
Non-native iterations in Hadoop and Spark are implemented as regular for-loops driven by the client, which submits one job per step:

for (int i = 0; i < maxIterations; i++) {
  // Execute MapReduce job
}

[Figure: the client drives the loop, launching a separate job for each step of the iteration.]
3.2 Iteration Operators
• Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.
• Spinning Fast Iterative Data Flows - Ewen et al., 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf (the academic paper describing the Apache Flink model for incremental iterative dataflow processing)
• Recap of the paper, June 18, 2015: http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/
• Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
3.3 Custom Memory Manager
Features:
• C++-style memory management inside the JVM
• User data stored in serialized byte arrays in the JVM
• Memory is allocated, de-allocated and used strictly through an internal buffer pool implementation
Advantages:
1. Flink will not throw an OOM exception on you.
2. Reduced Garbage Collection (GC)
3. Very efficient disk spilling and network transfers
4. No need for runtime tuning
5. More reliable and stable performance
3.3 Custom Memory Manager
Flink contains its own memory management stack. To do that, Flink also contains its own type extraction and serialization components.

public class WC {
  public String word;
  public int count;
}

[Figure: the JVM heap is split into unmanaged memory, holding user code objects such as WC instances, and Flink-managed memory: a pool of memory pages used for sorting, hashing and caching, plus network buffers for shuffling and broadcasts. User data is stored serialized in those pages.]
3.3 Custom Memory Manager
• 332 posts on Out Of Memory errors on the Apache Spark users list as of June 30th, 2015
• 7 posts on Out Of Memory errors on the Apache Flink users list as of June 30th, 2015
• Compared to Flink, Spark is still behind in custom memory management, but it is catching up with its Project Tungsten for memory management and binary processing: manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection. April 28, 2015: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
3.4 Built-in Cost-Based Optimizer
• Apache Flink comes with an optimizer that is independent of the actual programming interface.
• It chooses a fitting execution strategy depending on the inputs and operations.
• Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge join or a hybrid hash join algorithm (a small hint-based sketch follows below).
• This helps you focus on your application logic rather than on parallel execution.
• Quick introduction to the optimizer: section 6 of the paper 'The Stratosphere platform for big data analytics': http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf
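Normally the optimizer picks the join strategy from its own estimates; when you already know the relative input sizes, the DataSet API also lets you hint it. A minimal sketch (paths and tuple schemas are placeholders):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val large: DataSet[(Int, String)] = env.readCsvFile[(Int, String)]("hdfs:///path/to/large")
val small: DataSet[(Int, Double)] = env.readCsvFile[(Int, Double)]("hdfs:///path/to/small")

// Plain join: the strategy (partition vs. broadcast, sort-merge vs. hybrid hash)
// is left entirely to the cost-based optimizer
val joinedAuto = large.join(small).where(0).equalTo(0)

// joinWithTiny hints that `small` fits in memory, steering the optimizer towards
// broadcasting it and building the hash table on that side
val joinedHinted = large.joinWithTiny(small).where(0).equalTo(0)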
3.4 Built-in Cost-Based Optimizer
What is automatic optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.

[Figure: the same program gets a different execution plan depending on its environment - Plan A when run locally on a data sample on the laptop, Plan B when run on large files on the cluster, Plan C when run a month later after the data has evolved. The optimizer decides hash vs. sort, partition vs. broadcast, caching, and reusing partitioning/sort order.]
3.4 Built-in Cost-Based Optimizer
In contrast to Flink's built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets, because you need to manually control partitioning and caching if you want to get it right.
Spark SQL uses the Catalyst optimizer, which supports both rule-based and cost-based optimization. References:
• Spark SQL: Relational Data Processing in Spark: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
• Deep Dive into Spark SQL's Catalyst Optimizer: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
3.5 Little tuning or configuration required
• Flink requires no memory thresholds to configure: Flink manages its own memory.
• Flink requires no complicated network configuration: the pipelined engine requires much less memory for data exchange.
• Flink requires no serializers to be configured: Flink handles its own type extraction and data representation.
• Flink programs can be adjusted to data automatically: Flink's optimizer can choose execution strategies automatically.
Benchmark between Spark and Flink!
http://goo.gl/WocQci
• The results were published in the proceedings of the 18th International Conference on Business Information Systems (BIS 2015), Poznań, Poland, June 24-26, 2015. Chapter 3: Evaluating New Approaches of Big Data Analytics Frameworks, pages 28-37. http://goo.gl/WocQci
• Apache Flink outperforms Apache Spark in the processing of machine learning and graph algorithms, as well as relational queries.
• Apache Spark outperforms Apache Flink in batch processing.
Benchmark between Spark and Flink!
http://goo.gl/WocQci
IV. Who is using Apache Flink?
You’ll probably hear more about who is using Flink in
production at the first dedicated Apache Flink
conference on October 12-13, 2015 in Berlin, Germany!
http://flink-forward.org/
V. Where to learn more about Flink?
Flink at the Apache Software Foundation: flink.apache.org/
data-artisans.com
@ApacheFlink, #ApacheFlink, #Flink
apache-flink.meetup.com
github.com/apache/flink
user@flink.apache.org dev@flink.apache.org
Flink Knowledge Base http://sparkbigdata.com/component/tags/tag/27-flink
VI. What are some key takeaways?
1. Although most of the current buzz is about Spark, Flink offers the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases.
2. I foresee more maturity of Apache Flink and more adoption, especially for use cases with real-time stream processing or fast iterative processing.
3. Apache Spark and Apache Flink will both have their sweet spots, despite their "Me Too syndrome"!
4. If you like what you saw about Flink, there are many ways to get involved with Flink and be an early adopter: http://flink.apache.org/how-to-contribute.html
Thank you!
67

More Related Content

What's hot

What is gRPC introduction gRPC Explained
What is gRPC introduction gRPC ExplainedWhat is gRPC introduction gRPC Explained
What is gRPC introduction gRPC Explainedjeetendra mandal
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Tk2323 lecture 11 process and thread
Tk2323 lecture 11   process and threadTk2323 lecture 11   process and thread
Tk2323 lecture 11 process and threadMengChun Lam
 
Networking in Java with NIO and Netty
Networking in Java with NIO and NettyNetworking in Java with NIO and Netty
Networking in Java with NIO and NettyConstantine Slisenka
 
Android AIDL Concept
Android AIDL ConceptAndroid AIDL Concept
Android AIDL ConceptCharile Tsai
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward
 
Apache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondApache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondBowen Li
 
ORM: Object-relational mapping
ORM: Object-relational mappingORM: Object-relational mapping
ORM: Object-relational mappingAbhilash M A
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Spring - Part 1 - IoC, Di and Beans
Spring - Part 1 - IoC, Di and Beans Spring - Part 1 - IoC, Di and Beans
Spring - Part 1 - IoC, Di and Beans Hitesh-Java
 
Logging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesLogging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesVineet Sabharwal
 
The Kotlin Programming Language
The Kotlin Programming LanguageThe Kotlin Programming Language
The Kotlin Programming Languageintelliyole
 
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...Edureka!
 
gRPC and Microservices
gRPC and MicroservicesgRPC and Microservices
gRPC and MicroservicesJonathan Gomez
 
OpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-SideOpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-SideTim Burks
 

What's hot (20)

Maven introdução Muito Rápida
Maven introdução Muito RápidaMaven introdução Muito Rápida
Maven introdução Muito Rápida
 
What is gRPC introduction gRPC Explained
What is gRPC introduction gRPC ExplainedWhat is gRPC introduction gRPC Explained
What is gRPC introduction gRPC Explained
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
CSharp.ppt
CSharp.pptCSharp.ppt
CSharp.ppt
 
Tk2323 lecture 11 process and thread
Tk2323 lecture 11   process and threadTk2323 lecture 11   process and thread
Tk2323 lecture 11 process and thread
 
Networking in Java with NIO and Netty
Networking in Java with NIO and NettyNetworking in Java with NIO and Netty
Networking in Java with NIO and Netty
 
Introduction to gRPC
Introduction to gRPCIntroduction to gRPC
Introduction to gRPC
 
Android AIDL Concept
Android AIDL ConceptAndroid AIDL Concept
Android AIDL Concept
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
 
Apache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondApache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyond
 
ORM: Object-relational mapping
ORM: Object-relational mappingORM: Object-relational mapping
ORM: Object-relational mapping
 
Qt Qml
Qt QmlQt Qml
Qt Qml
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Spring - Part 1 - IoC, Di and Beans
Spring - Part 1 - IoC, Di and Beans Spring - Part 1 - IoC, Di and Beans
Spring - Part 1 - IoC, Di and Beans
 
Logging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesLogging using ELK Stack for Microservices
Logging using ELK Stack for Microservices
 
Kotlin
KotlinKotlin
Kotlin
 
The Kotlin Programming Language
The Kotlin Programming LanguageThe Kotlin Programming Language
The Kotlin Programming Language
 
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...
Swift Tutorial For Beginners | Swift Programming Tutorial | IOS App Developme...
 
gRPC and Microservices
gRPC and MicroservicesgRPC and Microservices
gRPC and Microservices
 
OpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-SideOpenAPI and gRPC Side by-Side
OpenAPI and gRPC Side by-Side
 

Viewers also liked

Testing distributed, complex web applications
Testing distributed, complex web applicationsTesting distributed, complex web applications
Testing distributed, complex web applicationsJens-Christian Fischer
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkdatamantra
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 

Viewers also liked (11)

Testing distributed, complex web applications
Testing distributed, complex web applicationsTesting distributed, complex web applications
Testing distributed, complex web applications
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 

Similar to Overview of Apache Flink: Next-Gen Big Data Analytics Framework

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Robert Metzger
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
 
Apache Flink Online Training
Apache Flink Online TrainingApache Flink Online Training
Apache Flink Online TrainingLearntek1
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache FlinkAljoscha Krettek
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureGyula Fóra
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community UpdateRobert Metzger
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Stephan Ewen
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkFlink Forward
 

Similar to Overview of Apache Flink: Next-Gen Big Data Analytics Framework (20)

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Flink Online Training
Apache Flink Online TrainingApache Flink Online Training
Apache Flink Online Training
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache Flink
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. Spark
 

More from Slim Baltagi

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesSlim Baltagi
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsSlim Baltagi
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeSlim Baltagi
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitSlim Baltagi
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Slim Baltagi
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceSlim Baltagi
 

More from Slim Baltagi (13)

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetes
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to Finance
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Log Analysis using OSSEC sasoasasasas.pptx
Overview of Apache Flink: Next-Gen Big Data Analytics Framework

  • 11. The only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: Real-Time stream processing, Machine Learning at scale, Graph Analysis, Batch Processing 11
  • 12. Why Apache Flink is Next-Gen? 1st Generation (1G): MapReduce (Batch). 2nd Generation (2G): Direct Acyclic Graph (DAG) Dataflows (Batch, Interactive). 3rd Generation (3G): RDD, Resilient Distributed Datasets (Batch, Interactive, Near-Real-Time Streaming, Iterative processing). 4th Generation (4G): Cyclic Dataflows (Batch, Interactive, Real-Time Streaming, Native Iterative processing). 12
  • 13. 5. Flink APIs 5.1 DataSet API for static data - Java, Scala, and Python 5.2 DataStream API for unbounded real-time streams - Java and Scala 5.3 Table API for relational queries - Java and Scala 13
  • 14. 5.1 DataSet API – Batch processing
case class Word(word: String, frequency: Int)
DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()
DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
14
  • 15. 5.2 DataStream API – Real-Time Streaming Analytics  Many time-critical applications need to process large streams of live data and provide results in real-time. For example: • Fraud detection • Financial stock monitoring • Anomaly detection • Traffic management applications • Online recommenders  Flink Streaming provides a high-throughput, low-latency stateful stream processing system with rich windowing semantics. It has built-in connectors to many data sources such as Flume, Kafka, Twitter and RabbitMQ. 15
  • 16. 5.2 DataStream API – Real-Time Streaming Analytics  Still in Beta as of June 24th, 2015 (Flink 0.9 release)  Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API.  Flink Streaming provides native support for iterative stream processing.  Streaming fault tolerance, added in Flink 0.9 (released on June 24th, 2015), allows exactly-once processing guarantees for Flink streaming programs that analyze streaming sources persisted by Apache Kafka. See the paper: ‘Lightweight Asynchronous Snapshots for Distributed Dataflows’, June 28, 2015: http://arxiv.org/pdf/1506.08603v1.pdf 16
  • 17. 5.2 DataStream API – Real-Time Streaming Analytics  Data Streaming Fault Tolerance document: http://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html Because Flink is based on a pipelined execution engine akin to parallel database systems, it can: • integrate streaming operations with rich windowing semantics seamlessly • process streaming operations in a pipelined way, with lower latency than micro-batch architectures and without the complexity of lambda architectures.  Flink Streaming web resources at the Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/49-flink-streaming 17
  • 18. 5.3 Table API – Relational Queries
Table API (queries):
val customers = env.readCsvFile(…).as('id, 'mktSegment)
  .filter("mktSegment = AUTOMOBILE")
val orders = env.readCsvFile(…)
  .filter( o => dateFormat.parse(o.orderDate).before(date) )
  .as("orderId, custId, orderDate, shipPrio")
val items = orders
  .join(customers).where("custId = id")
  .join(lineitems).where("orderId = id")
  .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")
val result = items
  .groupBy("orderId, orderDate, shipPrio")
  .select("orderId, revenue.sum, orderDate, shipPrio")
18
  • 19. 5.3 Table API – Relational Queries  The Table API was added in February 2015. Still in Beta as of June 24th, 2015 (Flink 0.9 release)  Flink provides a Table API that allows specifying operations using SQL-like expressions instead of manipulating DataSet or DataStream.  The Table API can be used in both batch programs (on structured data sets) and streaming programs (on structured data streams): http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html  Flink Table web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/52-flink-table 19
  • 20. 6. Flink Domain Specific Libraries 6.1 FlinkML – Machine Learning Library 6.2 Gelly – Graph Analytics for Flink 20
  • 21. 6.1 FlinkML - Machine Learning Library  FlinkML is the Machine Learning (ML) library for Flink that was added in March 2015. Still in beta as of June 24th 2015 ( Flink 0.9 release)  FlinkML aims to provide: • an intuitive API • scalable ML algorithms • tools that help minimize glue code in end-to-end ML applications  FlinkML will allow data scientists to: • test their models locally using subsets of data • use the same code to run their algorithms at a much larger scale in a cluster setting. 21
  • 22. 6.1 FlinkML  FlinkML is inspired by other open source efforts, in particular: • scikit-learn for cleanly specifying ML pipelines • Spark’s MLlib for providing ML algorithms that scale with cluster size.  FlinkML’s unique features are: 1. Exploiting the in-memory data streaming nature of Flink. 2. Natively executing iterative processing algorithms which are common in Machine Learning. 3. Streaming ML designed specifically for data streams. 22
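A minimal sketch of what an end-to-end FlinkML program can look like, assuming the MultipleLinearRegression estimator and its fit/predict methods as documented around the Flink 0.9 release; the training and test data sets are left as placeholders in the same "..." style the deck's own code uses:

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Training data: feature vectors with a real-valued label; test data: plain feature vectors
val training: DataSet[LabeledVector] = ...   // e.g. parsed from a CSV file
val testing: DataSet[Vector] = ...

// Configure the learner; the same code runs on a local sample or on a full cluster
val mlr = MultipleLinearRegression()
  .setIterations(100)
  .setStepsize(0.5)

mlr.fit(training)                        // train the model
val predictions = mlr.predict(testing)   // DataSet of predicted LabeledVectors
predictions.print()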
  • 23. 6.1 FlinkML  Learn more about FlinkML at http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/  You can find more details about FlinkML goals and where it is headed in the vision and roadmap here: FlinkML: Vision and Roadmap https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap  Check more FlinkML web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/51-flinkml  Interested in helping out the Apache Flink project? Please check: How to contribute? http://flink.apache.org/how-to-contribute.html 23
  • 24. 6.2 Gelly – Graph Analytics for Flink  Gelly is a Graph API for Flink. Gelly Java API was added in February 2015. Gelly Scala API started in May 2015 and is Work In Progress.  Gelly is still in Beta as of June 24th 2015 ( Flink 0.9 release).  Gelly provides: • A set of methods and utilities to create, transform and modify graphs • A library of graph algorithms which aims to simplify the development of graph analysis applications • Iterative graph algorithms are executed leveraging mutable state 24
  • 25. 6.2 Gelly – Graph Analytics for Flink  Large-scale graph processing with Apache Flink - Vasia Kalavri, February 1st, 2015: http://www.slideshare.net/vkalavri/largescale-graph-processing-with-apache-flink-graphdevroom-fosdem15  Graph streaming model and API on top of Flink streaming, providing similar interfaces to Gelly - Janos Daniel Balo, June 30, 2015: http://kth.diva-portal.org/smash/get/diva2:830662/FULLTEXT01.pdf  Check out more Gelly web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/50-gelly  Interested in helping out the Apache Flink project? Please check: How to contribute? http://flink.apache.org/how-to-contribute.html 25
  • 26. 7. How Flink offers Interactive Data Analysis? Interactive Shell - Scala REPL ( Read Eval Print Loop ) : • ./bin/start-scala-shell.sh • Complete Scala API available • Syntax completion • Build your jobs incrementally • Upcoming: Caching of intermediate results (WIP) 26
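For illustration, a short session in the shell might look like the following; this assumes the shell pre-binds a batch ExecutionEnvironment, referred to as env here, and the prompt shown is illustrative (the exact names depend on the Flink version):

$ ./bin/start-scala-shell.sh
Scala-Flink> val text = env.fromElements("to be or not to be", "that is the question")
Scala-Flink> val counts = text.flatMap { _.toLowerCase.split("\\s+") }
                              .map { (_, 1) }.groupBy(0).sum(1)
Scala-Flink> counts.print()   // triggers execution and prints (word, count) pairs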
  • 27. 7. How Flink offers Interactive Data Analysis?  Apache Zeppelin: collaborative data analytics and visualization tool 27
  • 28. 8. What is Flink Architecture?  Flink implements the Kappa Architecture: run batch programs on a streaming system.  References about the Kappa Architecture: • Questioning the Lambda Architecture - Jay Kreps, July 2nd, 2014 http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html • Turning the database inside out with Apache Samza - Martin Kleppmann, March 4th, 2015 o http://www.youtube.com/watch?v=fU9hR3kiOK0 (VIDEO) o http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html (TRANSCRIPT) o http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/ (BLOG) 28
  • 29. 8. What is Flink Architecture? 8.1 Client 8.2 Master (Job Manager) 8.3 Worker (Task Manager) 29
  • 30. 8.1 Client  Type extraction  Optimize: in all APIs, not just SQL queries as in Spark  Construct job Dataflow graph  Pass job Dataflow graph to Job Manager  Retrieve job results
Example program (transitive closure via iteration):
case class Path(from: Long, to: Long)
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
(Figure: the Client runs type extraction and the Optimizer on the program and ships the resulting dataflow graph, e.g. DataSource orders.tbl / lineitem.tbl, Filter, Map, Hybrid Hash Join, GroupReduce, to the Job Manager.) 30
  • 31. 8.2 Job Manager (JM)  Parallelization: Create Execution Graph  Scheduling: Assign tasks to task managers  State tracking: Supervise the execution  JM High Availability is still a Work In Progress as of Flink 0.9 https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability (Figure: the Job Manager turns the dataflow graph into a parallel Execution Graph and assigns its tasks to the Task Managers.) 31
  • 32. 8.3 Task Manager (TM)  Operations are split up into tasks depending on the specified parallelism  Each parallel instance of an operation runs in a separate task slot  The scheduler may run several tasks from different operators in one task slot (Figure: Task Managers, each with several task slots.) 32
  • 33. 9. What is Flink Programming Model?  DataSet and DataStream as programming abstractions are the foundation for user programs and higher layers.  Flink extends the MapReduce model with new operators that represent many common data analysis tasks more naturally and efficiently.  All operators will start working in memory and gracefully go out of core under memory pressure. 33
  • 34. 9.1 DataSet • Central notion of the programming API • Files and other data sources are read into DataSets – DataSet<String> text = env.readTextFile(…) • Transformations on DataSets produce DataSets – DataSet<String> first = text.map(…) • DataSets are printed to files or on stdout – first.writeAsCsv(…) • Execution is triggered with env.execute() 34
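Putting those pieces together, a minimal sketch of a complete batch word count with the Scala DataSet API (input and output paths are placeholders):

import org.apache.flink.api.scala._

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read a text file into a DataSet, one element per line
    val text = env.readTextFile("file:///path/to/input")

    // Transformations on DataSets produce new DataSets
    val counts = text.flatMap { _.toLowerCase.split("\\W+").filter(_.nonEmpty) }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    // Write the result and trigger execution
    counts.writeAsCsv("file:///path/to/output", "\n", " ")
    env.execute("WordCount sketch")
  }
}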
  • 35. 9.1 DataSet: used for Batch Processing. Example: a Map and Reduce operation. (Figure: Source → Data Set → Operation → Data Set → Sink, with sample records flowing through Map and Reduce.) 35
  • 36. 9.2 DataStream: real-time event streams. Example: a stream from a live stock feed. (Figure: Source → Data Stream → Operation → Data Stream → Sink; the source emits Name/Price events such as Microsoft 124, Google 516, Apple 235, and the operations include 'Alert if Microsoft > 120', 'Sum every 10 seconds', 'Alert if sum > 10000' and 'Write event to database'.) 36
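A sketch of what such a pipeline could look like with the DataStream API, reusing the Flink 0.9-style windowing helpers shown in the earlier DataStream snippet (slide 14); the in-memory source, case class and thresholds below are illustrative assumptions, and a real job would read from a live feed or a connector such as Kafka:

import java.util.concurrent.TimeUnit.SECONDS
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.helper.Time

case class StockPrice(symbol: String, price: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Stand-in for a live stock feed
val prices: DataStream[StockPrice] = env.fromElements(
  StockPrice("Microsoft", 124), StockPrice("Google", 516), StockPrice("Apple", 235))

// Alert if Microsoft rises above 120
prices.filter(p => p.symbol == "Microsoft" && p.price > 120).print()

// Sum the prices per symbol every 10 seconds
prices.map(p => (p.symbol, p.price))
  .window(Time.of(10, SECONDS)).every(Time.of(10, SECONDS))
  .groupBy(0)
  .sum(1)
  .print()

env.execute("Stock feed sketch")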
  • 37. Agenda I. What is Apache Flink stack and how it fits into the Big Data ecosystem? II. How Apache Flink integrates with Apache Hadoop and other open source tools? III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? VI. What are some key takeaways? 37
  • 38. II. How Apache Flink integrates with Hadoop and other open source tools?  Flink integrates well with other open source tools for data input and output as well as deployment.  Hadoop integration out of the box: • HDFS to read and write. Secure HDFS support • Deploy inside of Hadoop via YARN • Reuse data types (that implement the Writable interface)  YARN Setup http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html  YARN Configuration http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn 38
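For example, reading from and writing to HDFS only requires HDFS URIs in an ordinary batch program (host, port and paths below are placeholders):

// Same Scala DataSet API as before, just pointed at HDFS
val text = env.readTextFile("hdfs://namenode:9000/path/to/input")
text.filter(_.contains("flink"))
    .writeAsText("hdfs://namenode:9000/path/to/output")
env.execute("HDFS I/O sketch")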
  • 39. II. How Apache Flink integrates with Hadoop and other open source tools? Hadoop Compatibility in Flink by Fabian Hüske - November 18, 2014 http://flink.apache.org/news/2014/11/18/hadoop-compatibility.html Hadoop integration with a thin wrapper (Hadoop Compatibility layer) to run legacy Hadoop MapReduce jobs, reuse Hadoop input and output formats and reuse functions like Map and Reduce. 39
  • 40. II. How Apache Flink integrates with Hadoop and other open source tools? (Table: open source tools by service, covering the Storage/Serving Layer, Data Formats, Data Ingestion Services and Resource Management.) 40
  • 41. II. How Apache Flink integrates with Hadoop and other open source tools?  Apache Flink + Apache SAMOA for Machine Learning on streams http://samoa.incubator.apache.org/  Flink Integrates with Zeppelin http://zeppelin.incubator.apache.org/  Flink on Apache Tez http://tez.apache.org/  Flink + Apache MRQL http://mrql.incubator.apache.org  Flink + Tachyon http://tachyon-project.org/ Running Apache Flink on Tachyon http://tachyon-project.org/Running-Flink-on-Tachyon.html  XtreemFS http://www.xtreemfs.org/ 41
  • 42. Agenda I. What is Apache Flink stack and how it fits into the Big Data ecosystem? II. How Apache Flink integrates with Apache Hadoop and other open source tools? III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? VI. What are some key takeaways? 42
  • 43. III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? 1. Why Apache Flink is an alternative to Apache Hadoop MapReduce? 2. Why Apache Flink is an alternative to Apache Storm? 3. Why Apache Flink is an alternative to Apache Spark? 43
  • 44. 1. Why Apache Flink is an alternative to Apache Hadoop MapReduce?  Flink offers cyclic dataflows compared to the two- stage, disk-based MapReduce paradigm.  The application programming interface (API) for Flink is easier to use than programming for Hadoop’s MapReduce.  Flink is easier to test compared to MapReduce.  Flink can leverage in-memory processing, data streaming and iteration operators for faster data processing speed.  Flink can work on file systems other than Hadoop. 44
  • 45. 1. Why Apache Flink is an alternative to Apache Hadoop MapReduce?  Flink lets users work in a unified framework, allowing them to build a single data workflow that combines, for example, streaming, batch, SQL and machine learning.  Flink can analyze real-time streaming data.  Flink can process graphs using its own Gelly library.  Flink can use Machine Learning algorithms from its own FlinkML library.  Flink supports interactive queries and iterative algorithms, which are not well served by Hadoop MapReduce. 45
  • 46. 1. Why Apache Flink is an alternative to Apache Hadoop MapReduce?  Flink extends the MapReduce model with new operators: join, cross, union, iterate, iterate delta, cogroup, … (Figure: MapReduce chains Input → Map → Reduce → Output, while a Flink dataflow can freely compose operators such as Map, Join and Reduce between DataSets.) 46
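As an illustration of operators beyond Map and Reduce, a small sketch with the Scala DataSet API (the data values are made up):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val users  = env.fromElements((1, "alice"), (2, "bob"))
val visits = env.fromElements((1, "2015-06-30"), (2, "2015-06-29"), (1, "2015-06-28"))
val extra  = env.fromElements((3, "carol"))

// join: match users and visits on the user id (field 0 of both tuples)
val joined = users.join(visits).where(0).equalTo(0) {
  (user, visit) => (user._2, visit._2)
}

// union: concatenate two DataSets of the same type
val allUsers = users.union(extra)

joined.print()
allUsers.print()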
  • 47. 2. Why Apache Flink is an alternative to Apache Storm?  Higher-level and easier-to-use API  More lightweight fault tolerance strategy  Exactly-once processing guarantees  Stateful operators  Flink also supports batch processing  ‘Twitter Heron: Stream Processing at Scale’ by Twitter, or “Why Storm Sucks by Twitter themselves”!! http://dl.acm.org/citation.cfm?id=2742788  Recap of the paper: ‘Twitter Heron: Stream Processing at Scale’ - June 15th, 2015 http://blog.acolyer.org/2015/06/15/twitter-heron-stream-processing-at-scale/ 47
  • 48. 3. Why Apache Flink is an alternative to Apache Spark? 3.1. True low-latency streaming engine (not micro-batches!) that unifies batch and streaming in a single Big Data processing framework 3.2. Native closed-loop iteration operators making graph and machine learning applications run much faster 3.3. Custom memory manager (no more frequent Out Of Memory errors!) 3.4. Automatic Cost-Based Optimizer (little re-configuration and little maintenance when the cluster characteristics change and the data evolves over time) 3.5. Little tuning or configuration required 48
  • 49. 3.1 True low latency streaming engine  “I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.” – Volker Markl Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015 http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/  Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch. Batch is just treated as a finite set of streamed data.  This makes Flink the most sophisticated distributed open source Big Data processing engine (not the most mature one yet!). 49
  • 50. 3.2 Iteration Operators Why Iterations? Many Machine Learning and Graph processing algorithms need iterations! For example:  Machine Learning Algorithms • Clustering (K-Means, …) • Logistic Regression • Gradient descent  Graph Processing Algorithms • Page-Rank • Path algorithms on graphs (shortest paths, centralities, …) • Graph communities / dense sub-components • Inference (Belief propagation) 50
  • 51. 3.2 Iteration Operators  Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.  Flink executes programs with iterations as cyclic data flows: a data flow program (and all its operators) is scheduled just once and the data is fed back from the tail of an iteration to its head. 51
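For instance, a minimal sketch of the Iterate operator, here used for the classic Monte-Carlo estimation of pi (the sample count is arbitrary):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val samples = 10000

// Bulk iteration: the step function is wired once and its result is fed back 'samples' times
val initial = env.fromElements(0)
val count = initial.iterate(samples) { countSoFar =>
  countSoFar.map { c =>
    val x = Math.random()
    val y = Math.random()
    c + (if (x * x + y * y < 1) 1 else 0)   // count points that fall inside the unit circle
  }
}

val pi = count.map(_ * 4.0 / samples)
pi.print()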
  • 52. 3.2 Iteration Operators  Delta iterations run only on the parts of the data that are changing and can significantly speed up many machine learning and graph algorithms, because the work in each iteration decreases as the number of iterations goes on.  Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html 52
  • 53. 3.2 Iteration Operators Non-native iterations in Hadoop and Spark are implemented as regular for-loops driven by the client:
for (int i = 0; i < maxIterations; i++) {
  // Execute MapReduce job
}
(Figure: the client launches each iteration step as a separate job.) 53
  • 54. 3.2 Iteration Operators  Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.  Spinning Fast Iterative Data Flows - Ewen et al. 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The Apache Flink model for incremental iterative dataflow processing. Academic paper.  Recap of the paper, June 18, 2015: http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/ Documentation on iterations with Apache Flink: http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html 54
  • 55. 3.3 Custom Memory Manager Features:  C++ style memory management inside the JVM  User data stored in serialized byte arrays in JVM  Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. Advantages: 1. Flink will not throw an OOM exception on you. 2. Reduction of Garbage Collection (GC) 3. Very efficient disk spilling and network transfers 4. No Need for runtime tuning 5. More reliable and stable performance 55
  • 56. 3.3 Custom Memory Manager Flink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components. User code objects such as
public class WC {
  public String word;
  public int count;
}
are serialized into a pool of memory pages managed by Flink and used for sorting, hashing, caching, shuffling and broadcasts. (Figure: the JVM heap split into unmanaged user code objects, Flink-managed memory pages and network buffers.) 56
  • 57. 3.3 Custom Memory Manager  332 posts on Out Of Memory errors on the Apache Spark users list as of June 30th, 2015  7 posts on Out Of Memory errors on the Apache Flink users list as of June 30th, 2015  Compared to Flink, Spark is still behind in custom memory management but it is catching up with its Project Tungsten for Memory Management and Binary Processing: manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection. April 28, 2015: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html 57
  • 58. 3.4 Built-in Cost-Based Optimizer  Apache Flink comes with an optimizer that is independent of the actual programming interface.  It chooses a fitting execution strategy depending on the inputs and operations.  Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.  This helps you focus on your application logic rather than parallel execution.  Quick introduction to the Optimizer: section 6 of the paper ‘The Stratosphere platform for big data analytics’: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf 58
  • 59. 3.4 Built-in Cost-Based Optimizer What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment. (Figure: the same program gets Execution Plan A when run locally on a data sample on the laptop, Execution Plan B when run on large files on the cluster, and Execution Plan C when run a month later after the data evolved; the optimizer decides between hash vs. sort, partition vs. broadcast, caching, and reusing partition/sort.) 59
  • 60. 3.4 Built-in Cost-Based Optimizer In contrast to Flink’s built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right. Spark SQL uses the Catalyst optimizer that supports both rule-based and cost-based optimization. References: • Spark SQL: Relational Data Processing in Spark http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf • Deep Dive into Spark SQL’s Catalyst Optimizer https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html 60
  • 61. 3.5 Little tuning or configuration required  Flink requires no memory thresholds to configure  Flink manages its own memory  Flink requires no complicated network configs  Pipelining engine requires much less memory for data exchange  Flink requires no serializers to be configured  Flink handles its own type extraction and data representation  Flink programs can be adjusted to data automatically  Flink’s optimizer can choose execution strategies automatically 61
  • 62. Benchmark between Spark and Flink! http://goo.gl/WocQci  The results were published in the proceedings of the 18th International Conference, Business Information Systems 2015, Poznań, Poland, June 24-26, 2015. Chapter 3: Evaluating New Approaches of Big Data Analytics Frameworks, pages 28-37. http://goo.gl/WocQci  Apache Flink outperforms Apache Spark in the processing of machine learning & graph algorithms and also relational queries.  Apache Spark outperforms Apache Flink in batch processing. 62
  • 63. Benchmark between Spark and Flink! http://goo.gl/WocQci 63
  • 64. IV. Who is using Apache Flink? You’ll probably hear more about who is using Flink in production at the first dedicated Apache Flink conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/ 64
  • 65. V. Where to learn more about Flink? Flink at the Apache Software Foundation: flink.apache.org/ data-artisans.com @ApacheFlink, #ApacheFlink, #Flink apache-flink.meetup.com github.com/apache/flink user@flink.apache.org dev@flink.apache.org Flink Knowledge Base http://sparkbigdata.com/component/tags/tag/27-flink 65
  • 66. VI. What are some key takeaways? 1. Although most of the current buzz is about Spark, Flink offers the only hybrid open source (Real-Time Streaming + Batch) distributed data processing engine supporting many use cases. 2. I foresee more maturity of Apache Flink and more adoption, especially for use cases with Real-Time stream processing or fast iterative processing. 3. Apache Spark and Apache Flink will both have their sweet spots despite their “Me Too Syndrome”! 4. If you like what you saw about Flink, there are many ways to get involved with Flink and be an early adopter. http://flink.apache.org/how-to-contribute.html 66