SlideShare a Scribd company logo
1 of 39
Apache Flink
Fast and Reliable Large-Scale Data Processing
Fabian Hueske
fhueske@apache.org @fhueske
1
About me
Fabian Hueske
fhueske@apache.org
• PMC member of Apache Flink
• With Flink since its beginnings in 2009
– back then an academic research project called Stratosphere
What is Apache Flink?
Distributed Data Flow Processing System
• Focused on large-scale data analytics
• Real-time stream and batch processing
• Easy and powerful APIs (Java / Scala)
• Robust execution backend
3
4
Flink in the Hadoop Ecosystem
TableAPI
GellyLibrary
MLLibrary
ApacheSAMOA
Optimizer
DataSet API (Java/Scala) DataStream API (Java/Scala)
Stream Builder
Runtime
Local Cluster Yarn Apache TezEmbedded
ApacheMRQL
Dataflow
HDFS
S3JDBCHCatalog
Apache HBase Apache Kafka Apache Flume
RabbitMQ
Hadoop IO
...
Data
Sources
Execution
Environments
Flink Core
Libraries
TableAPI
What is Flink good at?
It‘s a general-purpose data analytics system
• Complex and heavy ETL jobs
• Real-time stream processing with flexible windows
• Analyzing huge graphs
• Machine-learning on large data sets
• ...
5
Flink in the ASF
• Flink entered the ASF about one year ago
– 04/2014: Incubation
– 12/2014: Graduation
• Strongly growing community
– 17 committers
– 15 PMC members
0
20
40
60
80
100
120
Nov-10 Apr-12 Aug-13 Dec-14
#unique git committers (w/o manual de-dup)
6
Where is Flink moving?
A "use-case complete" framework to unify
batch & stream processing
Data Streams
• Kafka
• RabbitMQ
• ...
“Historic” data
• HDFS
• JDBC
• ...
Analytical Workloads
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis
7
Goal: Treat batch as finite stream
HOW TO USE FLINK?
Programming Model & APIs
8
DataSets and Transformations
ExecutionEnvironment env =
ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(input);
DataSet<String> first = input
.filter (str -> str.contains(“Apache Flink“));
DataSet<String> second = first
.map(str -> str.toLowerCase());
second.print();
env.execute();
Input First Secondfilter map
9
Expressive Transformations
• Element-wise
– map, flatMap, filter, project
• Group-wise
– groupBy, reduce, reduceGroup, combineGroup,
mapPartition, aggregate, distinct
• Binary
– join, coGroup, union, cross
• Iterations
– iterate, iterateDelta
• Physical re-organization
– rebalance, partitionByHash, sortPartition
• Streaming
– window, windowMap, coMap, ...
10
Rich Type System
• Use any Java/Scala classes as a data type
– Special support for tuples, POJOs, and case classes
– Not restricted to key-value pairs
• Define (composite) keys directly on data types
– Expression
– Tuple position
– Selector function
11
Counting Words in Batch and Stream
12
case class WordCount (word: String, count: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
.map(word => WordCount(word,1))}
.window(Count.of(1000)).every(Count.of(100))
.groupBy("word").sum("count")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
.map(word => WordCount(word,1))}
.groupBy("word").sum("count")
.print()
DataSet API (batch):
DataStream API (streaming):
Table API
• Execute SQL-like expressions on table data
– Tight integration with Java and Scala APIs
– Available for batch and streaming programs
val orders = env.readCsvFile(…)
.as('oId, 'oDate, 'shipPrio)
.filter('shipPrio === 5)
val items = orders
.join(lineitems).where('oId === 'id)
.select('oId, 'oDate, 'shipPrio,
'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)
val result = items
.groupBy('oId, 'oDate, 'shipPrio)
.select('oId, 'revenue.sum, 'oDate, 'shipPrio)
13
WHAT IS HAPPENING INSIDE?
Processing Engine
14
System Architecture
Coordination
Processing
Optimization
Client (pre-flight) Master
Workers
...
Flink
Program
...
15
Lots of cool technology inside Flink
• Batch and Streaming in one system
• Memory-safe execution
• Native data flow iterations
• Cost-based data flow optimizer
• Flexible windows on data streams
• Type extraction and custom serialization utilities
• Static code analysis on user functions
• and much more...
16
STREAM AND BATCH IN ONE SYSTEM
Pipelined Data Transfer
17
Stream and Batch in one System
• Most systems do either stream or batch
• In the past, Flink focused on batch processing
– Flink‘s runtime has always done stream processing
– Operators pipeline data forward as soon as it is processed
– Some operators are blocking (such as sort)
• Pipelining gives better performance for many
batch workloads
– Avoids materialization of large intermediate results
18
Pipelined Data Transfer
Large
Input
Small
Input
Small
Input
Resultjoin
Large
Input
map
Interm.
DataSet
Build
HT
Result
Program
Pipelined
Execution
Pipeline 1
Pipeline 2
joinProbe
HT
map No intermediate
materialization!
19
Flink Data Stream Processing
• Pipelining enables Flink‘s runtime to process streams
– API to define stream programs
– Operators for stream processing
• Stream API and operators are recent contributions
– Evolving very quickly under heavy development
– Integration with batch mode
20
MEMORY SAFE EXECUTION
Memory Management and Out-of-Core Algorithms
21
Memory-safe Execution
• Challenge of JVM-based data processing systems
– OutOfMemoryErrors due to data objects on the heap
• Flink runs complex data flows without memory tuning
– C++-style memory management
– Robust out-of-core algorithms
22
Managed Memory
• Active memory management
– Workers allocate 70% of JVM memory as byte arrays
– Algorithms serialize data objects into byte arrays
– In-memory processing as long as data is small enough
– Otherwise partial destaging to disk
• Benefits
– Safe memory bounds (no OutOfMemoryError)
– Scales to very large JVMs
– Reduced GC pressure
23
Going out-of-core
Single-core join of 1KB Java objects beyond memory (4 GB)
Blue bars are in-memory, orange bars (partially) out-of-core
24
GRAPH ANALYSIS
Native Data Flow Iterations
25
Native Data Flow Iterations
• Many graph and ML algorithms require iterations
• Flink features two types of iterations
– Bulk iterations
– Delta iterations
26
2
1
5
4
3
0.1
0.5
0.2
0.4
0.7
0.3
0.9
Iterative Data Flows
• Flink runs iterations „natively“ as cyclic data flows
– No loop unrolling
– Operators are scheduled once
– Data is fed back through backflow channel
– Loop-invariant data is cached
• Operator state is preserved across iterations!
initial
input
Iteration
head
resultreducejoin
Iteration
tail
other
datasets 27
Delta Iterations
• Delta iteration computes
– Delta update of solution set
– Work set for next iteration
• Work set drives computations of next iteration
– Workload of later iterations significantly reduced
– Fast convergence
• Applicable to certain problem domains
– Graph processing
0
5000000
10000000
15000000
20000000
25000000
30000000
35000000
40000000
45000000
1 6 11 16 21 26 31 36 41 46 5
#ofelementsupdated
# of iterations
28
Iteration Performance
PageRank on Twitter Follower Graph
30 Iterations
61 Iterations (Convergence)
29
WHAT IS COMING NEXT?
Roadmap
30
Just released a new version
• 0.9.0-milestone1 preview release just got out!
• Features
– Table API
– Flink ML Library
– Gelly Library
– Flink on Tez
– …
31
Flink’s Roadmap
Mission: Unified stream and batch processing
• Exactly-once streaming semantics with
flexible state checkpointing
• Extending the ML library
• Extending graph library
• Integration with Apache Zeppelin (incubating)
• SQL on top of the Table API
• And much more…
32
tl;dr – What’s worth to remember?
• Flink is a general-purpose data analytics system
• Unifies streaming and batch processing
• Expressive high-level APIs
• Robust and fast execution engine
33
I Flink, do you? ;-)
If you find Flink exciting,
get involved and start a discussion on Flink‘s ML
or stay tuned by
subscribing to news@flink.apache.org or
following @ApacheFlink on Twitter
34
35
BACKUP
36
Data Flow Optimizer
• Database-style optimizations for parallel data flows
• Optimizes all batch programs
• Optimizations
– Task chaining
– Join algorithms
– Re-use partitioning and sorting for later operations
– Caching for iterations
37
Data Flow Optimizer
val orders = …
val lineitems = …
val filteredOrders = orders
.filter(o => dataFormat.parse(l.shipDate).after(date))
.filter(o => o.shipPrio > 2)
val lineitemsOfOrders = filteredOrders
.join(lineitems)
.where(“orderId”).equalTo(“orderId”)
.apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice))
val priceSums = lineitemsOfOrders
.groupBy(“orderDate”)
.sum(“l.extdPrice”);
38
Data Flow Optimizer
DataSource
orders.tbl
Filter DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
broadcast forward
Combine
Reduce
sort[0,1]
DataSource
orders.tbl
Filter DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
hash-part [0,1]
Reduce
sort[0,1]
Best plan
depends on
relative sizes of
input filespartial sort[0,1]
39

More Related Content

What's hot

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
Stephan Ewen
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Flink Forward
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Flink Forward
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015
Andra Lungu
 

What's hot (20)

Apache flink
Apache flinkApache flink
Apache flink
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache Flink
 
Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015Flink Gelly - Karlsruhe - June 2015
Flink Gelly - Karlsruhe - June 2015
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas Weise
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 

Viewers also liked

Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ververica
 
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Till Rohrmann
 

Viewers also liked (15)

Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015Apache Flink - Community Update January 2015
Apache Flink - Community Update January 2015
 
Apache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated QueriesApache Flink - A Sneek Preview on Language Integrated Queries
Apache Flink - A Sneek Preview on Language Integrated Queries
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!Apache Flink - Akka for the Win!
Apache Flink - Akka for the Win!
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processingTimo Walther - Table & SQL API - unified APIs for batch and stream processing
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream ...
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIs
 
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...
 
Hadoop Report
Hadoop ReportHadoop Report
Hadoop Report
 

Similar to ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
Maryann Xue
 
Advanced
AdvancedAdvanced
Advanced
mxmxm
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
Yahoo Developer Network
 

Similar to ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing (20)

Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Advanced
AdvancedAdvanced
Advanced
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)Data Analysis with Apache Flink (Hadoop Summit, 2015)
Data Analysis with Apache Flink (Hadoop Summit, 2015)
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015
 
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices   Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink:  Fast and reliable large-scale data processingJanuary 2015 HUG: Apache Flink:  Fast and reliable large-scale data processing
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing
 

Recently uploaded

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
HyderabadDolls
 
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
HyderabadDolls
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
HyderabadDolls
 

Recently uploaded (20)

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
👉 Bhilai Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl Ser...
👉 Bhilai Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl Ser...👉 Bhilai Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl Ser...
👉 Bhilai Call Girls Service Just Call 🍑👄6378878445 🍑👄 Top Class Call Girl Ser...
 
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
Belur $ Female Escorts Service in Kolkata (Adult Only) 8005736733 Escort Serv...
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
 

ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

  • 1. Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske fhueske@apache.org @fhueske 1
  • 2. About me Fabian Hueske fhueske@apache.org • PMC member of Apache Flink • With Flink since its beginnings in 2009 – back then an academic research project called Stratosphere
  • 3. What is Apache Flink? Distributed Data Flow Processing System • Focused on large-scale data analytics • Real-time stream and batch processing • Easy and powerful APIs (Java / Scala) • Robust execution backend 3
  • 4. 4 Flink in the Hadoop Ecosystem TableAPI GellyLibrary MLLibrary ApacheSAMOA Optimizer DataSet API (Java/Scala) DataStream API (Java/Scala) Stream Builder Runtime Local Cluster Yarn Apache TezEmbedded ApacheMRQL Dataflow HDFS S3JDBCHCatalog Apache HBase Apache Kafka Apache Flume RabbitMQ Hadoop IO ... Data Sources Execution Environments Flink Core Libraries TableAPI
  • 5. What is Flink good at? It‘s a general-purpose data analytics system • Complex and heavy ETL jobs • Real-time stream processing with flexible windows • Analyzing huge graphs • Machine-learning on large data sets • ... 5
  • 6. Flink in the ASF • Flink entered the ASF about one year ago – 04/2014: Incubation – 12/2014: Graduation • Strongly growing community – 17 committers – 15 PMC members 0 20 40 60 80 100 120 Nov-10 Apr-12 Aug-13 Dec-14 #unique git committers (w/o manual de-dup) 6
  • 7. Where is Flink moving? A "use-case complete" framework to unify batch & stream processing Data Streams • Kafka • RabbitMQ • ... “Historic” data • HDFS • JDBC • ... Analytical Workloads • ETL • Relational processing • Graph analysis • Machine learning • Streaming data analysis 7 Goal: Treat batch as finite stream
  • 8. HOW TO USE FLINK? Programming Model & APIs 8
  • 9. DataSets and Transformations ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); DataSet<String> input = env.readTextFile(input); DataSet<String> first = input .filter (str -> str.contains(“Apache Flink“)); DataSet<String> second = first .map(str -> str.toLowerCase()); second.print(); env.execute(); Input First Secondfilter map 9
  • 10. Expressive Transformations • Element-wise – map, flatMap, filter, project • Group-wise – groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct • Binary – join, coGroup, union, cross • Iterations – iterate, iterateDelta • Physical re-organization – rebalance, partitionByHash, sortPartition • Streaming – window, windowMap, coMap, ... 10
  • 11. Rich Type System • Use any Java/Scala classes as a data type – Special support for tuples, POJOs, and case classes – Not restricted to key-value pairs • Define (composite) keys directly on data types – Expression – Tuple position – Selector function 11
  • 12. Counting Words in Batch and Stream 12 case class WordCount (word: String, count: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ") .map(word => WordCount(word,1))} .window(Count.of(1000)).every(Count.of(100)) .groupBy("word").sum("count") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ") .map(word => WordCount(word,1))} .groupBy("word").sum("count") .print() DataSet API (batch): DataStream API (streaming):
  • 13. Table API • Execute SQL-like expressions on table data – Tight integration with Java and Scala APIs – Available for batch and streaming programs val orders = env.readCsvFile(…) .as('oId, 'oDate, 'shipPrio) .filter('shipPrio === 5) val items = orders .join(lineitems).where('oId === 'id) .select('oId, 'oDate, 'shipPrio, 'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue) val result = items .groupBy('oId, 'oDate, 'shipPrio) .select('oId, 'revenue.sum, 'oDate, 'shipPrio) 13
  • 14. WHAT IS HAPPENING INSIDE? Processing Engine 14
  • 16. Lots of cool technology inside Flink • Batch and Streaming in one system • Memory-safe execution • Native data flow iterations • Cost-based data flow optimizer • Flexible windows on data streams • Type extraction and custom serialization utilities • Static code analysis on user functions • and much more... 16
  • 17. STREAM AND BATCH IN ONE SYSTEM Pipelined Data Transfer 17
  • 18. Stream and Batch in one System • Most systems do either stream or batch • In the past, Flink focused on batch processing – Flink‘s runtime has always done stream processing – Operators pipeline data forward as soon as it is processed – Some operators are blocking (such as sort) • Pipelining gives better performance for many batch workloads – Avoids materialization of large intermediate results 18
  • 20. Flink Data Stream Processing • Pipelining enables Flink‘s runtime to process streams – API to define stream programs – Operators for stream processing • Stream API and operators are recent contributions – Evolving very quickly under heavy development – Integration with batch mode 20
  • 21. MEMORY SAFE EXECUTION Memory Management and Out-of-Core Algorithms 21
  • 22. Memory-safe Execution • Challenge of JVM-based data processing systems – OutOfMemoryErrors due to data objects on the heap • Flink runs complex data flows without memory tuning – C++-style memory management – Robust out-of-core algorithms 22
  • 23. Managed Memory • Active memory management – Workers allocate 70% of JVM memory as byte arrays – Algorithms serialize data objects into byte arrays – In-memory processing as long as data is small enough – Otherwise partial destaging to disk • Benefits – Safe memory bounds (no OutOfMemoryError) – Scales to very large JVMs – Reduced GC pressure 23
  • 24. Going out-of-core Single-core join of 1KB Java objects beyond memory (4 GB) Blue bars are in-memory, orange bars (partially) out-of-core 24
  • 25. GRAPH ANALYSIS Native Data Flow Iterations 25
  • 26. Native Data Flow Iterations • Many graph and ML algorithms require iterations • Flink features two types of iterations – Bulk iterations – Delta iterations 26 2 1 5 4 3 0.1 0.5 0.2 0.4 0.7 0.3 0.9
  • 27. Iterative Data Flows • Flink runs iterations „natively“ as cyclic data flows – No loop unrolling – Operators are scheduled once – Data is fed back through backflow channel – Loop-invariant data is cached • Operator state is preserved across iterations! initial input Iteration head resultreducejoin Iteration tail other datasets 27
  • 28. Delta Iterations • Delta iteration computes – Delta update of solution set – Work set for next iteration • Work set drives computations of next iteration – Workload of later iterations significantly reduced – Fast convergence • Applicable to certain problem domains – Graph processing 0 5000000 10000000 15000000 20000000 25000000 30000000 35000000 40000000 45000000 1 6 11 16 21 26 31 36 41 46 5 #ofelementsupdated # of iterations 28
  • 29. Iteration Performance PageRank on Twitter Follower Graph 30 Iterations 61 Iterations (Convergence) 29
  • 30. WHAT IS COMING NEXT? Roadmap 30
  • 31. Just released a new version • 0.9.0-milestone1 preview release just got out! • Features – Table API – Flink ML Library – Gelly Library – Flink on Tez – … 31
  • 32. Flink’s Roadmap Mission: Unified stream and batch processing • Exactly-once streaming semantics with flexible state checkpointing • Extending the ML library • Extending graph library • Integration with Apache Zeppelin (incubating) • SQL on top of the Table API • And much more… 32
  • 33. tl;dr – What’s worth to remember? • Flink is a general-purpose data analytics system • Unifies streaming and batch processing • Expressive high-level APIs • Robust and fast execution engine 33
  • 34. I Flink, do you? ;-) If you find Flink exciting, get involved and start a discussion on Flink‘s ML or stay tuned by subscribing to news@flink.apache.org or following @ApacheFlink on Twitter 34
  • 35. 35
  • 37. Data Flow Optimizer • Database-style optimizations for parallel data flows • Optimizes all batch programs • Optimizations – Task chaining – Join algorithms – Re-use partitioning and sorting for later operations – Caching for iterations 37
  • 38. Data Flow Optimizer val orders = … val lineitems = … val filteredOrders = orders .filter(o => dataFormat.parse(l.shipDate).after(date)) .filter(o => o.shipPrio > 2) val lineitemsOfOrders = filteredOrders .join(lineitems) .where(“orderId”).equalTo(“orderId”) .apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice)) val priceSums = lineitemsOfOrders .groupBy(“orderDate”) .sum(“l.extdPrice”); 38
  • 39. Data Flow Optimizer DataSource orders.tbl Filter DataSource lineitem.tbl Join Hybrid Hash buildHT probe broadcast forward Combine Reduce sort[0,1] DataSource orders.tbl Filter DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] hash-part [0,1] Reduce sort[0,1] Best plan depends on relative sizes of input filespartial sort[0,1] 39