SlideShare a Scribd company logo
Apache Flink™ deep-dive
Unified Batch and Stream Processing
Robert Metzger
@rmetzger_
Hadoop Summit 2015,
San Jose, CA
Flink’s Recent History
April 2014 April 2015Dec 2014
Top Level
Project
Graduation
0.70.60.5 0.90.9-m1
What is Flink
3
Gelly
Table
ML
SAMOA
DataSet (Java/Scala) DataStream
HadoopM/R
Local Remote YARN Tez Embedded
Dataflow
Dataflow(WiP)
MRQL
Table
Cascading(WiP)
Streaming dataflow runtime
Zeppelin
Program compilation
4
case class Path (from: Long, to:
Long)
val tc = edges.iterate(10) {
paths: DataSet[Path] =>
val next = paths
.join(edges)
.where("to")
.equalTo("from") {
(path, edge) =>
Path(path.from, edge.to)
}
.union(paths)
.distinct()
next
}
Optimizer
Type extraction
stack
Task
scheduling
Dataflow
metadata
Pre-flight (Client)
Master
Workers
Data Source
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow Graph
Independent of
batch or
streaming job
deploy
operators
track
intermediate
results
 Layered Architecture
allows plugging of
components
Native workload support
5
Flink
Streaming
topologies
Long batch
pipelines
Machine Learning at scale
How can an engine natively support all these workloads?
And what does "native" mean?
Graph Analysis
 Low latency
 resource utilization  iterative algorithms
 Mutable state
E.g.: Non-native iterations
6
Step Step Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
// Execute MapReduce job
}
 Teaching an old elephant new tricks
 Treat system as a black box
E.g.: Non-native streaming
7
stream
discretizer
Job Job Job Job
while (true) {
// get next few records
// issue batch job
}
Data Stream
 Simulate stream processor with batch system
Native workload support
8
Flink
Streaming
topologies
Long batch
pipelines
Machine Learning at scale
How can an engine natively support all these workloads?
And what does "native" mean?
Graph Analysis
 Low latency
 resource utilization  iterative algorithms
 Mutable state
Ingredients for “native” support
1. Execute everything as streams
Pipelined execution, push model
2. Special code paths for batch
Automatic job optimization, fault tolerance
3. Allow some iterative (cyclic) dataflows
4. Allow some mutable state
5. Operate on managed memory
Make data processing on the JVM robust
9
Flink by Use Case
10
Stream data processing
streaming dataflows
11
Full talk tomorrow:
3:10PM, Grand Ballroom 220A
Stream processing with Flink
Pipelined stream processor
12
Streaming
Shuffle!
 Low latency
 Operators push data
forward
Expressive APIs
13
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
.groupBy("word").sum("frequency")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
.groupBy("word").sum("frequency")
.print()
DataSet API (batch):
DataStream API (streaming):
Checkpointing / Recovery
14
Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriers
through the data flow
Data Stream
barrier
Before barrier =
part of the snapshot
After barrier =
Not in snapshot
(backup till next snapshot)
 Guarantees exactly-once
processing
Batch processing
Batch on Streaming
15
Batch on an streaming engine
16
File in HDFS
Filter Map Result 1
Map Result 2
 Batch program, completely pipelined
 Data is never materialized anywhere (in this example)
Batch on an streaming engine
Map
Operator
Map
Operator
Map
Operator
17
Data
Source
(small)
Stream
Data
Sink
Data
Sink
Data
Sink
Join
Operator
in parallel
Data
Source
(large)
Data
Sink
in parallel (once build side finished)
Map
Batch processing requirements
 Get the data processed as fast as possible
• Automatic job optimizer
• Efficient memory management
 Robust processing
• provide fault-tolerance
• again, memory management
18
Optimizer
 Cost-based optimizer
 Select data shipping strategy (forward, partition, broadcast)
 Local execution (sort merge join/hash join)
 Caching of loop invariant data (iterations)
19
case class Path (from: Long, to:
Long)
val tc = edges.iterate(10) {
paths: DataSet[Path] =>
val next = paths
.join(edges)
.where("to")
.equalTo("from") {
(path, edge) =>
Path(path.from, edge.to)
}
.union(paths)
.distinct()
next
}
Optimizer
Type extraction
stack
Pre-flight (Client)
Data
Source
orders.tbl
Filter
Map
DataSourc
e
lineitem.tbl
Join
Hybrid Hash
build
HT
probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
Two execution plans
20
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
broadcast forward
Combine
GroupRed
sort
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
hash-part [0,1]
GroupRed
sort
forward
Best plan
depends on
relative sizes
of input files
Memory Management
21
Operators on managed memory
22
Smooth out-of-core performance
23
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
Blue bars are in-memory, orange bars (partially) out-of-core
Machine Learning Algorithms
Iterative data flows
24
Iterate in the Dataflow
26
 API and runtime support
 Automatic caching of loop invariant
data
IterationState state =
getInitialState();
while (!terminationCriterion()) {
state = step(state);
}
setFinalState(state);
Example: Matrix Factorization
27
Factorizing a matrix with
28 billion ratings for
recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
Setups:
• 40 medium instances ("n1-highmem-8" - 8
cores, 52 GB)
• 40 large instances ("n1-highmem-16" - 16
cores, 104 GB)
Flink ML – Machine Learning
 Provide a complete toolchain
• scikit-learn style pipelining
• Data pre-processing
 various algorithms
• Recommendations: ALS
• Supervised learning: Support Vector Machines
• …
 ML on streams: SAMOA. We are planning to add support for
streaming into ML
28
Graph Analysis
Stateful Iterations
29
Graph processing characteristics
0
5000000
10000000
15000000
20000000
25000000
30000000
35000000
40000000
45000000
1 6 11 16 21 26 31 36 41 46 51 56 61
#ofelementsupdated
iteration
Iterate natively with state/deltas
31
 Keep state in an controlled way by having a partitioned hash-
map
 Relax immutability assumption of batch processing
… fast graph analysis
32More at: http://data-artisans.com/data-analysis-with-flink.html
Gelly – Graph Processing API
33
 Transformations: map, filter, subgraph, union, reverse,
undirected
 Mutations: add vertex/edge, remove …
 Pregel style vertex centric iterations
 Library of algorithms
 Utilities: Special data types, loading, graph properties
Gelly and Flink ML:
 Available in Flink 0.9 (so far only beta release)
 Still under heavy development
 Seamlessly integrate with DataSet abstraction
Preprocess data as needed
Use results as needed
 Easy entry point for new contributors
34
Closing
35
Flink Meetup Groups
 SF Spark and Friends
• June 16, San Francisco
 Bay Area Flink Meetup
• June 17, Redwood City
 Chicago Flink Meetup
• June 30
 Stockholm, Sweden
 Berlin, Germany
36
Flink Forward registration & call for
abstracts is open now
flink.apache.org 37
• 12/13 October 2015
• Meet developers and users of Flink!
• With Flink Workshops / Trainings!
flink.apache.org
@ApacheFlink

More Related Content

What's hot

Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
Flink Forward
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Flink Forward
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
felixcss
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Spark Summit
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
Spark Summit
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
DB Tsai
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
Databricks
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in Hadoop
Botond Balázs
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindingsDmitriy Lyubimov
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
Datio Big Data
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Spark Summit
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
Databricks
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
Spark Summit
 

What's hot (20)

Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Data Storage Formats in Hadoop
Data Storage Formats in HadoopData Storage Formats in Hadoop
Data Storage Formats in Hadoop
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Designing Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache SparkDesigning Distributed Machine Learning on Apache Spark
Designing Distributed Machine Learning on Apache Spark
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 

Similar to Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA

Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
DataWorks Summit
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
Stephan Ewen
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
Kostas Tzoumas
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Robert Metzger
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Robert Metzger
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Till Rohrmann
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
mxmxm
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonStephan Ewen
 
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Ontico
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Gyula Fóra
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
Leonardo Gamas
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Spark Summit
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
DataWorks Summit
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
Vasia Kalavri
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Stephan Ewen
 

Similar to Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA (20)

Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
First Flink Bay Area meetup
First Flink Bay Area meetupFirst Flink Bay Area meetup
First Flink Bay Area meetup
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
 
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Int...
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
Tarantool как платформа для микросервисов / Антон Резников, Владимир Перепели...
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
Spark: Taming Big Data
Spark: Taming Big DataSpark: Taming Big Data
Spark: Taming Big Data
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data ProcessingApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 

More from Robert Metzger

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
Robert Metzger
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
Robert Metzger
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
Robert Metzger
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin Meetup
Robert Metzger
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
Robert Metzger
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Robert Metzger
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
Robert Metzger
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
Robert Metzger
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
Robert Metzger
 
Flink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewFlink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in Review
Robert Metzger
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
Robert Metzger
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
Robert Metzger
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Robert Metzger
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Robert Metzger
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
Robert Metzger
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Robert Metzger
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
Robert Metzger
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
Robert Metzger
 
Berlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateBerlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community Update
Robert Metzger
 
Flink Community Update April 2015
Flink Community Update April 2015Flink Community Update April 2015
Flink Community Update April 2015
Robert Metzger
 

More from Robert Metzger (20)

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin Meetup
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
Flink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in ReviewFlink Community Update December 2015: Year in Review
Flink Community Update December 2015: Year in Review
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Berlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateBerlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community Update
 
Flink Community Update April 2015
Flink Community Update April 2015Flink Community Update April 2015
Flink Community Update April 2015
 

Recently uploaded

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 

Recently uploaded (20)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 

Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA

  • 1. Apache Flink™ deep-dive Unified Batch and Stream Processing Robert Metzger @rmetzger_ Hadoop Summit 2015, San Jose, CA
  • 2. Flink’s Recent History April 2014 April 2015Dec 2014 Top Level Project Graduation 0.70.60.5 0.90.9-m1
  • 3. What is Flink 3 Gelly Table ML SAMOA DataSet (Java/Scala) DataStream HadoopM/R Local Remote YARN Tez Embedded Dataflow Dataflow(WiP) MRQL Table Cascading(WiP) Streaming dataflow runtime Zeppelin
  • 4. Program compilation 4 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Task scheduling Dataflow metadata Pre-flight (Client) Master Workers Data Source orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph Independent of batch or streaming job deploy operators track intermediate results  Layered Architecture allows plugging of components
  • 5. Native workload support 5 Flink Streaming topologies Long batch pipelines Machine Learning at scale How can an engine natively support all these workloads? And what does "native" mean? Graph Analysis  Low latency  resource utilization  iterative algorithms  Mutable state
  • 6. E.g.: Non-native iterations 6 Step Step Step Step Step Client for (int i = 0; i < maxIterations; i++) { // Execute MapReduce job }  Teaching an old elephant new tricks  Treat system as a black box
  • 7. E.g.: Non-native streaming 7 stream discretizer Job Job Job Job while (true) { // get next few records // issue batch job } Data Stream  Simulate stream processor with batch system
  • 8. Native workload support 8 Flink Streaming topologies Long batch pipelines Machine Learning at scale How can an engine natively support all these workloads? And what does "native" mean? Graph Analysis  Low latency  resource utilization  iterative algorithms  Mutable state
  • 9. Ingredients for “native” support 1. Execute everything as streams Pipelined execution, push model 2. Special code paths for batch Automatic job optimization, fault tolerance 3. Allow some iterative (cyclic) dataflows 4. Allow some mutable state 5. Operate on managed memory Make data processing on the JVM robust 9
  • 10. Flink by Use Case 10
  • 11. Stream data processing streaming dataflows 11 Full talk tomorrow: 3:10PM, Grand Ballroom 220A Stream processing with Flink
  • 12. Pipelined stream processor 12 Streaming Shuffle!  Low latency  Operators push data forward
  • 13. Expressive APIs 13 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .groupBy("word").sum("frequency") .print() DataSet API (batch): DataStream API (streaming):
  • 14. Checkpointing / Recovery 14 Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots Pushes checkpoint barriers through the data flow Data Stream barrier Before barrier = part of the snapshot After barrier = Not in snapshot (backup till next snapshot)  Guarantees exactly-once processing
  • 16. Batch on an streaming engine 16 File in HDFS Filter Map Result 1 Map Result 2  Batch program, completely pipelined  Data is never materialized anywhere (in this example)
  • 17. Batch on an streaming engine Map Operator Map Operator Map Operator 17 Data Source (small) Stream Data Sink Data Sink Data Sink Join Operator in parallel Data Source (large) Data Sink in parallel (once build side finished) Map
  • 18. Batch processing requirements  Get the data processed as fast as possible • Automatic job optimizer • Efficient memory management  Robust processing • provide fault-tolerance • again, memory management 18
  • 19. Optimizer  Cost-based optimizer  Select data shipping strategy (forward, partition, broadcast)  Local execution (sort merge join/hash join)  Caching of loop invariant data (iterations) 19 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Pre-flight (Client) Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph
  • 20. Two execution plans 20 DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe broadcast forward Combine GroupRed sort DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] hash-part [0,1] GroupRed sort forward Best plan depends on relative sizes of input files
  • 22. Operators on managed memory 22
  • 23. Smooth out-of-core performance 23 More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html Blue bars are in-memory, orange bars (partially) out-of-core
  • 25. Iterate in the Dataflow 26  API and runtime support  Automatic caching of loop invariant data IterationState state = getInitialState(); while (!terminationCriterion()) { state = step(state); } setFinalState(state);
  • 26. Example: Matrix Factorization 27 Factorizing a matrix with 28 billion ratings for recommendations More at: http://data-artisans.com/computing-recommendations-with-flink.html Setups: • 40 medium instances ("n1-highmem-8" - 8 cores, 52 GB) • 40 large instances ("n1-highmem-16" - 16 cores, 104 GB)
  • 27. Flink ML – Machine Learning  Provide a complete toolchain • scikit-learn style pipelining • Data pre-processing  various algorithms • Recommendations: ALS • Supervised learning: Support Vector Machines • …  ML on streams: SAMOA. We are planning to add support for streaming into ML 28
  • 30. Iterate natively with state/deltas 31  Keep state in an controlled way by having a partitioned hash- map  Relax immutability assumption of batch processing
  • 31. … fast graph analysis 32More at: http://data-artisans.com/data-analysis-with-flink.html
  • 32. Gelly – Graph Processing API 33  Transformations: map, filter, subgraph, union, reverse, undirected  Mutations: add vertex/edge, remove …  Pregel style vertex centric iterations  Library of algorithms  Utilities: Special data types, loading, graph properties
  • 33. Gelly and Flink ML:  Available in Flink 0.9 (so far only beta release)  Still under heavy development  Seamlessly integrate with DataSet abstraction Preprocess data as needed Use results as needed  Easy entry point for new contributors 34
  • 35. Flink Meetup Groups  SF Spark and Friends • June 16, San Francisco  Bay Area Flink Meetup • June 17, Redwood City  Chicago Flink Meetup • June 30  Stockholm, Sweden  Berlin, Germany 36
  • 36. Flink Forward registration & call for abstracts is open now flink.apache.org 37 • 12/13 October 2015 • Meet developers and users of Flink! • With Flink Workshops / Trainings!

Editor's Notes

  1. 40 minutes
  2. Flink is an entire software stack the heart: streaming dataflow engine: think of programs as operators and data flows Kappa architecture: run batch programs on a streaming system Table API: logical representation, sql-style Samoa “on-line learners”
  3. toy program: native transitive closure type extraction: types that go in and out of each operator
  4. Flink is an analytical system streaming topology: real-time; low latency “native”: build-in support in the system, no working around, no black-box next slide: define native by some “non-native” examples
  5. Used for Machine Learning run the same job over the data multiple times to come up with parameters for a ml model this is how you do it when treating the engine as a black box
  6. If you only have a batch processor: do a lot of small batch jobs LIMITATION: state across the small jobs (batches)
  7. Flink is an analytical system streaming topology: real-time; low latency “native”: build-in support in the system, no working around, no black-box next slide: define native by some “non-native” examples
  8. Corner points / requirements for flink keep data in motion, avoid materialization even though it’s a streaming runtime, have special paths for batch: OPTIMIZER, CHECKPOINTING make the system aware of cyclic data flows, in a controlled way allow operators to have some state, in a controlled way (DELTA-ITERATIONS). relax “traditional” batch assumption flink runs in the jvm, but we want control over memory, not rely on GC
  9. Explain flink by use case
  10. pipelined execution: logical way you would go for low latency. no synchronization barriers, records keep flowing streaming shuffle (for example w/ hash code) push model maintain state inside long lived operators (!= mini batch)
  11. nice, fluent APIs known from the batch world window definitions
  12. Low overhead snapshots using “batched” snapshots Exactly once processing guarantees (without doing mini batches) How does it work: periodically push barriers through the streams. when a barrier reaches an operator, snapshot state when barrier reaches sinks, a checkpoint is completed (secured) multiple parallel checkpoints chop stream into generations (pre checkpointed, post checkpointed)
  13. structure, different title
  14. Measure effect of pipeline parallelism blocking happens in operators (join build side)
  15. in flink operators are running at the same time  need to control memory
  16. example: hash join robust in memory graceful behavior
  17. needed for machine learning
  18. function that encapsulate the transformation “Hadoop job” deploy this once, keep running across iterations (also you can keep state) allow to feed back data to the beginning
  19. 1 Tb of input data many terabytes of intermediate data 40 machines cluster @ google compute
  20. SVM = supervised learning ALS = Recommendation
  21. keep mutable state in a controlled way by having a hash map locally on each machine great documentation on flink website
  22. Weakly connected components, page rank, label propagation,