Google Cloud Dataflow
On Top of Apache Flink
Maximilian Michels
mxm@apache.org
@stadtlegende
Contents
§  Google Cloud Dataflow and Flink
§  The Dataflow API
§  From Dataflow to Flink
§  Translating Dataflow Map/Reduce
§  Demo
Google Cloud Dataflow
§  Developed by Google
§  Based on the concepts of
•  FlumeJava (batch)
•  MillWheel (streaming)
§  Perfect integration into Google’s infrastructure
and services
•  Google Compute Engine
•  Google Cloud Storage
•  Google BigQuery
•  Resource management
•  Monitoring
•  Optimization
Motivation
§  Execute on the Google Cloud Platform
•  Very fast and dynamic infrastructure
•  Scale in and out as you wish
•  Make use of Google’s provided services
§  Execute using Apache Flink
•  Run your own infrastructure (avoid lock-in)
•  Control your data and software
•  Extend it using open source components
§  Wouldn’t it be great if you could choose?
•  Unified batch and streaming API
•  Similar concepts in batch and streaming
•  More options
The Dataflow API
PCollection
A parallel collection of records which can be either bounded (batch) or
unbounded (streaming)
PTransform
A transformation that can be applied to a parallel collection
Pipeline
A data structure for holding the dataflow graph
PipelineRunner
A parallel execution engine, e.g. DirectPipelineRunner, DataflowPipelineRunner, or
FlinkPipelineRunner
WordCount in Dataflow #1
public static void main(String[] args) {
  DataflowPipelineOptions options = PipelineOptionsFactory.create()
      .as(DataflowPipelineOptions.class);
  options.setRunner(DataflowPipelineRunner.class);
  Pipeline p = Pipeline.create(options);
  p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
      .apply(new CountWords())
      // Note: TextIO.Write expects strings; the step that formats each
      // KV<String, Long> as text is omitted on this slide.
      .apply(TextIO.Write.to("gs://my-bucket/wordcounts"));
  p.run();
}
WordCount in Dataflow #2
public static class CountWords extends
    PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> apply(
      PCollection<String> lines) {
    // Convert lines of text into individual words.
    PCollection<String> words = lines.apply(
        ParDo.of(new ExtractWordsFn()));
    // Count the number of times each word occurs.
    PCollection<KV<String, Long>> wordCounts =
        words.apply(Count.perElement());
    return wordCounts;
  }
}
WordCount in Dataflow #3
public static class ExtractWordsFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext context) {
    String[] words = context.element().split("[^a-zA-Z']+");
    for (String word : words) {
      if (!word.isEmpty()) {
        context.output(word);
      }
    }
  }
}
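To see what the split in ExtractWordsFn produces, here is a minimal plain-Java sketch that applies the same regular expression outside of Dataflow. The class ExtractWordsSketch and its method extractWords are hypothetical helpers for illustration, not part of the Dataflow SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone helper; it mirrors the loop in ExtractWordsFn
// but does not use any Dataflow classes.
class ExtractWordsSketch {

    // Splits a line on runs of characters that are neither letters nor
    // apostrophes and drops empty tokens, as the DoFn above does.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("  To be, or not to be: that's the question."));
    }
}
```

The isEmpty check matters because a line that starts with a delimiter yields an empty first token from String.split.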
WordCount in Dataflow #4
public static class PerElement<T>
    extends PTransform<PCollection<T>, PCollection<KV<T, Long>>> {
  @Override
  public PCollection<KV<T, Long>> apply(PCollection<T> input) {
    // Pair every element with a null value, then count the pairs per key.
    return input.apply(ParDo.of(new DoFn<T, KV<T, Void>>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(KV.of(c.element(), (Void) null));
          }
        }))
        .apply(Count.perKey());
  }
}
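The two steps inside PerElement (key every element, then count per key) can be sketched with plain Java collections. This is only an illustration of the idea: PerElementSketch, toKeyedPairs and countPerKey are hypothetical names, and java.util.Map.Entry stands in for Dataflow's KV.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of Count.perElement(); no Dataflow classes involved.
class PerElementSketch {

    // Step 1: turn every element into an (element, null) pair, like the anonymous DoFn.
    static <T> List<Map.Entry<T, Void>> toKeyedPairs(List<T> input) {
        List<Map.Entry<T, Void>> pairs = new ArrayList<>();
        for (T element : input) {
            pairs.add(new SimpleEntry<T, Void>(element, null));
        }
        return pairs;
    }

    // Step 2: count how many pairs share each key, standing in for Count.perKey().
    static <T extends Comparable<T>> Map<T, Long> countPerKey(List<Map.Entry<T, Void>> pairs) {
        Map<T, Long> counts = new TreeMap<>();
        for (Map.Entry<T, Void> pair : pairs) {
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    static <T extends Comparable<T>> Map<T, Long> countPerElement(List<T> input) {
        return countPerKey(toKeyedPairs(input));
    }

    public static void main(String[] args) {
        System.out.println(countPerElement(Arrays.asList("to", "be", "or", "not", "to", "be")));
    }
}
```

The null value carries no information; it exists only so that a per-key counter can be reused for plain elements.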
From Dataflow to Flink
public class MinimalWordCount {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.create()
        .as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);
    // Create the Pipeline object with the options we defined above.
    Pipeline p = Pipeline.create(options);
    // Apply the pipeline's transforms.
    p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
        .apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
          private static final long serialVersionUID = 0;
          @Override
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word);
              }
            }
          }
        }))
        .apply(Count.<String>perElement())
        .apply(ParDo.named("FormatResults").of(new DoFn<KV<String, Long>, String>() {
          private static final long serialVersionUID = 0;
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply(TextIO.Write.to("gs://my-bucket/wordcounts"));
    // Run the pipeline.
    p.run();
  }
}
Dataflow          Flink
PCollection       DataSet / DataStream
PTransform        Operator
Pipeline          ExecutionEnvironment
PipelineRunner    Flink!
The Dataflow SDK
§  Apache 2.0 licensed
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
§  Only Java (for now)
§  1.0.0 released in June
§  Built with modularity in mind
§  Execution engine can be exchanged
§  Pipeline can be traversed by a visitor
§  Custom runners can change the translation
and execution process
A Dataflow is an AST
[Figure: tree with a "Dataflow Program" root node and six nested "Transform" nodes]
The WordCount AST
[Figure: WordCount transform tree. Node labels, in reading order: RootTransform; TextIO.Read (ReadLines); CountWords; ParDo (ExtractWords); Count.PerElement; ParDo (Init); Combine.PerKey (Sum.PerKey); GroupByKey; GroupByKeyOnly; GroupedValues; ParDo; ParDo (Format Counts); TextIO.Write (WriteCounts)]
The WordCount Dataflow
[Figure: execution graph with nodes TextIO.Read (ReadLines) → ExtractWords → GroupByKey → Combine.PerKey (Sum.PerKey) → ParDo (Format Counts) → TextIO.Write (WriteCounts)]
Dataflow Translation
§  AST converted to Execution DAG
[Figure: the WordCount transform tree, from RootTransform down to TextIO.Write (WriteCounts), repeated from the WordCount AST slide]
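How a visitor can walk the nested transform tree and keep only the leaves that an engine has to execute can be sketched in plain Java. Everything here is a hypothetical model: Node, Visitor and the callback names are made up for illustration and are not the SDK's classes. The tree follows the CountWords and PerElement code shown earlier and stops at Combine.PerKey rather than expanding it further.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a transform tree and a visitor over it.
class TransformTreeSketch {

    // A named transform with zero or more nested transforms.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        boolean isComposite() {
            return !children.isEmpty();
        }
    }

    // Callbacks for composite and primitive nodes.
    interface Visitor {
        void enterComposite(Node node);
        void leaveComposite(Node node);
        void visitPrimitive(Node node);
    }

    // Depth-first traversal that reports nodes to the visitor in program order.
    static void traverse(Node node, Visitor visitor) {
        if (node.isComposite()) {
            visitor.enterComposite(node);
            for (Node child : node.children) {
                traverse(child, visitor);
            }
            visitor.leaveComposite(node);
        } else {
            visitor.visitPrimitive(node);
        }
    }

    // Collects only the primitive transforms, i.e. the chain an engine would run.
    static List<String> primitives(Node root) {
        List<String> result = new ArrayList<>();
        traverse(root, new Visitor() {
            @Override public void enterComposite(Node node) { }
            @Override public void leaveComposite(Node node) { }
            @Override public void visitPrimitive(Node node) { result.add(node.name); }
        });
        return result;
    }

    static Node wordCount() {
        return new Node("RootTransform",
            new Node("TextIO.Read"),
            new Node("CountWords",
                new Node("ParDo(ExtractWords)"),
                new Node("Count.PerElement",
                    new Node("ParDo(Init)"),
                    new Node("Combine.PerKey"))),
            new Node("TextIO.Write"));
    }

    public static void main(String[] args) {
        System.out.println(primitives(wordCount()));
    }
}
```

A translating runner would do real work in visitPrimitive (emit an operator) instead of recording a name; the enter and leave callbacks let it decide whether to translate a composite as a whole or descend into it.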
The WordCount Flink Plan
Implementing Map/Reduce
Implement a translation
1.  Find out which transform to translate
•  ParDo.Bound
•  Combine.PerKey
2.  Implement TransformTranslator
•  ParDoTranslator
•  CombineTranslator
3.  Register TransformTranslator
•  Translators.add(ParDo, ParDoTranslator)
•  Translators.add(Combine, CombineTranslator)
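The three steps above (pick a transform, implement a translator, register it) amount to a lookup table from transform class to translator. The following is a minimal sketch under that assumption; ParDoBound, CombinePerKey, TransformTranslator and register are hypothetical stand-ins, and the translators return only the Flink operator name as a string instead of building an operator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical translator registry; none of these types come from the SDK or the runner.
class TranslatorRegistrySketch {

    // Stand-ins for Dataflow transform classes.
    static class ParDoBound { }
    static class CombinePerKey { }
    static class Unsupported { }

    // Turns one transform into (here) the name of a Flink operator.
    interface TransformTranslator<T> {
        String translate(T transform);
    }

    // Registry keyed by transform class, filled once at class-load time.
    private static final Map<Class<?>, TransformTranslator<?>> TRANSLATORS = new HashMap<>();

    static <T> void register(Class<T> type, TransformTranslator<T> translator) {
        TRANSLATORS.put(type, translator);
    }

    static {
        register(ParDoBound.class, transform -> "Map");
        register(CombinePerKey.class, transform -> "Reduce");
    }

    // Looks up the translator for a transform; fails loudly if none was registered.
    @SuppressWarnings("unchecked")
    static <T> String translate(T transform) {
        TransformTranslator<T> translator =
            (TransformTranslator<T>) TRANSLATORS.get(transform.getClass());
        if (translator == null) {
            throw new UnsupportedOperationException(
                "No translator registered for " + transform.getClass().getSimpleName());
        }
        return translator.translate(transform);
    }

    public static void main(String[] args) {
        System.out.println(translate(new ParDoBound()));
        System.out.println(translate(new CombinePerKey()));
    }
}
```

Failing on an unregistered transform makes gaps in the supported set visible at translation time rather than at run time.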
ParDo → Map
§  A ParDo carries a DoFn, which performs
the map and contains the user code
1.  Create a FlinkDoFnFunction which wraps
the DoFn
2.  Create a translation that uses this function
as the function of a Flink MapPartitionOperator
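Step 1: ParDo → Map
public class FlinkDoFnFunction<IN, OUT> extends
    RichMapPartitionFunction<IN, OUT> {
  private final DoFn<IN, OUT> doFn;
  // Not serialized with the function; kept transient.
  private transient PipelineOptions options;

  // Takes the pipeline options as well, matching the call in Step 2.
  public FlinkDoFnFunction(DoFn<IN, OUT> doFn, PipelineOptions options) {
    this.doFn = doFn;
    this.options = options;
  }

  @Override
  public void mapPartition(Iterable<IN> values, Collector<OUT> out) {
    for (IN value : values) {
      // Simplified for the slide: DoFn.processElement actually takes a
      // ProcessContext that wraps the current element and the Collector.
      doFn.processElement(value);
    }
  }
}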
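The wrapping idea can be tried without Flink or Dataflow on the classpath. In this sketch ElementFn plays the role of a DoFn and PartitionWrapper the role of FlinkDoFnFunction; both are hypothetical types, and java.util.function.Consumer stands in for Flink's Collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of wrapping a per-element function in a per-partition function.
class DoFnWrapperSketch {

    // Per-element user function: may emit zero or more outputs per input.
    interface ElementFn<IN, OUT> {
        void process(IN element, Consumer<OUT> output);
    }

    // Partition-level function: sees a whole partition and an output collector.
    static final class PartitionWrapper<IN, OUT> {
        private final ElementFn<IN, OUT> fn;

        PartitionWrapper(ElementFn<IN, OUT> fn) {
            this.fn = fn;
        }

        void mapPartition(Iterable<IN> values, Consumer<OUT> out) {
            for (IN value : values) {
                // Forward everything the user function emits to the collector.
                fn.process(value, out);
            }
        }
    }

    static List<String> run(List<String> lines) {
        ElementFn<String, String> extractWords = (line, output) -> {
            for (String word : line.split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    output.accept(word);
                }
            }
        };
        List<String> collected = new ArrayList<>();
        new PartitionWrapper<>(extractWords).mapPartition(lines, collected::add);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be", "or not to be")));
    }
}
```

A partition-level function fits here because one input element may produce any number of outputs, which a strict one-to-one map cannot express.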
Step 2: ParDo → Map
private static class ParDoBoundTranslator<IN, OUT> implements
    FlinkPipelineTranslator.TransformTranslator<ParDo.Bound<IN, OUT>> {
  @Override
  public void translateNode(ParDo.Bound<IN, OUT> transform,
      TranslationContext context) {
    // Look up the Flink DataSet that backs the transform's input.
    DataSet<IN> inputDataSet = context.getInputDataSet(transform.getInput());
    final DoFn<IN, OUT> doFn = transform.getFn();
    TypeInformation<OUT> typeInformation =
        context.getTypeInfo(transform.getOutput());
    // Wrap the user's DoFn so that Flink can run it.
    FlinkDoFnFunction<IN, OUT> fnWrapper =
        new FlinkDoFnFunction<>(doFn, context.getPipelineOptions());
    MapPartitionOperator<IN, OUT> outputDataSet =
        new MapPartitionOperator<>(inputDataSet, typeInformation, fnWrapper);
    // Register the result so that downstream translators can find it.
    context.setOutputDataSet(transform.getOutput(), outputDataSet);
  }
}
Combine → Reduce
§  Groups by key (locally)
§  Combines the values using a combine fn
§  Groups by key (shuffle)
§  Reduces the combined values using combine fn
1.  Create a FlinkCombineFunction to wrap
combine fn
2.  Create a FlinkReduceFunction to wrap combine
fn
3.  Create a translation using these functions in
Flink Operators
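The two-phase pattern described above (combine locally, shuffle, then reduce the partial results) can be sketched for word counting with plain Java collections. The class and method names are hypothetical, each inner list models one partition, and the shuffle is modeled simply by handing all partial maps to one method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of combine-then-reduce; no Flink classes involved.
class TwoPhaseCombineSketch {

    // Phase 1: pre-aggregate counts inside one partition (local group + combine).
    static Map<String, Long> combineLocally(List<String> partition) {
        Map<String, Long> partial = new HashMap<>();
        for (String word : partition) {
            partial.merge(word, 1L, Long::sum);
        }
        return partial;
    }

    // Phase 2: after the shuffle, merge the partial counts that share a key.
    static Map<String, Long> reducePartials(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }

    static Map<String, Long> count(List<List<String>> partitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            partials.add(combineLocally(partition));
        }
        return reducePartials(partials);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("to", "be", "to"),
            Arrays.asList("be", "or", "to"));
        System.out.println(count(partitions));
    }
}
```

The local phase shrinks each partition to at most one record per key, so far less data has to cross the network in the shuffle.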
The Flink Dataflow Runner
FlinkPipelineRunner
§  Available on GitHub
§  https://github.com/dataArtisans/flink-dataflow
§  Only batch support at the moment
§  Execution based on Flink 0.9.1
Roadmap
§  Streaming (after Flink 0.10 is out)
§  More transformations
§  Coder optimization
Supported Transforms (WIP)
Dataflow Transform                 Flink Operator
Create.Values                      FromElements
View.CreatePCollectionView         BroadCastSet
Flatten.FlattenPCollectionList     Union
GroupByKey.GroupByKeyOnly          GroupBy
ParDo.Bound                        Map
ParDo.BoundMulti                   MapWithMultipleOutput
Combine.PerKey                     Reduce
CoGroupByKey                       CoGroup
TextIO.Read.Bound                  ReadFromTextFile
TextIO.Write.Bound                 WriteToTextFile
ConsoleIO.Write.Bound              Print
AvroIO.Read.Bound                  AvroRead
AvroIO.Write.Bound                 AvroWrite
Types & Coders
§  Flink has a very efficient type serialization
system
§  Serialization is needed for sending data
over the wire or between processes
§  Flink may even work on serialized data
§  The TypeExtractor extracts the return
types of operators
§  Subsequent operators make use of this
information
Types & Coders continued
§  Coders are Dataflow serializers
§  Should we use Flink’s type serialization
system or Dataflow’s?
§  Decision: use Dataflow coders
•  Full API support (e.g. custom Coders)
•  Drawback: comparing may require serialization or
deserialization of the entire object (instead of
just the key)
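The comparison cost mentioned above can be illustrated with a toy coder. This sketch assumes the coder is an opaque encode/decode pair, so the engine cannot find the key inside the bytes; WordCount, encode, decode and compareByKey are hypothetical names and do not reflect the Dataflow Coder API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical toy coder for a (word, count) record.
class CoderComparisonSketch {

    static final class WordCount {
        final String word;
        final long count;

        WordCount(String word, long count) {
            this.word = word;
            this.count = count;
        }
    }

    // Encodes the whole record into one opaque byte array.
    static byte[] encode(WordCount value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(value.word);
            out.writeLong(value.count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes the whole record; there is no way to ask for the key alone.
    static WordCount decode(byte[] encoded) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            return new WordCount(in.readUTF(), in.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Comparing two encoded records by key means decoding both records completely.
    static int compareByKey(byte[] left, byte[] right) {
        return decode(left).word.compareTo(decode(right).word);
    }

    public static void main(String[] args) {
        byte[] be = encode(new WordCount("be", 2));
        byte[] to = encode(new WordCount("to", 3));
        System.out.println(compareByKey(be, to) < 0);
    }
}
```

A serialization system that knows the record layout could compare just the key bytes; an opaque coder gives up that option in exchange for supporting arbitrary user-defined encodings.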
Challenges & Lessons Learned
§  Dataflow’s API model is well suited for
translation into Flink
§  Efficient translations can be tricky
§  For example: WordCount went from 6 hours to
1 hour by using a combiner and better
coder type serialization
§  Implement a dedicated Combine-only
operator in Flink
How to Use the Runner
§  Instructions also on the GitHub page
https://github.com/dataArtisans/flink-dataflow
1.  Build and install flink-dataflow using
Maven
2.  Include flink-dataflow as a dependency
in your Maven project
3.  Set FlinkPipelineRunner as the runner
4.  Build a fat jar including flink-dataflow
5.  Submit to the cluster using ./bin/flink
Demo
That’s all Folks!
§  Check out the Flink Dataflow runner!
§  Write your programs once and execute
on two engines
§  Provide feedback and report issues on
GitHub
§  Experience the unified batch and
streaming platform through Dataflow
and Flink
Pratik Patel: Titanium as Platform: Feature-Rich, Database-Driven Mobile Apps
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM Mechanics
 
Visual Studio .NET2010
Visual Studio .NET2010Visual Studio .NET2010
Visual Studio .NET2010
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
 
Introduction to Go language
Introduction to Go languageIntroduction to Go language
Introduction to Go language
 


Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink

  • 1. Google Cloud Dataflow On Top of Apache Flink Maximilian Michels mxm@apache.org @stadtlegende
  • 2. Contents
    § Google Cloud Dataflow and Flink
    § The Dataflow API
    § From Dataflow to Flink
    § Translating Dataflow Map/Reduce
    § Demo
  • 3. Google Cloud Dataflow
    § Developed by Google
    § Based on the concepts of
      • FlumeJava (batch)
      • MillWheel (streaming)
    § Perfect integration into Google’s infrastructure and services
      • Google Compute Engine
      • Google Cloud Storage
      • Google BigQuery
      • Resource management
      • Monitoring
      • Optimization
  • 4. Motivation
    § Execute on the Google Cloud Platform
      • Very fast and dynamic infrastructure
      • Scale in and out as you wish
      • Make use of Google’s provided services
    § Execute using Apache Flink
      • Run your own infrastructure (avoid lock-in)
      • Control your data and software
      • Extend it using open source components
    § Wouldn’t it be great if you could choose?
      • Unified batch and streaming API
      • Similar concepts in batch and streaming
      • More options
  • 6. The Dataflow API
    PCollection: a parallel collection of records, either bounded (batch) or unbounded (streaming)
    PTransform: a transformation that can be applied to a parallel collection
    Pipeline: a data structure that holds the dataflow graph
    PipelineRunner: a parallel execution engine, e.g. DirectPipelineRunner, DataflowPipelineRunner, or FlinkPipelineRunner
  • 7. WordCount in Dataflow #1
    public static void main(String[] args) {
      DataflowPipelineOptions options = PipelineOptionsFactory.create()
          .as(DataflowPipelineOptions.class);
      options.setRunner(DataflowPipelineRunner.class);
      Pipeline p = Pipeline.create(options);
      p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
       .apply(new CountWords())
       // Simplified: TextIO.Write expects strings, so the KV pairs need a
       // formatting step first (see the FormatResults ParDo on slide 12).
       .apply(TextIO.Write.to("gs://my-bucket/wordcounts"));
      p.run();
    }
  • 8. Word Count Dataflow #2 (CountWords)
    public static class CountWords extends
        PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
      @Override
      public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
        // Convert lines of text into individual words.
        PCollection<String> words = lines.apply(
            ParDo.of(new ExtractWordsFn()));
        // Count the number of times each word occurs.
        PCollection<KV<String, Long>> wordCounts =
            words.apply(Count.perElement());
        return wordCounts;
      }
    }
  • 9. Word Count Dataflow #3 (ExtractWords)
    public static class ExtractWordsFn extends DoFn<String, String> {
      @Override
      public void processElement(ProcessContext context) {
        String[] words = context.element().split("[^a-zA-Z']+");
        for (String word : words) {
          if (!word.isEmpty()) {
            context.output(word);
          }
        }
      }
    }
  • 10. Word Count Dataflow #4 (Count)
    public static class PerElement<T> extends
        PTransform<PCollection<T>, PCollection<KV<T, Long>>> {
      @Override
      public PCollection<KV<T, Long>> apply(PCollection<T> input) {
        // Pair every element with a null value, then count per key.
        return input
            .apply(ParDo.of(new DoFn<T, KV<T, Void>>() {
              @Override
              public void processElement(ProcessContext c) {
                c.output(KV.of(c.element(), (Void) null));
              }
            }))
            .apply(Count.perKey());
      }
    }
  • 11. From Dataflow to Flink
  • 12. From Dataflow to Flink
    public class MinimalWordCount {
      public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.create()
            .as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        // Create the Pipeline object with the options we defined above.
        Pipeline p = Pipeline.create(options);
        // Apply the pipeline's transforms.
        p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
         .apply(ParDo.named("ExtractWords").of(new DoFn<String, String>() {
           private static final long serialVersionUID = 0;
           @Override
           public void processElement(ProcessContext c) {
             for (String word : c.element().split("[^a-zA-Z']+")) {
               if (!word.isEmpty()) {
                 c.output(word);
               }
             }
           }
         }))
         .apply(Count.<String>perElement())
         .apply(ParDo.named("FormatResults").of(new DoFn<KV<String, Long>, String>() {
           private static final long serialVersionUID = 0;
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element().getKey() + ": " + c.element().getValue());
           }
         }))
         .apply(TextIO.Write.to("gs://my-bucket/wordcounts"));
        // Run the pipeline.
        p.run();
      }
    }
    Dataflow          Flink
    PCollection       DataSet / DataStream
    PTransform        Operator
    Pipeline          ExecutionEnvironment
    PipelineRunner    Flink!
  • 13. The Dataflow SDK
    § Apache 2.0 licensed: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
    § Only Java (for now)
    § 1.0.0 released in June
    § Built with modularity in mind
    § Execution engine can be exchanged
    § Pipeline can be traversed by a visitor
    § Custom runners can change the translation and execution process
  • 14. A Dataflow is an AST
    [Diagram: a Dataflow Program node at the root of a tree of Transform nodes]
  • 15. The WordCount AST
    [Diagram: transform tree with the nodes RootTransform; TextIO.Read (ReadLines); CountWords; ParDo (ExtractWords); Count.PerElement; ParDo (Init); Combine.PerKey (Sum.PerKey); GroupByKey; GroupByKeyOnly; GroupedValues; ParDo; ParDo (Format Counts); TextIO.Write (WriteCounts)]
  • 16. The WordCount Dataflow
    § AST converted to Execution DAG
    [Diagram: the WordCount AST from slide 15 next to the resulting execution DAG with the nodes TextIO.Read (ReadLines); ExtractWords; GroupByKey; Combine.PerKey (Sum.PerKey); ParDo (Format Counts); TextIO.Write (WriteCounts)]
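The conversion from a nested transform tree to a flat sequence of executable steps can be pictured as a depth-first visit that keeps only the leaf transforms. The sketch below is plain Java and does not use the Dataflow SDK: the `Node` class, the `flatten` method, and the exact nesting in `wordCountAst` are assumptions made for illustration, and `Combine.PerKey` is treated as a leaf on the assumption that a runner translates it directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AstFlattenSketch {
    /** Hypothetical stand-in for a node in the transform tree. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /** Depth-first walk that collects the leaf transforms in visiting order. */
    static List<String> flatten(Node root) {
        List<String> leaves = new ArrayList<>();
        visit(root, leaves);
        return leaves;
    }

    private static void visit(Node node, List<String> leaves) {
        if (node.isLeaf()) {
            leaves.add(node.name);
            return;
        }
        for (Node child : node.children) {
            visit(child, leaves);
        }
    }

    /** Assumed nesting of the WordCount transforms, for illustration only. */
    static Node wordCountAst() {
        return new Node("RootTransform",
            new Node("TextIO.Read"),
            new Node("CountWords",
                new Node("ParDo(ExtractWords)"),
                new Node("Count.PerElement",
                    new Node("ParDo(Init)"),
                    new Node("Combine.PerKey"))),
            new Node("ParDo(FormatCounts)"),
            new Node("TextIO.Write"));
    }

    public static void main(String[] args) {
        System.out.println(flatten(wordCountAst()));
    }
}
```

Composite nodes such as `CountWords` disappear from the result; only the steps an execution engine would actually have to run remain.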
  • 20. Implement a translation
    1. Find out which transform to translate
      • ParDo.Bound
      • Combine.PerKey
    2. Implement TransformTranslator
      • ParDoTranslator
      • CombineTranslator
    3. Register TransformTranslator
      • Translators.add(ParDo, ParDoTranslator)
      • Translators.add(Combine, CombineTranslator)
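The three steps above amount to a lookup table from transform type to translator. A minimal sketch in plain Java follows; it does not use the Dataflow or Flink APIs, and every name in it (`ParDoTransform`, `CombineTransform`, `TransformTranslator`, `register`, `translate`, and the operator strings) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class TranslatorRegistrySketch {
    /** Hypothetical stand-ins for the transforms to translate (step 1). */
    static final class ParDoTransform {
        final String name;
        ParDoTransform(String name) { this.name = name; }
    }

    static final class CombineTransform {
        final String name;
        CombineTransform(String name) { this.name = name; }
    }

    /** Hypothetical translator interface (step 2): one transform in, one operator description out. */
    interface TransformTranslator<T> {
        String translate(T transform);
    }

    private static final Map<Class<?>, TransformTranslator<?>> TRANSLATORS = new HashMap<>();

    /** Step 3: register a translator under the transform's class. */
    static <T> void register(Class<T> type, TransformTranslator<T> translator) {
        TRANSLATORS.put(type, translator);
    }

    /** Looks up the translator by the runtime class of the transform. */
    @SuppressWarnings("unchecked")
    static <T> String translate(T transform) {
        TransformTranslator<T> translator =
            (TransformTranslator<T>) TRANSLATORS.get(transform.getClass());
        if (translator == null) {
            throw new UnsupportedOperationException(
                "No translator for " + transform.getClass().getSimpleName());
        }
        return translator.translate(transform);
    }

    static {
        register(ParDoTransform.class, p -> "MapPartition[" + p.name + "]");
        register(CombineTransform.class, c -> "GroupReduce[" + c.name + "]");
    }

    public static void main(String[] args) {
        System.out.println(translate(new ParDoTransform("ExtractWords")));
        System.out.println(translate(new CombineTransform("Sum.PerKey")));
    }
}
```

Keying the table by class keeps the traversal code independent of the set of supported transforms: an unsupported transform fails with a clear error instead of being silently skipped.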
  • 21. ParDo → Map
    § ParDo carries a DoFn, which performs the map and contains the user code
    1. Create a FlinkDoFnFunction which wraps a DoFn function
    2. Create a translation that uses this function as the function of Flink’s MapOperator
  • 22. Step 1: ParDo → Map
    public class FlinkDoFnFunction<IN, OUT> extends RichMapPartitionFunction<IN, OUT> {
      private final DoFn<IN, OUT> doFn;
      public FlinkDoFnFunction(DoFn<IN, OUT> doFn) {
        this.doFn = doFn;
      }
      @Override
      public void mapPartition(Iterable<IN> values, Collector<OUT> out) {
        // Simplified for the slide: DoFn.processElement takes a ProcessContext (see slide 9),
        // so a working wrapper has to supply a context that forwards output to `out`.
        for (IN value : values) {
          doFn.processElement(value);
        }
      }
    }
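The slide's mapPartition passes each value straight to processElement, while the DoFn on slide 9 reads its input from, and writes its output to, a ProcessContext. The plain-Java sketch below shows one way such a bridge could look. It does not use the Dataflow or Flink classes: `Context`, `SimpleDoFn`, and the `mapPartition` helper are hypothetical, reduced stand-ins, and a `Consumer` plays the role of Flink's collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class DoFnWrapperSketch {
    /** Hypothetical, reduced per-element context: one input element, any number of outputs. */
    interface Context<IN, OUT> {
        IN element();
        void output(OUT value);
    }

    /** Hypothetical, reduced DoFn that only knows about the context. */
    interface SimpleDoFn<IN, OUT> {
        void processElement(Context<IN, OUT> c);
    }

    /** Iterates one partition and forwards everything the function emits to the collector. */
    static <IN, OUT> void mapPartition(
            SimpleDoFn<IN, OUT> fn, Iterable<IN> values, Consumer<OUT> collector) {
        for (final IN value : values) {
            fn.processElement(new Context<IN, OUT>() {
                @Override
                public IN element() {
                    return value;
                }

                @Override
                public void output(OUT out) {
                    collector.accept(out);
                }
            });
        }
    }

    /** Same splitting logic as ExtractWordsFn, run through the wrapper. */
    static List<String> extractWords(List<String> lines) {
        SimpleDoFn<String, String> extract = c -> {
            for (String word : c.element().split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        };
        List<String> words = new ArrayList<>();
        mapPartition(extract, lines, words::add);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords(Arrays.asList("to be, or not", "to be")));
    }
}
```

Because the function emits through the context rather than returning a value, one input element can produce zero, one, or many outputs, which is why a partition-level (or flat-map style) operator fits better than a strict one-to-one map.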
  • 23. Step 2: ParDo → Map
    private static class ParDoBoundTranslator<IN, OUT>
        implements FlinkPipelineTranslator.TransformTranslator<ParDo.Bound<IN, OUT>> {
      @Override
      public void translateNode(ParDo.Bound<IN, OUT> transform, TranslationContext context) {
        DataSet<IN> inputDataSet = context.getInputDataSet(transform.getInput());
        final DoFn<IN, OUT> doFn = transform.getFn();
        TypeInformation<OUT> typeInformation = context.getTypeInfo(transform.getOutput());
        FlinkDoFnFunction<IN, OUT> fnWrapper =
            new FlinkDoFnFunction<>(doFn, context.getPipelineOptions());
        MapPartitionOperator<IN, OUT> outputDataSet =
            new MapPartitionOperator<>(inputDataSet, typeInformation, fnWrapper);
        context.setOutputDataSet(transform.getOutput(), outputDataSet);
      }
    }
  • 24. Combine → Reduce
    § Groups by key (locally)
    § Combines the values using a combine fn
    § Groups by key (shuffle)
    § Reduces the combined values using combine fn
    1. Create a FlinkCombineFunction to wrap combine fn
    2. Create a FlinkReduceFunction to wrap combine fn
    3. Create a translation using these functions in Flink Operators
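The two-phase pattern listed above (combine locally, shuffle, then reduce the partial results with the same function) can be sketched in plain Java for the word-count case. Nothing here uses Flink: the partitions are ordinary lists, the shuffle is implied by merging all partial maps, and the method names `combineLocally`, `reducePartials`, and `count` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombineSketch {
    /** Phase 1: pre-aggregate within one partition, before any data is shuffled. */
    static Map<String, Long> combineLocally(List<String> partition) {
        Map<String, Long> partial = new HashMap<>();
        for (String word : partition) {
            partial.merge(word, 1L, Long::sum);
        }
        return partial;
    }

    /** Phase 2: after grouping by key, merge the partial counts with the same sum function. */
    static Map<String, Long> reducePartials(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }

    /** Runs both phases over a list of partitions. */
    static Map<String, Long> count(List<List<String>> partitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            partials.add(combineLocally(partition));
        }
        return reducePartials(partials);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c"));
        System.out.println(count(partitions));
    }
}
```

The benefit of the first phase is that each partition ships at most one record per distinct key instead of one record per occurrence, which only works because the combine function (here, a sum) is associative and commutative.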
  • 25. The Flink Dataflow Runner
  • 26. FlinkPipelineRunner
    § Available on GitHub
    § https://github.com/dataArtisans/flink-dataflow
    § Only batch support at the moment
    § Execution based on Flink 0.9.1
    Roadmap
    § Streaming (after Flink 0.10 is out)
    § More transformations
    § Coder optimization
  • 27. Supported Transforms (WIP)
    Dataflow Transform                 Flink Operator
    Create.Values                      FromElements
    View.CreatePCollectionView         BroadCastSet
    Flatten.FlattenPCollectionList     Union
    GroupByKey.GroupByKeyOnly          GroupBy
    ParDo.Bound                        Map
    ParDo.BoundMulti                   MapWithMultipleOutput
    Combine.PerKey                     Reduce
    CoGroupByKey                       CoGroup
    TextIO.Read.Bound                  ReadFromTextFile
    TextIO.Write.Bound                 WriteToTextFile
    ConsoleIO.Write.Bound              Print
    AvroIO.Read.Bound                  AvroRead
    AvroIO.Write.Bound                 AvroWrite
  • 28. Types & Coders
    § Flink has a very efficient type serialization system
    § Serialization is needed for sending data over the wire or between processes
    § Flink may even work on serialized data
    § The TypeExtractor extracts the return types of operators
    § Subsequent operators make use of this information
  • 29. Types & Coders continued
    § Coders are Dataflow serializers
    § Should we use Flink’s type serialization system or Dataflow’s?
    § Decision: use Dataflow coders
      • Benefit: full API support (e.g. custom Coders)
      • Cost: comparing may require serialization or deserialization of the entire object (instead of just the key)
  • 30. Challenges & Lessons Learned
    § Dataflow’s API model is well suited for translation into Flink
    § Efficient translations can be tricky
    § For example: WordCount went from 6 hours to 1 hour by using a combiner and better coder type serialization
    § Implement a dedicated Combine-only operator in Flink
  • 31. How To Use the Runner
    § Instructions are also on the GitHub page: https://github.com/dataArtisans/flink-dataflow
    1. Build and install flink-dataflow using Maven
    2. Include flink-dataflow as a dependency in your Maven project
    3. Set FlinkPipelineRunner as the runner
    4. Build a fat jar including flink-dataflow
    5. Submit to the cluster using ./bin/flink
  • 33. That’s all Folks!
    § Check out the Flink Dataflow runner!
    § Write your programs once and execute on two engines
    § Provide feedback and report issues on GitHub
    § Experience the unified batch and streaming platform through Dataflow and Flink