SlideShare a Scribd company logo
Cloud Dataflow
A Google Service and SDK
The Presenter
Alex Van Boxel, Software Architect, software engineer, devops, tester,...
at Vente-Exclusive.com
Introduction
Twitter @alexvb
Plus +AlexVanBoxel
E-mail alex@vanboxel.be
Web http://alex.vanboxel.be
Dataflow
what, how, where?
Dataflow is...
A set of SDK that define the
programming model that
you use to build your
streaming and batch
processing pipeline (*)
Google Cloud Dataflow is a
fully managed service that
will run and optimize your
pipeline
Dataflow where...
ETL
● Move
● Filter
● Enrich
Analytics
● Streaming Compu
● Batch Compu
● Machine Learning
Composition
● Connecting in Pub/Sub
○ Could be other dataflows
○ Could be something else
NoOps Model
● Resources allocated on demand
● Lifecycle managed by the service
● Optimization
● Rebalancing
Cloud Dataflow Benefits
Unified Programming Model
Unified: Streaming and Batch
Open Sourced
● Java 8 implementation
● Python in the works
Cloud Dataflow Benefits
Unified Programming Model (streaming)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Unified Programming Model (stream and backup)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Unified Programming Model (batch)
Cloud Dataflow Benefits
DataFlow
(streaming)
DataFlow (batch)
BigQuery
Beam
Apache Incubation
DataFlow SDK becomes Apache Beam
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Pipeline first, runtime second
○ focus your data pipelines, not the characteristics runner
● Portability
○ portable across a number of runtime engines
○ runtime based on any number of considerations
■ performance
■ cost
■ scalability
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Unified model
○ Batch and streaming are integrated into a unified model
○ Powerful semantics, such as windowing, ordering and triggering
● Development tooling
○ tools you need to create portable data pipelines quickly and easily
using open-source languages, libraries and tools.
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Scala Implementation
● Alternative Runners
○ Spark Runner from Cloudera Labs
○ Flink Runner from data Artisans
Cloud Dataflow Benefits
DataFlow SDK becomes Apache Beam
● Cloudera
● data Artisans
● Talend
● Cask
● PayPal
Cloud Dataflow Benefits
SDK Primitives
Autopsy of a dataflow pipeline
Pipeline
● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, …
● Transforms: Filter, Join, Aggregate, …
● Windows: Sliding, Sessions, Fixed, …
● Triggers: Correctness….
● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ...
Logical Model
PCollection<T>
● Immutable collection
● Could be bound or unbound
● Created by
○ a backing data store
○ a transformation from other PCollection
○ generated
● PCollection<KV<K,V>>
Dataflow SDK Primitives
ParDo<I,O>
● Parallel Processing of each element in the PCollection
● Developer provide DoFn<I,O>
● Shards processed independent
of each other
Dataflow SDK Primitives
ParDo<I,O>
Interesting concept of side outputs: allows for data sharing
Dataflow SDK Primitives
PTransform
● Combine small operations into bigger transforms
● Standard transform (COUNT, SUM,...)
● Reuse and Monitoring
Dataflow SDK Primitives
public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> {
@Override
public PCollection<Event> apply(PCollection<Legacy> input) {
return input.apply(ParDo.of(new TimestampEvent1Fn()))
.apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class))
.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
.apply(ParDo.of(new ExtactChannelUserFn()))
}
}
Merging
● GroupByKey
○ Takes a PCollection of KV<K,V>
and groups them
○ Must be keyed
● Flatten
○ Join with same type
Dataflow SDK Primitives
Input/Output
Dataflow SDK Primitives
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
// streamData is Unbounded; apply windowing afterward.
PCollection<String> streamData =
p.apply(PubsubIO.Read.named("ReadFromPubsub")
.topic("/topics/my-topic"));
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> weatherData = p
.apply(BigQueryIO.Read.named("ReadWeatherStations")
.from("clouddataflow-readonly:samples.weather_stations"));
● Multiple Inputs + Outputs
Input/Output
Dataflow SDK Primitives
Primitives on steroids
Autopsy of a dataflow pipeline
Windows
● Split unbounded collections (ex. Pub/Sub)
● Works on bounded collection as well
Dataflow SDK Primitives
.apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20))))
Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
Triggers
● Time-based
● Data-based
● Composite
Dataflow SDK Primitives
AfterEach.inOrder(
AfterWatermark.pastEndOfWindow(),
Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(10)))
.orFinally(AfterWatermark
.pastEndOfWindow()
.plusDelayOf(Duration.standardDays(2))));
14:22Event time 14:21
Input/Output
Dataflow SDK Primitives
14:21 14:22
14:21:25Processing time
Event time
14:2314:22
Watermark
14:21:20 14:21:40
Early trigger Late trigger Late trigger
14:25
14:22Event time 14:21
Input/Output
Dataflow SDK Primitives
14:21 14:22
14:21:25Processing time
Event time
14:2314:22
Watermark
14:21:20 14:21:40
Early trigger Late trigger Late trigger
14:25
Google Cloud Dataflow
The service
What you need
● Google Cloud SDK
○ utilities to authenticate and manage project
● Java IDE
○ Eclipse
○ IntelliJ
● Standard Build System
○ Maven
○ Gradle
Dataflow SDK Primitives
What you need
Dataflow SDK Primitives
subprojects {
apply plugin: 'java'
repositories {
maven {
url "http://jcenter.bintray.com"
}
mavenLocal()
}
dependencies {
compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0'
testCompile 'org.hamcrest:hamcrest-all:1.3'
testCompile 'org.assertj:assertj-core:3.0.0'
testCompile 'junit:junit:4.12'
}
}
How it work
Google Cloud Dataflow Service
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
.apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+")))
.withOutputType(new TypeDescriptor<String>() {}))
.apply(Filter.byPredicate((String word) -> !word.isEmpty()))
.apply(Count.<String>perElement())
.apply(MapElements
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())
.withOutputType(new TypeDescriptor<String>() {}))
.apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
Service
● Stage
● Optimize (Flume)
● Execute
Google Cloud Dataflow Service
Optimization
Google Cloud Dataflow Service
Optimization
Google Cloud Dataflow Service
Console
Google Cloud Dataflow Service
Lifecycle
Google Cloud Dataflow Service
Compute Worker Scaling
Google Cloud Dataflow Service
Compute Worker Rescaling
Google Cloud Dataflow Service
Compute Worker Rebalancing
Google Cloud Dataflow Service
Compute Worker Rescaling
Google Cloud Dataflow Service
Lifecycle
Google Cloud Dataflow Service
BigQuery
Google Cloud Dataflow Service
Testability
Test Driven Development Build-In
Testability - Built in
Dataflow Testability
KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>());
input.getValue().add(CalEvent.event("SESSION", "REFERRAL")
.build());
input.getValue().add(CalEvent.event("SESSION", "START")
.data("{"referral":{"Sourc...}").build());
DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester = DoFnTester.of(new
SessionRepairFn());
fnTester.processElement(input);
List<CalEvent> calEvents = fnTester.peekOutputElements();
assertThat(calEvents).hasSize(2);
CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL");
assertNotNull("REFERRAL should still exist", referral);
CalEvent start = CalUtil.findEvent(calEvents, "START");
assertNotNull("START should still exist", start);
assertNull("START should have referral removed.",
Appendix
Follow the yellow brick road
Paper - Flume Java
FlumeJava: Easy, Efficient Data-Parallel Pipelines
http://pages.cs.wisc.edu/~akella/CS838/F12/838-
CloudPapers/FlumeJava.pdf
More Dataflow Resources
DataFlow/Bean vs Spark
Dataflow/Beam & Spark: A Programming Model
Comparison
https://cloud.google.com/dataflow/blog/dataflow-beam-
and-spark-comparison
More Dataflow Resources
Github - Integrate Dataflow with Luigi
Google Cloud integration for Luigi
https://github.com/alexvanboxel/luigiext-gcloud
More Dataflow Resources
Questions and Answers
The last slide
Twitter @alexvb
Plus +AlexVanBoxel
E-mail alex@vanboxel.be
Web http://alex.vanboxel.be

More Related Content

What's hot

Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
Lynn Langit
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Pub/Sub Messaging
Pub/Sub MessagingPub/Sub Messaging
Pub/Sub Messaging
Peter Hanzlik
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
ScyllaDB
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
Matillion
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?
Yaroslav Tkachenko
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
Martin Traverso
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
Philip Fisher-Ogden
 
Cloud Pub_Sub
Cloud Pub_SubCloud Pub_Sub
Cloud Pub_Sub
Knoldus Inc.
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQuery
Pradeep Bhadani
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 
Serverless with Google Cloud
Serverless with Google CloudServerless with Google Cloud
Serverless with Google Cloud
Bret McGowen - NYC Google Developer Advocate
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 

What's hot (20)

Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Pub/Sub Messaging
Pub/Sub MessagingPub/Sub Messaging
Pub/Sub Messaging
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
State of the Trino Project
State of the Trino ProjectState of the Trino Project
State of the Trino Project
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Cloud Pub_Sub
Cloud Pub_SubCloud Pub_Sub
Cloud Pub_Sub
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQuery
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Serverless with Google Cloud
Serverless with Google CloudServerless with Google Cloud
Serverless with Google Cloud
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 

Viewers also liked

Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
Matthias Feys
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
Ido Green
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.com
Alex Van Boxel
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
Dharmesh Vaya
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
Márton Kodok
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
Data Con LA
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
Ryuji Tamagawa
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
Neville Li
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushiGoogle Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Platform - Japan
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Big query
Big queryBig query
Big query
Tanvi Parikh
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Cloudera, Inc.
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
C4Media
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Chris Schalk
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
Gabriel Hamilton
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data win
Ken Taylor
 
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawakiGoogle Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
javier ramirez
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
DoiT International
 

Viewers also liked (20)

Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.com
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of GoogleAn indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushiGoogle Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushi
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Big query
Big queryBig query
Big query
 
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
Hadoop World 2011: Completing the Big Data Picture Understanding Why and Not ...
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
Google and big query
Google and big queryGoogle and big query
Google and big query
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
 
BigQuery for the Big Data win
BigQuery for the Big Data winBigQuery for the Big Data win
BigQuery for the Big Data win
 
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawakiGoogle Analytics and BigQuery, by Javier Ramirez, from datawaki
Google Analytics and BigQuery, by Javier Ramirez, from datawaki
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 

Similar to Google Cloud Dataflow

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
DSDT_MTL
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Kaxil Naik
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試
Simon Su
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
Lucas Jellema
 
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
Getting value from IoT, Integration and Data Analytics
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
wesley chun
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
Digital Vidya
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
tdc-globalcode
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
Bogdan Kyryliuk
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
GoDataDriven
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin
HanLing Shen
 
"Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou..."Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou...
Viktor Turskyi
 
Quick prototyping with VulcanJS
Quick prototyping with VulcanJSQuick prototyping with VulcanJS
Quick prototyping with VulcanJS
Melek Hakim
 
Data Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data EngineeringData Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data Engineering
Anant Corporation
 
Encode Club workshop slides
Encode Club workshop slidesEncode Club workshop slides
Encode Club workshop slides
Vanessa Lošić
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
Lightbend
 

Similar to Google Cloud Dataflow (20)

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
 
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Structured Streaming in Spark
Structured Streaming in SparkStructured Streaming in Spark
Structured Streaming in Spark
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Scalable Clusters On Demand
Scalable Clusters On DemandScalable Clusters On Demand
Scalable Clusters On Demand
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin[20200720]cloud native develoment - Nelson Lin
[20200720]cloud native develoment - Nelson Lin
 
"Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou..."Offline mode for a mobile application, redux on server and a little bit abou...
"Offline mode for a mobile application, redux on server and a little bit abou...
 
Quick prototyping with VulcanJS
Quick prototyping with VulcanJSQuick prototyping with VulcanJS
Quick prototyping with VulcanJS
 
Data Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data EngineeringData Engineer's Lunch #50: Airbyte for Data Engineering
Data Engineer's Lunch #50: Airbyte for Data Engineering
 
Encode Club workshop slides
Encode Club workshop slidesEncode Club workshop slides
Encode Club workshop slides
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
 

Recently uploaded

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 

Recently uploaded (20)

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 

Google Cloud Dataflow

  • 1. Cloud Dataflow A Google Service and SDK
  • 2. The Presenter Alex Van Boxel, Software Architect, software engineer, devops, tester,... at Vente-Exclusive.com Introduction Twitter @alexvb Plus +AlexVanBoxel E-mail alex@vanboxel.be Web http://alex.vanboxel.be
  • 4. Dataflow is... A set of SDK that define the programming model that you use to build your streaming and batch processing pipeline (*) Google Cloud Dataflow is a fully managed service that will run and optimize your pipeline
  • 5. Dataflow where... ETL ● Move ● Filter ● Enrich Analytics ● Streaming Compu ● Batch Compu ● Machine Learning Composition ● Connecting in Pub/Sub ○ Could be other dataflows ○ Could be something else
  • 6. NoOps Model ● Resources allocated on demand ● Lifecycle managed by the service ● Optimization ● Rebalancing Cloud Dataflow Benefits
  • 7. Unified Programming Model Unified: Streaming and Batch Open Sourced ● Java 8 implementation ● Python in the works Cloud Dataflow Benefits
  • 8. Unified Programming Model (streaming) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 9. Unified Programming Model (stream and backup) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 10. Unified Programming Model (batch) Cloud Dataflow Benefits DataFlow (streaming) DataFlow (batch) BigQuery
  • 12. DataFlow SDK becomes Apache Beam Cloud Dataflow Benefits
  • 13. DataFlow SDK becomes Apache Beam ● Pipeline first, runtime second ○ focus your data pipelines, not the characteristics runner ● Portability ○ portable across a number of runtime engines ○ runtime based on any number of considerations ■ performance ■ cost ■ scalability Cloud Dataflow Benefits
  • 14. DataFlow SDK becomes Apache Beam ● Unified model ○ Batch and streaming are integrated into a unified model ○ Powerful semantics, such as windowing, ordering and triggering ● Development tooling ○ tools you need to create portable data pipelines quickly and easily using open-source languages, libraries and tools. Cloud Dataflow Benefits
  • 15. DataFlow SDK becomes Apache Beam ● Scala Implementation ● Alternative Runners ○ Spark Runner from Cloudera Labs ○ Flink Runner from data Artisans Cloud Dataflow Benefits
  • 16. DataFlow SDK becomes Apache Beam ● Cloudera ● data Artisans ● Talend ● Cask ● PayPal Cloud Dataflow Benefits
  • 17. SDK Primitives Autopsy of a dataflow pipeline
  • 18. Pipeline ● Inputs: Pub/Sub, BigQuery, GCS, XML, JSON, … ● Transforms: Filter, Join, Aggregate, … ● Windows: Sliding, Sessions, Fixed, … ● Triggers: Correctness…. ● Outputs: Pub/Sub, BigQuery, GCS, XML, JSON, ... Logical Model
  • 19. PCollection<T> ● Immutable collection ● Could be bound or unbound ● Created by ○ a backing data store ○ a transformation from other PCollection ○ generated ● PCollection<KV<K,V>> Dataflow SDK Primitives
  • 20. ParDo<I,O> ● Parallel Processing of each element in the PCollection ● Developer provide DoFn<I,O> ● Shards processed independent of each other Dataflow SDK Primitives
  • 21. ParDo<I,O> Interesting concept of side outputs: allows for data sharing Dataflow SDK Primitives
  • 22. PTransform ● Combine small operations into bigger transforms ● Standard transform (COUNT, SUM,...) ● Reuse and Monitoring Dataflow SDK Primitives public class BlockTransform extends PTransform<PCollection<Legacy>, PCollection<Event>> { @Override public PCollection<Event> apply(PCollection<Legacy> input) { return input.apply(ParDo.of(new TimestampEvent1Fn())) .apply(ParDo.of(new FromLegacy2EventFn())).setCoder(AvroCoder.of(Event.class)) .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20)))) .apply(ParDo.of(new ExtactChannelUserFn())) } }
  • 23. Merging ● GroupByKey ○ Takes a PCollection of KV<K,V> and groups them ○ Must be keyed ● Flatten ○ Join with same type Dataflow SDK Primitives
  • 24. Input/Output Dataflow SDK Primitives PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); // streamData is Unbounded; apply windowing afterward. PCollection<String> streamData = p.apply(PubsubIO.Read.named("ReadFromPubsub") .topic("/topics/my-topic")); PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); PCollection<TableRow> weatherData = p .apply(BigQueryIO.Read.named("ReadWeatherStations") .from("clouddataflow-readonly:samples.weather_stations"));
  • 25. ● Multiple Inputs + Outputs Input/Output Dataflow SDK Primitives
  • 26. Primitives on steroids Autopsy of a dataflow pipeline
  • 27. Windows ● Split unbounded collections (ex. Pub/Sub) ● Works on bounded collection as well Dataflow SDK Primitives .apply(Window.<Event>into(Sessions.withGapDuration(Duration.standardMinutes(20)))) Window.<CAL>into(SlidingWindows.of(Duration.standardMinutes(5)));
  • 28. Triggers ● Time-based ● Data-based ● Composite Dataflow SDK Primitives AfterEach.inOrder( AfterWatermark.pastEndOfWindow(), Repeatedly.forever(AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(10))) .orFinally(AfterWatermark .pastEndOfWindow() .plusDelayOf(Duration.standardDays(2))));
  • 29. 14:22Event time 14:21 Input/Output Dataflow SDK Primitives 14:21 14:22 14:21:25Processing time Event time 14:2314:22 Watermark 14:21:20 14:21:40 Early trigger Late trigger Late trigger 14:25
  • 30. 14:22Event time 14:21 Input/Output Dataflow SDK Primitives 14:21 14:22 14:21:25Processing time Event time 14:2314:22 Watermark 14:21:20 14:21:40 Early trigger Late trigger Late trigger 14:25
  • 32. What you need ● Google Cloud SDK ○ utilities to authenticate and manage project ● Java IDE ○ Eclipse ○ IntelliJ ● Standard Build System ○ Maven ○ Gradle Dataflow SDK Primitives
  • 33. What you need Dataflow SDK Primitives subprojects { apply plugin: 'java' repositories { maven { url "http://jcenter.bintray.com" } mavenLocal() } dependencies { compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.3.0' testCompile 'org.hamcrest:hamcrest-all:1.3' testCompile 'org.assertj:assertj-core:3.0.0' testCompile 'junit:junit:4.12' } }
  • 34. How it work Google Cloud Dataflow Service Pipeline p = Pipeline.create(options); p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*")) .apply(FlatMapElements.via((String word) -> Arrays.asList(word.split("[^a-zA-Z']+"))) .withOutputType(new TypeDescriptor<String>() {})) .apply(Filter.byPredicate((String word) -> !word.isEmpty())) .apply(Count.<String>perElement()) .apply(MapElements .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()) .withOutputType(new TypeDescriptor<String>() {})) .apply(TextIO.Write.to("gs://YOUR_OUTPUT_BUCKET/AND_OUTPUT_PREFIX"));
  • 35. Service ● Stage ● Optimize (Flume) ● Execute Google Cloud Dataflow Service
  • 40. Compute Worker Scaling Google Cloud Dataflow Service
  • 41. Compute Worker Rescaling Google Cloud Dataflow Service
  • 42. Compute Worker Rebalancing Google Cloud Dataflow Service
  • 43. Compute Worker Rescaling Google Cloud Dataflow Service
  • 47. Testability - Built in Dataflow Testability KV<String, List<CalEvent>> input = KV.of("", new ArrayList<>()); input.getValue().add(CalEvent.event("SESSION", "REFERRAL") .build()); input.getValue().add(CalEvent.event("SESSION", "START") .data("{"referral":{"Sourc...}").build()); DoFnTester<KV<?, List<CalEvent>>, CalEvent> fnTester = DoFnTester.of(new SessionRepairFn()); fnTester.processElement(input); List<CalEvent> calEvents = fnTester.peekOutputElements(); assertThat(calEvents).hasSize(2); CalEvent referral = CalUtil.findEvent(calEvents, "REFERRAL"); assertNotNull("REFERRAL should still exist", referral); CalEvent start = CalUtil.findEvent(calEvents, "START"); assertNotNull("START should still exist", start); assertNull("START should have referral removed.",
  • 49. Paper - Flume Java FlumeJava: Easy, Efficient Data-Parallel Pipelines http://pages.cs.wisc.edu/~akella/CS838/F12/838- CloudPapers/FlumeJava.pdf More Dataflow Resources
  • 50. DataFlow/Bean vs Spark Dataflow/Beam & Spark: A Programming Model Comparison https://cloud.google.com/dataflow/blog/dataflow-beam- and-spark-comparison More Dataflow Resources
  • 51. Github - Integrate Dataflow with Luigi Google Cloud integration for Luigi https://github.com/alexvanboxel/luigiext-gcloud More Dataflow Resources
  • 52. Questions and Answers The last slide Twitter @alexvb Plus +AlexVanBoxel E-mail alex@vanboxel.be Web http://alex.vanboxel.be