How to build an
ETL pipeline
with Apache Beam on
Google Cloud Dataflow
Lucas Arruda
lucas@ciandt.com
ciandt.com
Table of Contents
● Apache Beam
○ Concepts
○ Bounded data (Batch)
○ Unbounded data (Streaming)
○ Runners / SDKs
● Big Data Landscape
● Google Cloud Dataflow
○ Capabilities of a managed service
○ Integration with other GCP products
○ Architecture Diagram Samples
○ GCP Trial + Always Free
Introducing myself
Lucas Arruda
Software Architect @ CI&T
Specialist on Google Cloud Platform
● Google Cloud Professional - Cloud Architect Certified
● Google Developer Expert (GDE)
● Google Authorized Trainer
○ trained 200+ people so far
○ Member of Google Cloud Insiders
● Google Qualified Developer in: Compute Engine, App Engine, Cloud Storage, Cloud SQL, BigQuery
● Projects for Google and Fortune 200 companies
Big Data Isn't That Important:
It's The Small Insights That
Will Change Everything
- forbes.com
The real value of data and analytics lies in their ability to deliver rich insights.
- localytics.com
The Evolution of Apache Beam
MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel
→ Apache Beam / Google Cloud Dataflow
What is Apache Beam?
Unified Programming Model for Batch & Streaming
● Clearly separates data processing logic from runtime requirements
● Expresses data-parallel Batch & Streaming algorithms using one unified API
● Supports execution on multiple distributed processing runtime environments
Beam Model Concepts
Beam Concepts
PCollections
The PCollection abstraction represents a potentially distributed, multi-element data set.
It is immutable and can be either bounded (size is fixed/known) or unbounded (unlimited size).

public static void main(String[] args) {
  // Create the pipeline.
  PipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).create();
  Pipeline p = Pipeline.create(options);

  // Create the PCollection 'lines' by applying a 'Read' transform.
  PCollection<String> lines = p.apply(
      "ReadMyFile", TextIO.read().from("protocol://path/to/some/inputData.txt"));
}
Beam Concepts
PTransforms
Transforms are the operations in your pipeline.
A transform takes a PCollection (or more than one PCollection) as input, performs an operation
that you specify on each element in that collection, and produces a new output PCollection.
[Output PCollection] = [Input PCollection].apply([Transform])
The Beam SDK provides the following core transforms:
● ParDo
● GroupByKey
● Combine
● Flatten and Partition
[Output PCollection] = [Input PCollection].apply([First Transform])
.apply([Second Transform])
.apply([Third Transform])
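The fluent [Input PCollection].apply([Transform]) chaining above can be mimicked in a few lines of plain Java. This is an illustrative sketch only — MiniCollection is a made-up stand-in, not the Beam API:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a PCollection: holds elements and chains
// transforms via apply(), echoing Beam's fluent style.
class MiniCollection<T> {
    private final List<T> elements;

    MiniCollection(List<T> elements) {
        this.elements = elements;
    }

    // Each apply() produces a NEW collection; the input is never
    // mutated, mirroring PCollection immutability.
    <O> MiniCollection<O> apply(Function<T, O> transform) {
        return new MiniCollection<>(
            elements.stream().map(transform).collect(Collectors.toList()));
    }

    List<T> get() {
        return elements;
    }
}
```

For example, `new MiniCollection<>(List.of("cat", "horse")).apply(String::length)` yields a new collection holding the word lengths, just as chaining a second and third `.apply()` would keep producing fresh collections.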
Beam Concepts - Core PTransforms
ParDo
// Apply a ParDo to the PCollection.
PCollection<Integer> wordLengths = words.apply(
    ParDo.of(new ComputeWordLengthFn()));

// Implementation of a DoFn.
static class ComputeWordLengthFn extends DoFn<String, Integer> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Get the input element.
    String word = c.element();
    // Emit the output element.
    c.output(word.length());
  }
}
Combine/Flatten
// Sum.SumIntegerFn() combines the elements in the input PCollection.
PCollection<Integer> pc = ...;
PCollection<Integer> sum = pc.apply(
    Combine.globally(new Sum.SumIntegerFn()));

// Flatten a PCollectionList of PCollection objects of a given type.
PCollection<String> pc1 = ...;
PCollection<String> pc2 = ...;
PCollection<String> pc3 = ...;
PCollectionList<String> collections =
    PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> merged =
    collections.apply(Flatten.<String>pCollections());
GroupByKey
Input (key, value) pairs:
cat, 1   dog, 5   and, 1   jump, 3   tree, 2
cat, 5   dog, 2   and, 2   cat, 9   and, 6   ...
Output after GroupByKey:
cat, [1,5,9]
dog, [5,2]
and, [1,2,6]
jump, [3]
tree, [2]
...
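What GroupByKey does logically — collecting every value that shares a key into one list per key — can be sketched with plain Java collections. GroupByKeyDemo is a hypothetical helper for illustration; a real runner performs this as a distributed shuffle, not an in-memory loop:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of GroupByKey semantics: turn a multimap of (key, value)
// pairs into one (key, [values]) entry per distinct key.
class GroupByKeyDemo {
    static Map<String, List<Integer>> groupByKey(
            List<Map.Entry<String, Integer>> pairs) {
        // LinkedHashMap keeps first-seen key order for readability;
        // Beam itself makes no ordering guarantee.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }
}
```

Feeding in the pairs from the slide, e.g. (cat, 1), (dog, 5), (cat, 5), (dog, 2), (cat, 9), produces cat → [1, 5, 9] and dog → [5, 2], matching the grouped output shown above.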
Beam Concepts
Pipeline Options
Use the pipeline options to configure different aspects of your pipeline, such as the pipeline
runner that will execute your pipeline, any runner-specific configuration or even provide input to
dynamically apply your data transformations.
--<option>=<value>
Custom Options
public interface MyOptions extends PipelineOptions {
  @Description("My custom command line argument.")
  @Default.String("DEFAULT")
  String getMyCustomOption();
  void setMyCustomOption(String myCustomOption);
}

MyOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Beam Concepts
Pipeline I/O
Beam provides read and write transforms for a number of common data storage types,
including file-based, messaging and database types. Google Cloud Platform is also supported.
File-based reading
p.apply("ReadFromText",
    TextIO.read().from("protocol://my_bucket/path/to/input-*.csv"));

File-based writing
records.apply("WriteToText",
    TextIO.write().to("protocol://my_bucket/path/to/numbers")
        .withSuffix(".csv"));

GCP sources and sinks: BigQuery, Cloud Pub/Sub, Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud Spanner
Bounded Data
Batch
Batch Processing on Bounded Data
Image: Tyler Akidau.
A finite pool of unstructured data on the left is run through a data processing
engine, resulting in corresponding structured data on the right
Batch Processing on Bounded Data
Image: Tyler Akidau.
Unbounded Data
Streaming
Streaming Processing on Unbounded Data
Windowing into fixed windows by event time.
Windowing into fixed windows by processing time.
If you care about correctness in a system that also cares about time, you need to use event-time windowing.
➔ What results are calculated?
➔ Where in event time are results calculated?
➔ When in processing time are results materialized?
➔ How do refinements of results relate?
What results are calculated?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
Where in Event Time are results calculated?
Where in Event Time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
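The window assignment that FixedWindows.of(Duration.standardMinutes(2)) performs can be sketched in plain Java: an element's event timestamp is rounded down to the start of the 2-minute window that contains it. FixedWindowDemo is a hypothetical illustration of the model, not Beam internals:

```java
// Sketch of fixed event-time windowing: each element belongs to the
// 2-minute window containing its event timestamp. The window start is
// the timestamp rounded down to a multiple of the window size.
class FixedWindowDemo {
    static final long WINDOW_MILLIS = 2 * 60 * 1000; // 2 minutes

    // Returns the start (epoch millis) of the window owning this event.
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
    }
}
```

An event stamped anywhere in [00:00, 02:00) lands in the window starting at 00:00, and one stamped 02:30 lands in the window starting at 02:00 — regardless of when the element actually arrives for processing, which is exactly why event-time windowing gives correct results for late data.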
When in processing time are results materialized?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))))
    .apply(Sum.integersPerKey());
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
➔ What results are calculated?
Transformations
➔ Where in event time are results calculated?
Windowing
➔ When in processing time are results materialized?
Watermark / Triggers
➔ How do refinements of results relate?
Accumulations
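The effect of accumulation mode — the "How do refinements relate?" question — can be sketched with plain Java. Each inner list below stands for the new elements of one trigger firing (pane) within a single window: accumulating mode re-emits the running sum over everything seen so far, while discarding mode (shown for contrast) emits only each pane's own contribution. PaneAccumulationDemo is illustrative, not a runner implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch contrasting accumulating vs. discarding fired panes for a
// sum over one window. Each inner list is one pane's new elements.
class PaneAccumulationDemo {
    // Accumulating: each firing emits the sum of everything so far.
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> firings = new ArrayList<>();
        int total = 0;
        for (List<Integer> pane : panes) {
            for (int v : pane) total += v;
            firings.add(total);
        }
        return firings;
    }

    // Discarding: each firing emits only the sum of the new pane.
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> firings = new ArrayList<>();
        for (List<Integer> pane : panes) {
            int paneSum = 0;
            for (int v : pane) paneSum += v;
            firings.add(paneSum);
        }
        return firings;
    }
}
```

With panes [3, 4], [2], [1], accumulating mode fires 7, then 9, then 10 — each later firing refines (replaces) the earlier value — whereas discarding mode fires 7, 2, 1 and leaves combining them to the consumer.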
Runners & SDKs
Beam Runners & SDKs
★ Beam Capability Matrix
★ Choice of SDK: Users write their
pipelines in a language that’s
familiar and integrated with their
other tooling
★ Future Proof: keep the code you
invested so much to write while
big data processing solutions
evolve
★ Choice of Runner: ability to
choose from a variety of runtimes
according to your current needs
Fully managed, no-ops environment
Dataflow fully automates management of required processing resources. Auto-scaling means optimum throughput and better overall price-to-performance. Monitor your jobs in near real time.
Integrated with GCP products
Integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing. It can also be extended to interact with other sources and sinks such as Apache Kafka and HDFS.
Based on Apache Beam's SDK
Developers wishing to extend the Dataflow programming model can fork and/or submit pull requests to the Apache Beam SDKs. Pipelines can also run on alternate runtimes like Spark and Flink.
Reliable & Consistent Processing
Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern or pipeline complexity.
Built on top of Compute Engine
Google Compute Engine ranked #1 in price-performance by Cloud Spectator
Examples of Data Source/Sink in GCP
Cloud Pub/Sub
Fully-managed real-time messaging service
that allows you to send and receive messages
between independent applications.
Google BigQuery
Fully-managed, serverless, petabyte scale,
low cost enterprise data warehouse for
analytics. BigQuery scans TB in seconds.
Architecture Diagram
Architecture: Gaming > Gaming Analytics
[Diagram: gaming logs arrive from multiple platforms as real-time events (streaming) and as batch loads. App Engine handles authentication; Cloud Pub/Sub provides async messaging; Cloud Storage holds log data; Cloud Dataflow does the data processing; BigQuery serves as the analytics engine; Cloud Datalab enables data exploration; results are reported and shared for business analysis.]
Architecture Diagram
Data Analytics Platform Architecture
[Diagram: source databases (on-premise or AWS) feed the platform. Cloud Storage stages raw files and processed data; Cloud Functions reacts to events; App Engine runs scheduled cron jobs; Cloud Dataflow processes and transforms the data; BigQuery provides analytics; Stackdriver Monitoring, Logging and Error Reporting cover monitoring and alerting; results are reported and shared for business analysis.]
1. Ingestion with gsutil
2. Streamed processing of data
3. Scheduled processing of data
4. Data transformation process
5. Data prepared for analysis
6. Staging of processed data
7. Generation of reports
8. Monitoring of all resources
GCP Trial + Always Free: cloud.google.com/free
Thank you!
- Q&A -

Ronisha Informatics Private Limited Catalogueitservices996
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 

Recently uploaded (20)

Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 

Build an ETL Pipeline with Apache Beam and Google Cloud Dataflow

  • 1. How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow Lucas Arruda lucas@ciandt.com ciandt.com
  • 2. ● Apache Beam ○ Concepts ○ Bounded data (Batch) ○ Unbounded data (Streaming) ○ Runners / SDKs ● Big Data Landscape Table of Contents ciandt.com ● Google Cloud Dataflow ○ Capabilities of a managed service ○ Integration with other GCP products ○ Architecture Diagram Samples ○ GCP Trial + Always Free 1 of 36
  • 3. Lucas Arruda Software Architect @ CI&T ciandt.com Introducing myself Specialist on Google Cloud Platform ● Google Cloud Professional - Cloud Architect Certified ● Google Developer Expert (GDE) ● Google Authorized Trainer ○ trained 200+ people so far ○ Member of Google Cloud Insiders ● Google Qualified Developer in: ● Projects for Google and Fortune 200 companies Compute Engine App Engine Cloud Storage Cloud SQL BigQuery 2 of 36
  • 4. Big Data Isn't That Important: It's The Small Insights That Will Change Everything - forbes.com ciandt.com3 of 36
  • 6. The real value of data and analytics lies in their ability to deliver rich insights. - localytics.com ciandt.com5 of 36
  • 9. The Evolution of Apache Beam ciandt.com MapReduce BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel Apache Beam Google Cloud Dataflow 8 of 36
  • 10. ● Clearly separates data processing logic from runtime requirements ● Expresses data-parallel Batch & Streaming algorithms using one unified API What is Apache Beam? ciandt.com Unified Programming Model for Batch & Streaming ● Supports execution on multiple distributed processing runtime environments 9 of 36
  • 11. Beam Model Concepts Apache Beam 10 of 36
  • 12. Beam Concepts ciandt.com PCollections public static void main(String[] args) { // Create the pipeline. PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create(); Pipeline p = Pipeline.create(options); // Create the PCollection 'lines' by applying a 'Read' transform. PCollection<String> lines = p.apply( "ReadMyFile", TextIO.read().from("protocol://path/to/some/inputData.txt")); } The PCollection abstraction represents a potentially distributed, multi-element data set. It is immutable and can be either bounded (size is fixed/known) or unbounded (unlimited size). 11 of 36
  • 13. Beam Concepts ciandt.com PTransforms Transforms are the operations in your pipeline. A transform takes a PCollection (or more than one PCollection) as input, performs an operation that you specify on each element in that collection, and produces a new output PCollection. [Output PCollection] = [Input PCollection].apply([Transform]) The Beam SDK provides the following core transforms: ● ParDo ● GroupByKey ● Combine ● Flatten and Partition [Output PCollection] = [Input PCollection].apply([First Transform]) .apply([Second Transform]) .apply([Third Transform]) 12 of 36
  • 14. Beam Concepts - Core PTransforms ParDo // Apply a ParDo to the PCollection PCollection<Integer> wordLengths = words.apply( ParDo .of(new ComputeWordLengthFn())); // Implementation of a DoFn static class ComputeWordLengthFn extends DoFn<String, Integer> { @ProcessElement public void processElement(ProcessContext c) { // Get the input element String word = c.element(); // Emit the output element. c.output(word.length()); } } ciandt.com Combine/Flatten // Sum.SumIntegerFn() combines the elements in the input PCollection. PCollection<Integer> pc = ...; PCollection<Integer> sum = pc.apply( Combine.globally(new Sum.SumIntegerFn())); // Flatten a PCollectionList of PCollection objects of a given type. PCollection<String> pc1 = ...; PCollection<String> pc2 = ...; PCollection<String> pc3 = ...; PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3); PCollection<String> merged = collections.apply(Flatten.<String>pCollections()); GroupByKey input: cat,1 dog,5 and,1 jump,3 tree,2 cat,5 dog,2 and,2 cat,9 and,6 ... → output: cat,[1,5,9] dog,[5,2] and,[1,2,6] jump,[3] tree,[2] ... ciandt.com13 of 36
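The semantics of these core transforms can be sketched in plain Python. This is an illustration of what each transform does to a collection, not the Beam API — the function names here are made up for the sketch:

```python
from collections import defaultdict
from functools import reduce

def par_do(pcollection, do_fn):
    """ParDo: apply do_fn to every element; each call may emit 0..n outputs."""
    out = []
    for element in pcollection:
        out.extend(do_fn(element))
    return out

def group_by_key(pcollection):
    """GroupByKey: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pcollection:
        grouped[key].append(value)
    return dict(grouped)

def combine_globally(pcollection, combine_fn):
    """Combine: reduce a whole collection to a single value."""
    return reduce(combine_fn, pcollection)

def flatten(*pcollections):
    """Flatten: merge several collections of the same type into one."""
    return [e for pc in pcollections for e in pc]

# Mirrors the slide's examples: word lengths, per-key grouping, a global sum.
words = ["cat", "dog", "jump"]
lengths = par_do(words, lambda w: [len(w)])            # ComputeWordLengthFn
pairs = [("cat", 1), ("dog", 5), ("cat", 9)]
print(lengths)                                          # [3, 3, 4]
print(group_by_key(pairs))                              # {'cat': [1, 9], 'dog': [5]}
print(combine_globally([1, 2, 3], lambda a, b: a + b))  # 6
```

In real Beam these operations build a deferred execution graph rather than running eagerly, but the element-level behavior is the same.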
  • 15. Beam Concepts ciandt.com Pipeline Options Use the pipeline options to configure different aspects of your pipeline, such as the pipeline runner that will execute your pipeline, any runner-specific configuration or even provide input to dynamically apply your data transformations. --<option>=<value> Custom Options MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create(); public interface MyOptions extends PipelineOptions { @Description("My custom command line argument.") @Default.String("DEFAULT") String getMyCustomOption(); void setMyCustomOption(String myCustomOption); } 14 of 36
  • 16. Beam Concepts ciandt.com Pipeline I/O Beam provides read and write transforms for a number of common data storage types, including file-based, messaging and database types. Google Cloud Platform is also supported. File-based reading p.apply(“ReadFromText”, TextIO.read().from("protocol://my_bucket/path/to/input-*.csv"); records.apply("WriteToText", TextIO.write().to("protocol://my_bucket/path/to/numbers") .withSuffix(".csv")); File-based writing BigQuery Cloud Pub/Sub Cloud Storage Cloud Bigtable Cloud Datastore Cloud Spanner 15 of 36
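The read → transform → write shape of the TextIO example above can be sketched with local files in plain Python (illustration only, not the Beam API; the file names and the doubling transform are made up for the example):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input-00.csv")
dst = os.path.join(workdir, "numbers.csv")

with open(src, "w") as f:
    f.write("1\n2\n3\n")

# "Read" transform: file -> collection of lines (TextIO.read().from(...))
with open(src) as f:
    lines = [line.strip() for line in f]

# Element-wise transform: parse and double each number
records = [str(int(n) * 2) for n in lines]

# "Write" transform: collection -> output file (TextIO.write().to(...))
with open(dst, "w") as f:
    f.write("\n".join(records) + "\n")

print(records)  # ['2', '4', '6']
```

A real Beam pipeline would shard the output across workers (hence the `input-*.csv` glob and suffix options on the slide), but the source/sink pattern is the same.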
  • 18. Batch Processing on Bounded Data ciandt.com Image: Tyler Akidau. A finite pool of unstructured data on the left is run through a data processing engine, resulting in corresponding structured data on the right 17 of 36
  • 19. Batch Processing on Bounded Data ciandt.com Image: Tyler Akidau. 18 of 36
  • 21. Streaming Processing on Unbounded Data ciandt.com Windowing into fixed windows by event time Windowing into fixed windows by processing time. 20 of 36 If you care about correctness in a system that also cares about time, you need to use event-time windowing.
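Fixed windowing by event time can be sketched in a few lines of plain Python (an illustration of the semantics, not the Beam API): each element is assigned to a window from its own timestamp, no matter how late it arrives in processing time.

```python
def fixed_window(event_time_s, window_size_s=120):
    """Assign an event timestamp (in seconds) to its [start, end) fixed window."""
    start = (event_time_s // window_size_s) * window_size_s
    return (start, start + window_size_s)

# Events arrive out of order in processing time, but event-time windowing
# places each one by when it actually happened. Names are hypothetical.
events = [("late_score", 30), ("score", 130), ("early_score", 250)]
for name, ts in events:
    print(name, fixed_window(ts))
# late_score (0, 120)
# score (120, 240)
# early_score (240, 360)
```

Processing-time windowing, by contrast, would bucket these by arrival order — which is why only event-time windowing gives correct results for out-of-order data.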
  • 22. ➔ What results are calculated? ciandt.com ➔ Where in event time are results calculated? ➔ When in processing time are results materialized? ➔ How do refinements of results relate? 21 of 36
  • 23. What results are calculated? ciandt.com PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey()); 22 of 36
  • 24. Where in Event Time are results calculated? ciandt.com23 of 36
  • 25. Where in Event Time? ciandt.com PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))) .apply(Sum.integersPerKey()); 24 of 36
  • 26. When in processing time are results materialized? ciandt.com PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))))) .apply(Sum.integersPerKey()); 25 of 36
  • 27. How do refinements of results relate? ciandt.com PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1)))) .accumulatingFiredPanes() .apply(Sum.integersPerKey()); 26 of 36
  • 28. ➔ What results are calculated? Transformations ➔ Where in event time are results calculated? Windowing ➔ When in processing time are results materialized? Watermark / Triggers ➔ How do refinements of results relate? Accumulations ciandt.com27 of 36
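Putting the first two questions together, "sum integers per key, per two-minute event-time window" can be sketched in plain Python (the what and the where; triggers and accumulation govern when this result is emitted and how refinements relate, which only a real runner models). The keys and scores below are hypothetical:

```python
from collections import defaultdict

def windowed_sum_per_key(events, window_size_s=120):
    """events: (key, value, event_time_s) triples -> {(key, window): sum}."""
    sums = defaultdict(int)
    for key, value, ts in events:
        start = (ts // window_size_s) * window_size_s
        window = (start, start + window_size_s)
        sums[(key, window)] += value
    return dict(sums)

scores = [("julia", 5, 10), ("julia", 7, 30), ("frank", 3, 130), ("julia", 4, 140)]
print(windowed_sum_per_key(scores))
# {('julia', (0, 120)): 12, ('frank', (120, 240)): 3, ('julia', (120, 240)): 4}
```

With accumulating panes, a late element for an already-fired window would update the stored sum and re-emit the larger total; with discarding panes, each firing would report only the values seen since the previous one.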
  • 30. ciandt.com Beam Runners & SDKs ★ Beam Capability Matrix ★ Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling ★ Future Proof: keep the code you invested so much to write while big data processing solutions evolve ★ Choice of Runner: ability to choose from a variety of runtimes according to your current needs 29 of 36
  • 31. Dataflow fully automates management of required processing resources. Auto-scaling means optimum throughput and better overall price-to-performance. Monitor your jobs in near real-time. Fully managed, no-ops environment ciandt.com Integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing. And can be extended to interact with other sources and sinks like Apache Kafka / HDFS. Integrated with GCP products Developers wishing to extend the Dataflow programming model can fork and/or submit pull requests on the Apache Beam SDKs. Pipelines can also run on alternate runtimes like Spark and Flink. Based on Apache Beam’s SDK Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern or pipeline complexity. Reliable & Consistent Processing 30 of 36
  • 32. Built on top of Compute Engine ciandt.com Google Compute Engine ranked #1 in price-performance by Cloud Spectator 31 of 36
  • 33. Examples of Data Source/Sink in GCP ciandt.com Cloud Pub/Sub Fully-managed real-time messaging service that allows you to send and receive messages between independent applications. Google BigQuery Fully-managed, serverless, petabyte scale, low cost enterprise data warehouse for analytics. BigQuery scans TB in seconds. 32 of 36
  • 34. The product logos contained in this icon library may be used freely and without permission to accurately reference Google's technology and tools, for instance in books or architecture diagrams. Streaming Batch Analytics Engine BigQuery Authentication App Engine Log Data Cloud Storage Data Processing Cloud Dataflow Data Exploration Cloud Datalab Async Messaging Cloud Pub/Sub Gaming Logs Batch Load Real-Time Events Multiple Platforms Report & Share Business Analysis Architecture: Gaming > Gaming Analytics Architecture Diagram ciandt.com33 of 36
  • 35. Monitoring & Alerting Architecture Diagram ciandt.com Data Analytics Platform Architecture Source Databases Report & Share Business Analysis StackDriverMonitoringLogging Error Reporting Scheduled Cron App Engine Data Processing Cloud Dataflow Analytics BigQuery Raw Files Cloud Storage Event-based Cloud Functions File Staging Cloud Storage 1 Ingestion with gsutil 2 Streamed Processing of data 3 Scheduled processing of data 4 Data transformation process 5 Data prepared for analysis 6 Staging of processed data 1 2 3 4 5 6 On Premise? AWS? 7 7 Generation of reports 8 Monitoring of all resources 8 34 of 36
  • 37. Thank you! - Q&A - ciandt.com36 of 36