Google Cloud Dataflow
Two Worlds Become a Much Better One
Eric Schmidt, Product Manager
cloude@google.com
Goals
1. You leave here understanding the fundamentals of Cloud Dataflow and
possibly having drawn some comparisons to existing data processing models.
2. We have some fun.
The Cloud Big Data
Promise of the Cloud and Big Data
Optimized
The Cloud For Big Data
Promise of the Cloud and Big Data
Batch Streaming
Data Processing
And
Batch Streaming
Data Processing
Time to answer some questions
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
What was the average viewing time over the past
7 days, compared to the last year?
How many active viewers did I have in the last
minute?
How many sales were made in the last
hour due to advertising conversion?
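The volumes on this slide are mutually consistent under two assumptions the deck does not state: each of the 1M devices emits one event per minute, and a month has 30 days. A quick sketch of that arithmetic (the class name is illustrative only):

```java
// Back-of-the-envelope check of the event volumes quoted above.
// Assumptions (not stated in the deck): one event per device per minute,
// 30-day months, and 12 such months per year.
final class EventVolume {
    static final long DEVICES = 1_000_000L;

    static double eventsPerSecond() {
        return DEVICES / 60.0;                   // roughly 16,667 events/sec
    }

    static double eventsPerMonth() {
        return eventsPerSecond() * 86_400 * 30;  // roughly 43.2 billion
    }

    static double eventsPerYear() {
        return eventsPerMonth() * 12;            // roughly 518.4 billion
    }

    public static void main(String[] args) {
        System.out.printf("%.1fK events/sec%n", eventsPerSecond() / 1e3);
        System.out.printf("%.1fB events/month%n", eventsPerMonth() / 1e9);
        System.out.printf("%.1fB events/year%n", eventsPerYear() / 1e9);
    }
}
```

The slide's 16.6K, 43B, and 518B read as these values truncated.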
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
The tension and polarity of Big Data
Accuracy Speed
Cost control Complexity
Time to answer
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and
creates lag
The reality of Big Data elasticity & business
What if you just had a simple knob for speed?
… that also provides accuracy control & intelligent resource elasticity
… to reduce operational complexity
… while optimizing resources to reduce cost
Cloud Dataflow
Cloud Dataflow is a
collection of SDKs for
building batch or
streaming parallelized
data processing pipelines.
Cloud Dataflow is a fully
managed service for
executing optimized
parallelized data processing
pipelines.
Where might you use Cloud Dataflow?
ETL
• Movement
• Filtering
• Enrichment
• Shaping
Analysis
• Reduction
• Batch computation
• Continuous computation
Orchestration
• Composition
• External orchestration
• Simulation
Benefits of Cloud Dataflow
❯ No Ops - truly elastic data processing for the cloud
• On demand resource allocation w/intelligent auto-scaling
• Automated worker lifetime management
• Automated work optimization
❯ Unified model - for batch & stream based processing
• Functional programming model
• Fine grained correctness primitives
❯ Open sourced SDK @ github
• Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK
• Python 2 in progress
• Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow
Release Timeline
• June 24, 2014: Early Access Preview at Google I/O
• Dec. 17, 2014: Alpha
• April 16, 2015: Beta - now open to everyone
• Next milestone: GA
cloud.google.com/dataflow
Management
Mobile
Developer Tools
Compute
Networking
Big Data
Storage
Big Data on Google Cloud
Capture
Pub/Sub
Process
Dataflow
Store
Storage
SQL
Datastore
Analyze
BigQuery
Dataflow
Open Source Tools
Big Data on Google Cloud
BigQuery
Ingest data at 100,000
rows per second
Dataflow
Stream & batch
processing, unified and
simplified
Pub/Sub
Scalable, flexible, and
globally available
messaging
Fully Managed, No-Ops Services
Time to answer some questions
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
What was the average viewing time over the past
7 days, compared to the last year?
How many active viewers did I have in the last
minute?
How many sales were made in the last 30
minutes due to advertising conversion?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow BigQuery
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Dataflow SDK
+
How many active viewers did I have in the last
minute?
Let’s build something
1M
Devices
16.6K Events/sec
43B Events/month
518B Events/year
Cloud Pub/Sub Cloud Dataflow BigQuery
How many active viewers did I have in the last
minute?
Let’s build something - Demo!
Create a globally available queue
Create a dataset for massive scale ingest and query execution
Submit job
Cloud Pub/Sub
• Globally redundant
• Low latency (sub sec.)
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
[Diagram: Publishers A, B, and C publish Messages 1, 2, and 3 to Topics A, B, and C in Cloud Pub/Sub. Subscriptions XA, XB, YC, and ZC deliver them to Subscribers X, Y, and Z. Message 3 appears twice on the delivery side.]
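The diagram suggests that a topic fans out: each subscription attached to a topic receives its own copy of a published message. A minimal in-memory sketch of that idea follows. It is not the Cloud Pub/Sub client API; `ToyTopic` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// In-memory toy model of topic/subscription fan-out. This is NOT the Cloud
// Pub/Sub client API; every name here is hypothetical.
final class ToyTopic {
    private final Map<String, Queue<String>> subscriptions = new LinkedHashMap<>();

    void subscribe(String subscriptionName) {
        subscriptions.put(subscriptionName, new ArrayDeque<String>());
    }

    // Every subscription attached to the topic gets its own copy of the message.
    void publish(String message) {
        for (Queue<String> backlog : subscriptions.values()) {
            backlog.add(message);
        }
    }

    // Pull-style delivery: drain whatever is queued for one subscription.
    List<String> pull(String subscriptionName) {
        Queue<String> backlog = subscriptions.get(subscriptionName);
        List<String> delivered = new ArrayList<>(backlog);
        backlog.clear();
        return delivered;
    }

    public static void main(String[] args) {
        ToyTopic topicC = new ToyTopic();
        topicC.subscribe("YC");
        topicC.subscribe("ZC");
        topicC.publish("Message 3");
        System.out.println(topicC.pull("YC")); // prints [Message 3]
        System.out.println(topicC.pull("ZC")); // prints [Message 3]
    }
}
```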
BigQuery
Google BigQuery
Fast ETL
Regex
JSON
Spreadsheets
BI Tools
Coworkers
Your Data
• Scales into Petabytes
• I/O of 1TB in 1 second
• 100,000 rows/sec Streaming API
• Simple data ingest from GCS or Hadoop
• Connect to R, Pandas, Hadoop, etc.
• Now available in our European data centers
• New row level security and data expiration
weekly monthly
Cloud Dataflow SDK Release Process
Cloud Dataflow SDK - Logical Model
Pipeline {                        <- At once guarantee (modulo completeness thresholds)
  Who   => Inputs                 <- GCS, Pub/Sub, BigQuery, w/Avro, XML, JSON, ...
  What  => Transforms             <- Aggregations, Filters, Joins, ...
  Where => Windows                <- Time space: Fixed, Sliding, Sessions, ...
  When  => Watermarks + Triggers  <- Correctness
  To    => Outputs                <- GCS, Pub/Sub, BigQuery, ...
}
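Some Code
Java 7 Implementation
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.values.PCollection;

// OptionsBuilder, PlaybackEvent, and the *Transform classes are not SDK
// classes; they are presumably the demo's own code.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));
// Read the raw dump as lines of text.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));
// Parse each line into a PlaybackEvent.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());
// Fan the parsed events out to three independent branches.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());
p.run();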
❯ A collection of data of type T in a
pipeline - a “hippie cousin” of an RDD
❯ May be either bounded or
unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing
PCollection
❯ Often contains key-value pairs using
KV
{Seahawks, NFC, Champions, Seattle, ...}
{...,
“NFC Champions #GreenBay”,
“Green Bay #superbowl!”,
...
“#GoHawks”,
...}
PCollections
Cloud Dataflow SDK
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
❯ Processes each element of a
PCollection independently using a
user-provided DoFn
❯ Elements are processed in arbitrary
‘bundles’, e.g. “shards”
• startBundle(), processElement() -
N times, finishBundle()
❯ Corresponds to both the Map and
Reduce phases in Hadoop, i.e.
ParDo->GBK->ParDo
KeyBySessionId
ParDo (“Parallel Do”)
Cloud Dataflow SDK
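The bundle lifecycle above can be sketched in plain Java. This is a toy model, not SDK code: `ToyDoFn`, `runBundle`, and `KeyByFirstLetter` are hypothetical stand-ins for the SDK's DoFn and the service's runner, and the keying rule simply mirrors the KV<S, Seahawks> example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the bundle lifecycle. ToyDoFn and runBundle are hypothetical
// stand-ins, not the SDK's DoFn or its runner.
final class BundleLifecycle {
    interface ToyDoFn<I, O> {
        void startBundle();
        void processElement(I element, List<O> output);
        void finishBundle();
    }

    // Calls startBundle() once, processElement() once per element,
    // then finishBundle() once.
    static <I, O> List<O> runBundle(ToyDoFn<I, O> fn, List<I> bundle) {
        List<O> output = new ArrayList<>();
        fn.startBundle();
        for (I element : bundle) {
            fn.processElement(element, output);
        }
        fn.finishBundle();
        return output;
    }

    // Keys each word by its first letter, e.g. "Seahawks" becomes "S=Seahawks".
    // The calls list records the order in which lifecycle methods ran.
    static final class KeyByFirstLetter implements ToyDoFn<String, String> {
        final List<String> calls = new ArrayList<>();

        public void startBundle() {
            calls.add("start");
        }

        public void processElement(String word, List<String> output) {
            calls.add("process");
            output.add(word.substring(0, 1) + "=" + word);
        }

        public void finishBundle() {
            calls.add("finish");
        }
    }

    public static void main(String[] args) {
        KeyByFirstLetter fn = new KeyByFirstLetter();
        List<String> keyed = runBundle(fn, Arrays.asList("Seahawks", "NFC"));
        System.out.println(keyed);    // prints [S=Seahawks, N=NFC]
        System.out.println(fn.calls); // prints [start, process, process, finish]
    }
}
```

Because each element is handled independently, the runner is free to split the input into bundles however it likes.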
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value
pairs and gathers up all values
with the same key
• Corresponds to the shuffle phase
in Hadoop
Cloud Dataflow SDK
GroupByKey
{KV<S, {Seahawks, Seattle, …}>,
KV<N, {NFC, …}>,
KV<C, {Champions, …}>}
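For a bounded input, the grouping step can be sketched with ordinary collections. This is a single-machine illustration of the semantics only, with hypothetical names (`ToyGroupByKey`, `kv`); the service performs the equivalent as a distributed shuffle.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-machine sketch of GroupByKey semantics on a bounded input.
// Hypothetical names; not SDK code.
final class ToyGroupByKey {
    static Map.Entry<String, String> kv(String key, String value) {
        return new SimpleImmutableEntry<String, String>(key, value);
    }

    // Gathers all values that share a key, preserving first-seen key order.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            List<V> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    static Map<String, List<String>> demo() {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(kv("S", "Seahawks"));
        pairs.add(kv("C", "Champions"));
        pairs.add(kv("S", "Seattle"));
        pairs.add(kv("N", "NFC"));
        return groupByKey(pairs);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {S=[Seahawks, Seattle], C=[Champions], N=[NFC]}
    }
}
```

The loop only finishes because the input ends, which is why an unbounded PCollection needs windows before it can be grouped.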
❯ Logically divides up or groups the elements of a
PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on
an unbounded PCollection, but can also be used
for bounded PCollections
❯ Window.into() can be called at any point in the
pipeline and will be applied when needed
❯ Can be tied to arrival/processing time or custom
event time
❯ Watermarks + Triggers enable robust
correctness
Windows
Cloud Dataflow SDK
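Window assignment by event time can be sketched without the SDK. The code below is a toy model under stated assumptions: timestamps are non-negative milliseconds, windows are half-open intervals [start, start + size), and all names (`ToyWindows`, `activeViewers`, and so on) are hypothetical. It shows how "active viewers in the last minute" reduces to a distinct count per one-minute fixed window.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of window assignment by event time. Not SDK code: names are
// hypothetical, and timestamps are assumed to be >= 0 milliseconds.
final class ToyWindows {
    // Start of the fixed window [start, start + size) containing the event.
    static long fixedWindowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - (eventTimeMillis % sizeMillis);
    }

    // Starts of every sliding window [start, start + size) containing the
    // event, where a new window opens every periodMillis.
    static List<Long> slidingWindowStarts(long eventTimeMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long latest = eventTimeMillis - (eventTimeMillis % periodMillis);
        for (long start = latest; start > eventTimeMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }

    // Distinct viewers per fixed window, keyed by window start.
    static Map<Long, Integer> activeViewers(long[] eventTimes, String[] viewerIds, long sizeMillis) {
        Map<Long, Set<String>> viewersByWindow = new TreeMap<>();
        for (int i = 0; i < eventTimes.length; i++) {
            long start = fixedWindowStart(eventTimes[i], sizeMillis);
            Set<String> viewers = viewersByWindow.get(start);
            if (viewers == null) {
                viewers = new HashSet<>();
                viewersByWindow.put(start, viewers);
            }
            viewers.add(viewerIds[i]);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, Set<String>> entry : viewersByWindow.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] times = {5_000L, 20_000L, 59_999L, 60_000L, 61_000L};
        String[] viewers = {"a", "b", "a", "a", "c"};
        System.out.println(activeViewers(times, viewers, 60_000L));         // prints {0=2, 60000=2}
        System.out.println(slidingWindowStarts(45_000L, 60_000L, 30_000L)); // prints [30000, 0]
    }
}
```

What this sketch leaves out is exactly what watermarks and triggers address: deciding when a window's count is complete enough to emit.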
Nighttime Mid-Day Nighttime
Cloud Dataflow Service
Managing with correctness
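.apply(Window.<KV<String, PlaybackEvent>>into(
        // One-minute fixed windows.
        FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterEach.inOrder(
            // First, fire once when the watermark passes the end of the window.
            AfterWatermark.pastEndOfWindow(),
            // Then fire 10 minutes (processing time) after the first element
            // of each new pane ...
            Repeatedly.forever(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                // ... until the watermark is 2 days past the end of the window.
                .orFinally(AfterWatermark
                    .pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(2)))))
    // Each pane carries only the elements that arrived since the last firing.
    .discardingFiredPanes());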
GroupByKey
Pair With Ones
Sum Values
Count
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join,
Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse,
etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Cloud Dataflow SDK
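The Count diagram (Pair With Ones, GroupByKey, Sum Values) can be sketched as three small functions composed into one. This is plain Java for illustration, not the SDK's Count transform; `ToyCount` and its method names are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a composite: Count = PairWithOnes -> GroupByKey -> SumValues.
// Hypothetical names; not the SDK's Count transform.
final class ToyCount {
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> elements) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String element : elements) {
            pairs.add(new SimpleImmutableEntry<String, Integer>(element, 1));
        }
        return pairs;
    }

    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            sums.put(entry.getKey(), sum);
        }
        return sums;
    }

    // The composite: callers see one step, built from three existing ones.
    static Map<String, Integer> count(List<String> elements) {
        return sumValues(groupByKey(pairWithOnes(elements)));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("#GoHawks", "#superbowl", "#GoHawks")));
        // prints {#GoHawks=2, #superbowl=1}
    }
}
```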
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Cloud Dataflow Service
Progress & Logs
● ParDo fusion
○ Producer Consumer
○ Sibling
○ Intelligent fusion
boundaries
● Combiner lifting e.g. partial
aggregations before
reduction
● Reshard placement
...
Graph Optimization
Cloud Dataflow Service
Legend: [node] = ParallelDo, GBK = GroupByKey, + = CombineValues
• Consumer-producer fusion: C D => C+D
• Sibling fusion: C D => C+D
• Combiner lifting: A GBK + B => A+ GBK + B
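Combiner lifting can be illustrated with a sum. In this sketch (plain Java, hypothetical names, with an array of shards standing in for workers) each shard pre-aggregates locally, so only one partial value per shard has to cross the shuffle, and the final answer is unchanged. It relies on the combine function being associative and commutative, which holds for sum.

```java
import java.util.Arrays;

// Sketch of combiner lifting for a sum. Each inner array stands in for the
// values held by one worker shard. Hypothetical names; not service code.
final class CombinerLifting {
    // Unlifted: every value crosses the shuffle, then one reducer sums them.
    static long sumAfterShuffle(long[][] shards) {
        long total = 0;
        for (long[] shard : shards) {
            for (long value : shard) {
                total += value;
            }
        }
        return total;
    }

    // Lifted, step 1: each shard computes a partial sum before the shuffle.
    static long[] partialSums(long[][] shards) {
        long[] partials = new long[shards.length];
        for (int i = 0; i < shards.length; i++) {
            for (long value : shards[i]) {
                partials[i] += value;
            }
        }
        return partials;
    }

    // Lifted, step 2: only the partials are shuffled and combined.
    static long combinePartials(long[] partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        long[][] shards = {{1, 2, 3}, {4, 5}, {6}};
        long[] partials = partialSums(shards);
        System.out.println(Arrays.toString(partials));  // prints [6, 9, 6]
        System.out.println(combinePartials(partials));  // prints 21
        System.out.println(sumAfterShuffle(shards));    // prints 21
    }
}
```

Here six values shrink to three partials before the shuffle; with real data volumes the reduction is what makes the optimization worthwhile.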
Deploy Schedule & Monitor Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
Worker Scaling
Cloud Dataflow Service
Decreased Clock Time
100 mins. vs. 65 mins.
Dynamic Work Rebalancing
Cloud Dataflow Service
Optimized
The Cloud For Big Data
Promise of the Cloud and Big Data
Optimizing Your Time To Answer
Typical Data Processing
• Programming
• Resource provisioning
• Performance tuning
• Monitoring
• Reliability
• Deployment & configuration
• Handling growing scale
• Utilization improvements
Data Processing with Cloud Dataflow
• Programming
• More time to dig into your data
Thank You!
cloud.google.com/dataflow
cloude@google.com
StackOverflow @ google+cloud+dataflow

More Related Content

What's hot

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 

What's hot (20)

Google Cloud Platform
Google Cloud PlatformGoogle Cloud Platform
Google Cloud Platform
 
Tom Grey - Google Cloud Platform
Tom Grey - Google Cloud PlatformTom Grey - Google Cloud Platform
Tom Grey - Google Cloud Platform
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Infographic: AWS vs Azure vs GCP: What's the best cloud platform for enterprise?
Infographic: AWS vs Azure vs GCP: What's the best cloud platform for enterprise?Infographic: AWS vs Azure vs GCP: What's the best cloud platform for enterprise?
Infographic: AWS vs Azure vs GCP: What's the best cloud platform for enterprise?
 
Deep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming ApplicationsDeep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming Applications
 
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
How Can I Build a Landing Zone & Extend my Operations into AWS to Support my ...
 
A Tour of Google Cloud Platform
A Tour of Google Cloud PlatformA Tour of Google Cloud Platform
A Tour of Google Cloud Platform
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Introduction to Google Cloud Services / Platforms
Introduction to Google Cloud Services / PlatformsIntroduction to Google Cloud Services / Platforms
Introduction to Google Cloud Services / Platforms
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Google Cloud Platform (GCP)
Google Cloud Platform (GCP)Google Cloud Platform (GCP)
Google Cloud Platform (GCP)
 
AWS Route53
AWS Route53AWS Route53
AWS Route53
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Introduction to GCP
Introduction to GCPIntroduction to GCP
Introduction to GCP
 
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 

Viewers also liked

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Trieu Nguyen
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Flink Forward
 

Viewers also liked (14)

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
Strtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP EngineStrtio Spark Streaming + Siddhi CEP Engine
Strtio Spark Streaming + Siddhi CEP Engine
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
Stratio platform overview v4.1
Stratio platform overview v4.1Stratio platform overview v4.1
Stratio platform overview v4.1
 
[Strata] Sparkta
[Strata] Sparkta[Strata] Sparkta
[Strata] Sparkta
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache BeamScio - A Scala API for Google Cloud Dataflow & Apache Beam
Scio - A Scala API for Google Cloud Dataflow & Apache Beam
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
 
Google Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushiGoogle Cloud Dataflow を理解する - #bq_sushi
Google Cloud Dataflow を理解する - #bq_sushi
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Présentation Google Dataflow
Présentation Google DataflowPrésentation Google Dataflow
Présentation Google Dataflow
 

Similar to Google Cloud Dataflow Two Worlds Become a Much Better One

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
DataWorks Summit
 

Similar to Google Cloud Dataflow Two Worlds Become a Much Better One (20)

Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

Google Cloud Dataflow Two Worlds Become a Much Better One

  • 1. Google Cloud Dataflow Two Worlds Become a Much Better One Eric Schmidt, Product Manager cloude@google.com
  • 2. Goals: 1. You leave here understanding the fundamentals of Cloud Dataflow and possibly have drawn some comparisons to existing data processing models. 2. We have some fun.
  • 3. The Cloud Big Data Promise of the Cloud and Big Data
  • 4. Optimized The Cloud For Big Data Promise of the Cloud and Big Data
  • 7. Time to answer some questions 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year What was the average viewing time over the past 7 days, compared to the last year? How many active viewers did I have in the last minute? How many sales were made in the last hour due to advertising conversion?
  • 8. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year
  • 9. The tension and polarity of Big Data: Accuracy, Speed, Cost control, Complexity, Time to answer
  • 10. The reality of Big Data elasticity & business ❯ Time & life never stop ❯ Data rates & schema are not static ❯ Scaling models are not static ❯ Non-elastic compute is wasteful and creates lag
  • 11. What if you just had a simple knob for speed? … that also provides accuracy control & intelligent resource elasticity … to reduce operational complexity … while optimizing resources to reduce cost
  • 12. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 13. Where might you use Cloud Dataflow? ETL, Analysis, Orchestration • Movement • Filtering • Enrichment • Shaping • Reduction • Batch computation • Continuous computation • Composition • External orchestration • Simulation
  • 14. Benefits of Cloud Dataflow ❯ No Ops - truly elastic data processing for the cloud • On demand resource allocation w/intelligent auto-scaling • Automated worker lifetime management • Automated work optimization ❯ Unified model - for batch & stream based processing • Functional programming model • Fine grained correctness primitives ❯ Open sourced SDK @ GitHub • Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK • Python 2 in progress • Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl • Spark runner @ /cloudera/spark-dataflow • Flink runner @ /dataArtisans/flink-dataflow
  • 15. Release Timeline • June 24, 2014: Early Access Preview at Google I/O • Dec. 17, 2014: Alpha • Next milestone...
  • 16. Release Timeline • June 24, 2014: Early Access Preview at Google I/O • Dec. 17, 2014: Alpha • April 16, 2015: Beta - now open to everyone • Next milestone: GA • cloud.google.com/dataflow
  • 18. Big Data on Google Cloud Capture Pub/Sub Process Dataflow Store Storage SQL Datastore Analyze BigQuery Dataflow Open Source Tools
  • 19. Big Data on Google Cloud BigQuery Ingest data at 100,000 rows per second Dataflow Stream & batch processing, unified and simplified Pub/Sub Scalable, flexible, and globally available messaging Fully Managed, No-Ops Services
  • 20. Time to answer some questions 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year What was the average viewing time over the past 7 days, compared to the last year? How many active viewers did I have in the last minute? How many sales were made in the last 30 minutes due to advertising conversion?
  • 21. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow BigQuery How many active viewers did I have in the last minute?
  • 22. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow How many active viewers did I have in the last minute?
  • 23. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Dataflow SDK + How many active viewers did I have in the last minute?
  • 24. Let’s build something 1M Devices 16.6K Events/sec 43B Events/month 518B Events/year Cloud Pub/Sub Cloud Dataflow BigQuery How many active viewers did I have in the last minute?
  • 25. Let’s build something - Demo! Create a globally available queue Create a dataset for massive scale ingest and query execution Submit job
  • 26. Cloud Pub/Sub • Globally redundant • Low latency (sub sec.) • Batched read/write • Custom labels • Push & Pull • Auto expiration [Diagram: Publishers A, B, C send Messages 1, 2, 3 to Topics A, B, C; Subscriptions XA, XB, YC, ZC deliver them to Subscribers X, Y, Z]
  • 27. Google BigQuery • Scales into Petabytes • I/O of 1TB in 1 second • 100,000 rows/sec Streaming API • Simple data ingest from GCS or Hadoop • Connect to R, Pandas, Hadoop, etc. • Now available in our European data centers • New row level security and data expiration [Diagram labels: Your Data, Fast ETL, Regex, JSON, Spreadsheets, BI Tools, Coworkers]
  • 28. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 29. weekly monthly Cloud Dataflow SDK Release Process
  • 30. Cloud Dataflow SDK - Logical Model <- At once guarantee (modulo completeness thresholds) Pipeline{ Who => Inputs <- GCS, Pub/Sub, BigQuery, w/Avro, XML, JSON, ... What => Transforms <- Aggregations, Filters, Joins, ... Where => Windows <- Time space: Fixed, Sliding, Sessions, ... When => Watermarks + Triggers <- Correctness To => Outputs <- GCS, Pub/Sub, BigQuery, ... }
  • 31. Some Code - Java 7 Implementation: Pipeline p = Pipeline.create(OptionsBuilder.RunOnService(true, false)); PCollection<String> rawData = p.begin().apply(TextIO.Read.from(OptionsBuilder.GCS_RAWDUMP_URI)); PCollection<PlaybackEvent> events = rawData.apply(new ParseTransform()); events.apply(new ArchiveTransform()); events.apply(new SessionAnalysisTransform()); events.apply(new AssetTransform()); p.run();
  • 32. PCollections Cloud Dataflow SDK ❯ A collection of data of type T in a pipeline - a “hippie cousin” of an RDD ❯ May be either bounded or unbounded in size ❯ Created by using a PTransform to: • Build from a java.util.Collection • Read from a backing data store • Transform an existing PCollection ❯ Often contain key-value pairs using KV Examples: {Seahawks, NFC, Champions, Seattle, ...} {..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...}
  • 33. ParDo (“Parallel Do”) Cloud Dataflow SDK ❯ Processes each element of a PCollection independently using a user-provided DoFn ❯ Elements are processed in arbitrary ‘bundles’, e.g. “shards” • startBundle(), processElement() - N times, finishBundle() ❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo KeyBySessionId: {Seahawks, NFC, Champions, Seattle, ...} -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
  • 34. GroupByKey Cloud Dataflow SDK • Takes a PCollection of key-value pairs and gathers up all values with the same key • Corresponds to the shuffle phase in Hadoop {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} -> GroupByKey -> {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>} Wait a minute… How do you do a GroupByKey on an unbounded PCollection?
  • 35. Windows Cloud Dataflow SDK ❯ Logically divides up or groups the elements of a PCollection into finite windows • Fixed Windows: hourly, daily, … • Sliding Windows • Sessions ❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections ❯ Window.into() can be called at any point in the pipeline and will be applied when needed ❯ Can be tied to arrival/processing time or custom event time ❯ Watermarks + Triggers enable robust correctness [Figure labels: Nighttime, Mid-Day, Nighttime]
  • 36. Cloud Dataflow Service Managing with correctness .apply(Window.<KV<String, PlaybackEvent>>into( FixedWindows.of(Duration.standardMinutes(1))) .triggering( AfterEach.inOrder( AfterWatermark.pastEndOfWindow(), Repeatedly.forever(AfterProcessingTime .pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(10))) .orFinally(AfterWatermark .pastEndOfWindow() .plusDelayOf(Duration.standardDays(2))))) .discardingFiredPanes());
  • 37. Composite PTransforms Cloud Dataflow SDK ❯ Define new PTransforms by building up subgraphs of existing transforms ❯ Some utilities are included in the SDK • Count, RemoveDuplicates, Join, Min, Max, Sum, ... ❯ You can define your own: • DoSomething, DoSomethingElse, etc. ❯ Why bother? • Code reuse • Better monitoring experience [Diagram: Count = Pair With Ones -> GroupByKey -> Sum Values]
  • 38. Cloud Dataflow Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines. Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
  • 39. GCP Managed Service User Code & SDK Work Manager Deploy & Schedule Monitoring UI Job Manager Cloud Dataflow Service Progress & Logs
  • 40. Graph Optimization Cloud Dataflow Service ● ParDo fusion ○ Producer-consumer ○ Sibling ○ Intelligent fusion boundaries ● Combiner lifting, e.g. partial aggregations before reduction ● Reshard placement ... [Diagram: legend: ParallelDo; GBK = GroupByKey; + = CombineValues. Consumer-producer fusion: C, D -> C+D. Sibling fusion: C, D -> C+D. Combiner lifting: A, GBK, +, B -> A+, GBK, +, B]
  • 41. Worker Lifecycle Management Cloud Dataflow Service: Deploy, Schedule & Monitor, Tear Down
  • 42. Worker Scaling Cloud Dataflow Service: Decreased Clock Time
  • 43. Dynamic Work Rebalancing Cloud Dataflow Service: 100 mins. vs. 65 mins.
  • 44. Optimized The Cloud For Big Data Promise of the Cloud and Big Data
  • 45. Optimizing Your Time To Answer. Typical Data Processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling Growing Scale, Utilization improvements. Data Processing with Cloud Dataflow: Programming, More time to dig into your data
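
Slides 33, 34 and 37 describe Count as a composite built from three steps: Pair With Ones, GroupByKey, Sum Values. The sketch below imitates that shape with plain java.util collections so the data flow is easy to follow. It does not use the Dataflow SDK; the class and method names (CountSketch, pairWithOnes, groupByKey, sumValues) are made up for illustration, and everything runs in memory on a small bounded list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain collections standing in for PCollections.
public class CountSketch {

    // "Pair With Ones": a per-element step (the ParDo role) that emits (element, 1).
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> elements) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String element : elements) {
            out.add(new SimpleEntry<>(element, 1));
        }
        return out;
    }

    // "GroupByKey": gather all values that share a key (the shuffle role).
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            List<Integer> values = grouped.get(kv.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(kv.getKey(), values);
            }
            values.add(kv.getValue());
        }
        return grouped;
    }

    // "Sum Values": reduce each key's list of values to one total.
    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int total = 0;
            for (int value : entry.getValue()) {
                total += value;
            }
            sums.put(entry.getKey(), total);
        }
        return sums;
    }

    // The composite: a subgraph of the three steps above, reusable as one unit.
    static Map<String, Integer> count(List<String> elements) {
        return sumValues(groupByKey(pairWithOnes(elements)));
    }

    public static void main(String[] args) {
        // prints {Seahawks=2, NFC=1, Seattle=1}
        System.out.println(count(Arrays.asList("Seahawks", "NFC", "Seahawks", "Seattle")));
    }
}
```

Wrapping the three steps in count() mirrors the "code reuse" argument on slide 37: callers see one named unit instead of three separate stages.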
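
Slide 35 says fixed windows divide a PCollection into finite pieces so that GroupByKey-style work becomes possible on unbounded data, and the running question in the deck is "How many active viewers did I have in the last minute?". A minimal sketch of that idea, in plain Java rather than the Dataflow SDK: each event is assigned to a one-minute window by its event timestamp, and distinct viewers are counted per window. The names (FixedWindowSketch, windowStart, activeViewersPerWindow) and the sample data are hypothetical, and watermarks, triggers and late data are deliberately left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: fixed one-minute windows over in-memory events.
public class FixedWindowSketch {

    static final long WINDOW_MILLIS = 60_000L;

    // Start of the fixed window that contains the given event timestamp.
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    // Distinct viewer count per window; eventTimes[i] belongs to viewerIds[i].
    static Map<Long, Integer> activeViewersPerWindow(long[] eventTimes, String[] viewerIds) {
        Map<Long, Set<String>> viewersByWindow = new TreeMap<>();
        for (int i = 0; i < eventTimes.length; i++) {
            long start = windowStart(eventTimes[i]);
            Set<String> viewers = viewersByWindow.get(start);
            if (viewers == null) {
                viewers = new TreeSet<>();
                viewersByWindow.put(start, viewers);
            }
            viewers.add(viewerIds[i]);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, Set<String>> entry : viewersByWindow.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] times = {5_000L, 30_000L, 59_999L, 60_000L, 125_000L};
        String[] viewers = {"a", "b", "a", "c", "a"};
        // prints {0=2, 60000=1, 120000=1}
        System.out.println(activeViewersPerWindow(times, viewers));
    }
}
```

Because the window is derived from the event timestamp alone, the same assignment works whether the events arrive as a batch or one at a time, which is the point of treating windows as part of the logical model.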
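
Slide 40 lists combiner lifting, "partial aggregations before reduction", among the service's graph optimizations. The sketch below shows the general idea only, under the assumption that the combine function is associative and commutative (here, counting): each worker pre-aggregates its own records, and only the partial results are merged after the shuffle. It is plain Java, not the service's optimizer, and the names (CombinerLiftingSketch, partialCount, mergePartials) are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: per-worker partial counts merged after a simulated shuffle.
public class CombinerLiftingSketch {

    // Lifted step: count keys locally on one worker, before any shuffle.
    static Map<String, Long> partialCount(List<String> keys) {
        Map<String, Long> partial = new LinkedHashMap<>();
        for (String key : keys) {
            Long previous = partial.get(key);
            partial.put(key, previous == null ? 1L : previous + 1L);
        }
        return partial;
    }

    // Final reduction: merge the per-worker partial counts by key.
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new LinkedHashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                Long previous = total.get(entry.getKey());
                long value = entry.getValue();
                total.put(entry.getKey(), previous == null ? value : previous + value);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> worker1 = Arrays.asList("S", "S", "N");
        List<String> worker2 = Arrays.asList("S", "C");

        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(partialCount(worker1)); // {S=2, N=1}
        partials.add(partialCount(worker2)); // {S=1, C=1}

        // Four partial records cross the shuffle instead of five raw ones.
        // prints {S=3, N=1, C=1}
        System.out.println(mergePartials(partials));
    }
}
```

The saving is small here, but it grows with the number of repeated keys per worker, since each worker ships at most one record per key.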