SlideShare a Scribd company logo
Through the Looking Glass @ mabl
An intro to scalable, distributed counting using
google data flow
Geoff Cooney mabl engineer Boston GCP Meetup
Counting is hard
2
■ Apache Beam
▲ Common framework for batch and stream processing
▲ Abstracts the runner from the processing specification
● Plug and Play runners….if your features are supported
(https://beam.apache.org/documentation/runners/capability
-matrix/)
■ Google Data Flow (v2.0)
▲ Google implementation of an apache beam runner
▲ Manages scaling infrastructure up and down to meet needs
▲ Integrated with stackdriver logging
Introducing Apache Beam and Google Data Flow
3
■ Bounded vs. Unbounded Data
▲ Does the data end?
■ Pipelines
■ Event Time vs. Processing Time
▲ When did the event occur?
▲ When is dataflow processing it?
■ Watermark
▲ How far in event time have we gotten?
Key Concepts
4
PCollection<KV<String, Long>> counts = rows
.apply("extract ride status",
MapElements.into(TypeDescriptor.of(String.class))
.via(x -> x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
** rows is a PCollection<String> representing a collection of ride status strings
Building the Pipeline: What is being computed?
5
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: Where in Event Time?
6
Building the Pipeline: When in Processing Time?
7
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: How do refinements relate?
8
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Let’s see it in action...
9
Pipeline pipeline = Pipeline.create(pipelineOptions);
PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings()
.fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides"))
.withTimestampAttribute("ts"))
.apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class));
PCollection<TableRow> windowedRows = tableRows.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
PCollection<KV<String, Long>> counts = windowedRows
.apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x ->
x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity()))
.apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId));
pipeline.run();
Questions?

More Related Content

What's hot

Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppet
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試Simon Su
 
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKServerless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKKriangkrai Chaonithi
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud FunctionsJerry Jalava
 
Introduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsIntroduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsMalepati Bala Siva Sai Akhil
 
Kubeflow on google kubernetes engine
Kubeflow on google kubernetes engineKubeflow on google kubernetes engine
Kubeflow on google kubernetes engineBear Su
 
AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker kloia
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less opsJoseph Lust
 
GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享Simon Su
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppet
 
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Codemotion
 
Experiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlExperiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlOkis Chuang
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Introduction to Modern DevOps Technologies
Introduction to  Modern DevOps TechnologiesIntroduction to  Modern DevOps Technologies
Introduction to Modern DevOps TechnologiesKriangkrai Chaonithi
 

What's hot (19)

reBuy on Kubernetes
reBuy on KubernetesreBuy on Kubernetes
reBuy on Kubernetes
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試
 
Kubeflow control plane
Kubeflow control planeKubeflow control plane
Kubeflow control plane
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKServerless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud Functions
 
Introduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsIntroduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud Functions
 
Kubeflow on google kubernetes engine
Kubeflow on google kubernetes engineKubeflow on google kubernetes engine
Kubeflow on google kubernetes engine
 
Go With The Flow
Go With The FlowGo With The Flow
Go With The Flow
 
AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less ops
 
GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
 
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
 
Experiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlExperiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and Postgresql
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Introduction to Modern DevOps Technologies
Introduction to  Modern DevOps TechnologiesIntroduction to  Modern DevOps Technologies
Introduction to Modern DevOps Technologies
 

Similar to Through the looking glass an intro to scalable, distributed counting in dataflow

Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...Altinity Ltd
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Dataplumbee
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorYahoo Developer Network
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Sergey Lukjanov
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...spinningmatt
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices Apigee | Google Cloud
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overviewjimliddle
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 

Similar to Through the looking glass an intro to scalable, distributed counting in dataflow (20)

Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Data
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity Calculator
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overview
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 

Recently uploaded

iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockSkilrock Technologies
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Krakówbim.edu.pl
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownloadvrstrong314
 
Benefits of Employee Monitoring Software
Benefits of  Employee Monitoring SoftwareBenefits of  Employee Monitoring Software
Benefits of Employee Monitoring SoftwareMera Monitor
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationHelp Desk Migration
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfMeon Technology
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Soroosh Khodami
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Gáspár Nagy
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAlluxio, Inc.
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabbereGrabber
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfVictor Lopez
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandIES VE
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfkalichargn70th171
 

Recently uploaded (20)

iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by Skilrock
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Benefits of Employee Monitoring Software
Benefits of  Employee Monitoring SoftwareBenefits of  Employee Monitoring Software
Benefits of Employee Monitoring Software
 
A Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data MigrationA Guideline to Gorgias to to Re:amaze Data Migration
A Guideline to Gorgias to to Re:amaze Data Migration
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 

Through the looking glass an intro to scalable, distributed counting in dataflow

  • 1. Through the Looking Glass @ mabl An intro to scalable, distributed counting using google data flow Geoff Cooney mabl engineer Boston GCP Meetup
  • 3. ■ Apache Beam ▲ Common framework for batch and stream processing ▲ Abstracts the runner from the processing specification ● Plug and Play runners….if your features are supported (https://beam.apache.org/documentation/runners/capability -matrix/) ■ Google Data Flow (v2.0) ▲ Google implementation of an apache beam runner ▲ Manages scaling infrastructure up and down to meet needs ▲ Integrated with stackdriver logging Introducing Apache Beam and Google Data Flow 3
  • 4. ■ Bounded vs. Unbounded Data ▲ Does the data end? ■ Pipelines ■ Event Time vs. Processing Time ▲ When did the event occur? ▲ When is dataflow processing it? ■ Watermark ▲ How far in event time have we gotten? Key Concepts 4
  • 5. PCollection<KV<String, Long>> counts = rows .apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)) .via(x -> x.get("ride_status").toString())) .apply("count total rides", Count.perElement()); ** rows is a PCollection<String> representing a collection of ride status strings Building the Pipeline: What is being computed? 5
  • 6. PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes()); Building the Pipeline: Where in Event Time? 6
  • 7. Building the Pipeline: When in Processing Time? 7 PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes());
  • 8. Building the Pipeline: How do refinements relate? 8 PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes());
  • 9. Let’s see it in action... 9 Pipeline pipeline = Pipeline.create(pipelineOptions); PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings() .fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides")) .withTimestampAttribute("ts")) .apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class)); PCollection<TableRow> windowedRows = tableRows.apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes()); PCollection<KV<String, Long>> counts = windowedRows .apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x -> x.get("ride_status").toString())) .apply("count total rides", Count.perElement()); counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity())) .apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId)); pipeline.run();