SlideShare a Scribd company logo
1 of 10
Download to read offline
Through the Looking Glass @ mabl
An intro to scalable, distributed counting using
google data flow
Geoff Cooney mabl engineer Boston GCP Meetup
Counting is hard
2
■ Apache Beam
▲ Common framework for batch and stream processing
▲ Abstracts the runner from the processing specification
● Plug and Play runners….if your features are supported
(https://beam.apache.org/documentation/runners/capability
-matrix/)
■ Google Data Flow (v2.0)
▲ Google implementation of an apache beam runner
▲ Manages scaling infrastructure up and down to meet needs
▲ Integrated with stackdriver logging
Introducing Apache Beam and Google Data Flow
3
■ Bounded vs. Unbounded Data
▲ Does the data end?
■ Pipelines
■ Event Time vs. Processing Time
▲ When did the event occur?
▲ When is dataflow processing it?
■ Watermark
▲ How far in event time have we gotten?
Key Concepts
4
PCollection<KV<String, Long>> counts = rows
.apply("extract ride status",
MapElements.into(TypeDescriptor.of(String.class))
.via(x -> x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
** rows is a PCollection<String> representing a collection of ride status strings
Building the Pipeline: What is being computed?
5
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: Where in Event Time?
6
Building the Pipeline: When in Processing Time?
7
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: How do refinements relate?
8
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Let’s see it in action...
9
Pipeline pipeline = Pipeline.create(pipelineOptions);
PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings()
.fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides"))
.withTimestampAttribute("ts"))
.apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class));
PCollection<TableRow> windowedRows = tableRows.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
PCollection<KV<String, Long>> counts = windowedRows
.apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x ->
x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity()))
.apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId));
pipeline.run();
Questions?

More Related Content

What's hot

Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppet
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試Simon Su
 
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKServerless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKKriangkrai Chaonithi
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud FunctionsJerry Jalava
 
Introduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsIntroduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsMalepati Bala Siva Sai Akhil
 
Kubeflow on google kubernetes engine
Kubeflow on google kubernetes engineKubeflow on google kubernetes engine
Kubeflow on google kubernetes engineBear Su
 
AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker kloia
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less opsJoseph Lust
 
GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享Simon Su
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppet
 
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Codemotion
 
Experiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlExperiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlOkis Chuang
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Introduction to Modern DevOps Technologies
Introduction to  Modern DevOps TechnologiesIntroduction to  Modern DevOps Technologies
Introduction to Modern DevOps TechnologiesKriangkrai Chaonithi
 

What's hot (19)

reBuy on Kubernetes
reBuy on KubernetesreBuy on Kubernetes
reBuy on Kubernetes
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, PuppetPuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
PuppetConf 2017: Zero to Kubernetes -Scott Coulton, Puppet
 
JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試JCConf 2016 - Google Dataflow 小試
JCConf 2016 - Google Dataflow 小試
 
Kubeflow control plane
Kubeflow control planeKubeflow control plane
Kubeflow control plane
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OKServerless Big Data Architecture on Google Cloud Platform at Credit OK
Serverless Big Data Architecture on Google Cloud Platform at Credit OK
 
Serverless with Google Cloud Functions
Serverless with Google Cloud FunctionsServerless with Google Cloud Functions
Serverless with Google Cloud Functions
 
Introduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud FunctionsIntroduction to Serverless and Google Cloud Functions
Introduction to Serverless and Google Cloud Functions
 
Kubeflow on google kubernetes engine
Kubeflow on google kubernetes engineKubeflow on google kubernetes engine
Kubeflow on google kubernetes engine
 
Go With The Flow
Go With The FlowGo With The Flow
Go With The Flow
 
AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker AWS ElasticBeanstalk and Docker
AWS ElasticBeanstalk and Docker
 
Serverless Apps on Google Cloud: more dev, less ops
Serverless Apps on Google Cloud:  more dev, less opsServerless Apps on Google Cloud:  more dev, less ops
Serverless Apps on Google Cloud: more dev, less ops
 
GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享GCPUG.TW - GCP學習資源分享
GCPUG.TW - GCP學習資源分享
 
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, PuppetPuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
PuppetConf 2017: Cloud, Containers, Puppet and You- Carl Caum, Puppet
 
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
Life of a startup - Sjoerd Mulder - Codemotion Amsterdam 2017
 
Experiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and PostgresqlExperiences sharing about Lambda, Kinesis, and Postgresql
Experiences sharing about Lambda, Kinesis, and Postgresql
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Introduction to Modern DevOps Technologies
Introduction to  Modern DevOps TechnologiesIntroduction to  Modern DevOps Technologies
Introduction to Modern DevOps Technologies
 

Similar to Through the looking glass an intro to scalable, distributed counting in dataflow

Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...Altinity Ltd
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Dataplumbee
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowTatiana Al-Chueyr
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorYahoo Developer Network
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Sergey Lukjanov
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...spinningmatt
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices Apigee | Google Cloud
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overviewjimliddle
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 

Similar to Through the looking glass an intro to scalable, distributed counting in dataflow (20)

Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Data
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity Calculator
 
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
Atlanta OpenStack Summit: The State of OpenStack Data Processing: Sahara, Now...
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
The state of the art for OpenStack Data Processing (Hadoop on OpenStack) - At...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Giga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching OverviewGiga Spaces Data Grid / Data Caching Overview
Giga Spaces Data Grid / Data Caching Overview
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 

Recently uploaded

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 

Recently uploaded (20)

Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

Through the looking glass an intro to scalable, distributed counting in dataflow

  • 1. Through the Looking Glass @ mabl An intro to scalable, distributed counting using google data flow Geoff Cooney mabl engineer Boston GCP Meetup
  • 3. ■ Apache Beam ▲ Common framework for batch and stream processing ▲ Abstracts the runner from the processing specification ● Plug and Play runners….if your features are supported (https://beam.apache.org/documentation/runners/capability -matrix/) ■ Google Data Flow (v2.0) ▲ Google implementation of an apache beam runner ▲ Manages scaling infrastructure up and down to meet needs ▲ Integrated with stackdriver logging Introducing Apache Beam and Google Data Flow 3
  • 4. ■ Bounded vs. Unbounded Data ▲ Does the data end? ■ Pipelines ■ Event Time vs. Processing Time ▲ When did the event occur? ▲ When is dataflow processing it? ■ Watermark ▲ How far in event time have we gotten? Key Concepts 4
  • 5. PCollection<KV<String, Long>> counts = rows .apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)) .via(x -> x.get("ride_status").toString())) .apply("count total rides", Count.perElement()); ** rows is a PCollection<String> representing a collection of ride status strings Building the Pipeline: What is being computed? 5
  • 6. PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes()); Building the Pipeline: Where in Event Time? 6
  • 7. Building the Pipeline: When in Processing Time? 7 PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes());
  • 8. Building the Pipeline: How do refinements relate? 8 PCollection<TableRow> windowedRows = tableRows .apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes());
  • 9. Let’s see it in action... 9 Pipeline pipeline = Pipeline.create(pipelineOptions); PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings() .fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides")) .withTimestampAttribute("ts")) .apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class)); PCollection<TableRow> windowedRows = tableRows.apply("Window into one hour intervals", Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1))) .triggering(AfterWatermark.pastEndOfWindow() .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5))) .withLateFirings(AfterPane.elementCountAtLeast(1))) .withAllowedLateness(Duration.standardMinutes(30)) .accumulatingFiredPanes()); PCollection<KV<String, Long>> counts = windowedRows .apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x -> x.get("ride_status").toString())) .apply("count total rides", Count.perElement()); counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity())) .apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId)); pipeline.run();