Through the looking glass: an intro to scalable, distributed counting in Dataflow

Lightning talk I gave at the GCP Boston meetup: a quick, hands-on intro to Google Cloud Dataflow. The example is based on the public Pub/Sub topic described here: https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon

Through the Looking Glass @ mabl
An intro to scalable, distributed counting using Google Cloud Dataflow
Geoff Cooney, mabl engineer
Boston GCP Meetup

Introducing Apache Beam and Google Cloud Dataflow

■ Apache Beam
▲ Common framework for batch and stream processing
▲ Abstracts the runner from the processing specification
● Plug-and-play runners... if your features are supported (https://beam.apache.org/documentation/runners/capability-matrix/)
■ Google Cloud Dataflow (v2.0)
▲ Google's implementation of an Apache Beam runner
▲ Manages scaling infrastructure up and down to meet demand
▲ Integrated with Stackdriver logging

Key Concepts

■ Bounded vs. Unbounded Data
▲ Does the data end?
■ Pipelines
■ Event Time vs. Processing Time
▲ When did the event occur?
▲ When is Dataflow processing it?
■ Watermark
▲ How far in event time have we gotten?
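To make the event-time and watermark ideas concrete, here is a minimal plain-Java sketch with no Beam dependency. The class and method names are hypothetical, and the watermark is modeled as a single instant that event time is believed to have reached, which simplifies how a real runner estimates it.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration of event time, processing time, and a watermark. */
final class EventTimeSketch {

    /** Skew between when the event occurred and when the pipeline processed it. */
    static Duration skew(Instant eventTime, Instant processingTime) {
        return Duration.between(eventTime, processingTime);
    }

    /** True if the element's event time is already behind the watermark. */
    static boolean isBehindWatermark(Instant eventTime, Instant watermark) {
        return eventTime.isBefore(watermark);
    }

    public static void main(String[] args) {
        // Sample instants are made up for illustration.
        Instant occurred = Instant.parse("2018-01-01T10:59:50Z");
        Instant processed = Instant.parse("2018-01-01T11:00:20Z");
        Instant watermark = Instant.parse("2018-01-01T11:00:00Z");

        System.out.println("skew = " + skew(occurred, processed));
        System.out.println("behind watermark = " + isBehindWatermark(occurred, watermark));
    }
}
```

An element that shows up behind the watermark is a candidate for late-data handling, which is what the trigger and allowed-lateness settings in the pipeline control.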

Building the Pipeline: What is being computed?

PCollection<KV<String, Long>> counts = rows
    // Pull the ride_status field out of each parsed ride record.
    .apply("extract ride status",
        MapElements.into(TypeDescriptor.of(String.class))
            .via(x -> x.get("ride_status").toString()))
    // Count how many times each distinct status occurs.
    .apply("count total rides", Count.perElement());

** rows is a PCollection of parsed ride records; after the first step it becomes a PCollection<String> of ride status strings
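What Count.perElement() computes can be sketched on an in-memory list in plain Java. This is only an illustration of the result shape (one count per distinct value), not how Beam distributes the work; the class name and the sample status values are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-machine analogue of counting occurrences per element. */
final class CountPerElementSketch {

    /** Returns each distinct status paired with the number of times it appears. */
    static Map<String, Long> countPerElement(List<String> statuses) {
        Map<String, Long> counts = new TreeMap<>();
        for (String status : statuses) {
            counts.merge(status, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample values are assumed for illustration.
        List<String> statuses =
            Arrays.asList("pickup", "enroute", "enroute", "dropoff", "enroute");
        System.out.println(countPerElement(statuses));
        // prints {dropoff=1, enroute=3, pickup=1}
    }
}
```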

Building the Pipeline: Where in Event Time?

PCollection<TableRow> windowedRows = tableRows
    .apply("Window into one hour intervals",
        // Where: assign each element to a fixed one-hour window based on its event time.
        Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(5)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes());
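The arithmetic behind fixed windows can be sketched in a few lines of plain Java. The sketch assumes windows aligned to the Unix epoch with no offset, and treats the window end as exclusive; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration of assigning an event timestamp to a fixed window. */
final class FixedWindowSketch {

    /** Start of the epoch-aligned window of the given size that contains eventTime. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch too.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    /** Exclusive end of the window that contains eventTime. */
    static Instant windowEnd(Instant eventTime, Duration size) {
        return windowStart(eventTime, size).plus(size);
    }

    public static void main(String[] args) {
        Instant ride = Instant.parse("2018-01-01T10:42:17Z");
        Duration oneHour = Duration.ofHours(1);
        System.out.println(windowStart(ride, oneHour) + " to " + windowEnd(ride, oneHour));
    }
}
```

Every ride with an event time between 10:00 and 11:00 lands in the same window, regardless of when it is actually processed.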

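Building the Pipeline: When in Processing Time?

PCollection<TableRow> windowedRows = tableRows
    .apply("Window into one hour intervals",
        Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
            // When: emit the on-time result once the watermark passes the end of the window,
            .triggering(AfterWatermark.pastEndOfWindow()
                // early results 5 seconds (processing time) after the first element in a pane,
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(5)))
                // and an updated result for every late element.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Late elements are accepted for up to 30 minutes past the end of the window.
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes());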
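One way to reason about this trigger is to ask which phase a window is in for a given watermark position. The plain-Java sketch below is a simplified model under stated assumptions: the phase names, class, and method are hypothetical, and the boundary handling is approximate compared with what a real runner does.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of a window's lifecycle relative to the watermark. */
final class TriggerPhaseSketch {

    enum Phase {
        EARLY,   // watermark has not reached the window end: only early firings
        LATE,    // watermark passed the end, still within allowed lateness: late firings
        EXPIRED  // allowed lateness exceeded: further elements for this window are dropped
    }

    static Phase phaseFor(Instant windowEnd, Instant watermark, Duration allowedLateness) {
        if (watermark.isBefore(windowEnd)) {
            return Phase.EARLY;
        }
        if (watermark.isBefore(windowEnd.plus(allowedLateness))) {
            return Phase.LATE;
        }
        return Phase.EXPIRED;
    }

    public static void main(String[] args) {
        Instant windowEnd = Instant.parse("2018-01-01T11:00:00Z");
        Duration lateness = Duration.ofMinutes(30);
        String[] watermarks = {
            "2018-01-01T10:30:00Z", "2018-01-01T11:10:00Z", "2018-01-01T11:45:00Z"
        };
        for (String w : watermarks) {
            System.out.println(w + " -> " + phaseFor(windowEnd, Instant.parse(w), lateness));
        }
    }
}
```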

Building the Pipeline: How do refinements relate?

PCollection<TableRow> windowedRows = tableRows
    .apply("Window into one hour intervals",
        Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(5)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(30))
            // How: each firing emits the full result accumulated so far for the window,
            // not just the elements that arrived since the previous firing.
            .accumulatingFiredPanes());
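The difference between accumulating and discarding panes is easiest to see with numbers. This plain-Java sketch takes the number of new elements seen before each firing of one window and shows what a count would emit under each mode; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical comparison of accumulating and discarding pane output for a count. */
final class PaneModeSketch {

    /** Accumulating: every firing reports the running total for the window. */
    static List<Long> accumulating(long[] newElementsPerFiring) {
        List<Long> emitted = new ArrayList<>();
        long total = 0;
        for (long n : newElementsPerFiring) {
            total += n;
            emitted.add(total);
        }
        return emitted;
    }

    /** Discarding: every firing reports only what arrived since the last firing. */
    static List<Long> discarding(long[] newElementsPerFiring) {
        List<Long> emitted = new ArrayList<>();
        for (long n : newElementsPerFiring) {
            emitted.add(n);
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] arrivals = {3, 2, 4};
        System.out.println("accumulating: " + accumulating(arrivals)); // [3, 5, 9]
        System.out.println("discarding:   " + discarding(arrivals));   // [3, 2, 4]
    }
}
```

With accumulating panes, a sink that overwrites the previous value for the same key and window always holds the latest total; with discarding panes, the consumer would have to sum the pieces itself.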

Let’s see it in action...

Pipeline pipeline = Pipeline.create(pipelineOptions);

// Read ride messages from Pub/Sub, using the "ts" attribute as the event timestamp,
// then parse each JSON payload into a Map.
PCollection<Map> tableRows = pipeline
    .apply(PubsubIO.readStrings()
        .fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides"))
        .withTimestampAttribute("ts"))
    .apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class));

// Window the parsed rows (still Map elements) into one-hour event-time windows.
PCollection<Map> windowedRows = tableRows
    .apply("Window into one hour intervals",
        Window.<Map>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(5)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes());

// Count rides per status within each window.
PCollection<KV<String, Long>> counts = windowedRows
    .apply("extract ride status",
        MapElements.into(TypeDescriptor.of(String.class))
            .via(x -> x.get("ride_status").toString()))
    .apply("count total rides", Count.perElement());

// Convert each count to a Datastore entity and write it out.
counts
    .apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity()))
    .apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId));

pipeline.run();


Counting is hard


Questions?
