Lightning talk I gave at the GCP Boston meetup as a quick hands-on intro to Google Dataflow. The example is based on the public Pub/Sub topic described here: https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon
3. ■ Apache Beam
▲ Common framework for batch and stream processing
▲ Abstracts the runner from the processing specification
● Plug-and-play runners… if your features are supported
(https://beam.apache.org/documentation/runners/capability-matrix/)
■ Google Dataflow (v2.0)
▲ Google's implementation of an Apache Beam runner
▲ Manages scaling infrastructure up and down to meet demand
▲ Integrated with Stackdriver Logging
Introducing Apache Beam and Google Dataflow
4. ■ Bounded vs. Unbounded Data
▲ Does the data end?
■ Pipelines
■ Event Time vs. Processing Time
▲ When did the event occur?
▲ When is dataflow processing it?
■ Watermark
▲ How far in event time have we gotten?
Key Concepts
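The event-time vs. processing-time distinction, and what a watermark does with it, can be illustrated without Beam at all. Below is a toy plain-Java model (the class and method names are hypothetical, not Beam API): arrival order stands in for processing time, the watermark is naively "the latest event time seen so far", and anything that arrives behind the watermark is late.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy model of event time, processing time, and a watermark.
public class EventTimeDemo {

    // An event carries the time it occurred (event time); its position in the
    // input list stands in for processing time (when we see it).
    record Event(String id, long eventTime) {}

    // A naive watermark: "we have probably seen everything up to the maximum
    // event time observed so far". Events behind the watermark are late.
    public static List<String> lateEvents(List<Event> arrivals) {
        List<String> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (Event e : arrivals) {
            if (e.eventTime() < watermark) {
                late.add(e.id()); // occurred before where we think we are in event time
            }
            watermark = Math.max(watermark, e.eventTime());
        }
        return late;
    }

    public static void main(String[] args) {
        // "b" arrives after "c" but occurred earlier, so it is late
        // relative to the watermark.
        List<Event> arrivals = List.of(
                new Event("a", 100), new Event("c", 300), new Event("b", 200));
        System.out.println(lateEvents(arrivals)); // prints [b]
    }
}
```

A real watermark is a heuristic estimate maintained by the runner (here, by Dataflow from the Pub/Sub timestamp attribute), not a simple running maximum, but the "how far in event time have we gotten?" question is the same.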
5. PCollection<KV<String, Long>> counts = rows
.apply("extract ride status",
MapElements.into(TypeDescriptor.of(String.class))
.via(x -> x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
** rows is a PCollection<Map> of parsed taxi ride events; each map contains a ride_status field
Building the Pipeline: What is being computed?
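What the two transforms above compute is, per window, just group-and-count. A plain-Java-streams sketch of the same logic (class name CountDemo is hypothetical, not part of the talk's code):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: the MapElements + Count.perElement logic in plain Java.
public class CountDemo {

    // Extract ride_status from each event, then count occurrences of each value.
    public static Map<String, Long> countPerElement(List<Map<String, Object>> rows) {
        return rows.stream()
                .map(x -> x.get("ride_status").toString())   // "extract ride status"
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.counting()));             // "count total rides"
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.of("ride_status", "pickup"),
                Map.of("ride_status", "enroute"),
                Map.of("ride_status", "pickup"));
        System.out.println(countPerElement(rows)); // {pickup=2, enroute=1} (order may vary)
    }
}
```

The difference in Beam is that Count.perElement runs distributed across workers and is scoped by whatever windowing strategy is in effect, which the next slides add.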
6. PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: Where in Event Time?
7. Building the Pipeline: When in Processing Time?
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
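The trigger above produces a sequence of panes for each window: early panes while the watermark is still inside the window, one on-time pane when the watermark passes the end of the window, and late panes for stragglers after that. A toy timeline of those labels (class TriggerDemo is hypothetical, not Beam API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the pane labels one window sees under the trigger above.
public class TriggerDemo {

    // Each firing happens when the watermark has reached some event time.
    // Firings before the watermark passes the end of the window are EARLY,
    // the first firing at/after it is ON_TIME, and later firings are LATE.
    public static List<String> paneLabels(long windowEnd, List<Long> watermarkAtFiring) {
        List<String> labels = new ArrayList<>();
        boolean onTimeFired = false;
        for (long wm : watermarkAtFiring) {
            if (wm < windowEnd) {
                labels.add("EARLY");     // withEarlyFirings(... 5s processing-time delay)
            } else if (!onTimeFired) {
                labels.add("ON_TIME");   // AfterWatermark.pastEndOfWindow()
                onTimeFired = true;
            } else {
                labels.add("LATE");      // withLateFirings(elementCountAtLeast(1))
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Window ends at event time 3600; watermark position at each firing:
        System.out.println(paneLabels(3600, List.of(1200L, 2400L, 3600L, 3700L)));
        // prints [EARLY, EARLY, ON_TIME, LATE]
    }
}
```

In the real pipeline, late firings stop once the watermark passes the window end plus the 30-minute allowed lateness; data arriving after that is dropped.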
8. Building the Pipeline: How do refinements relate?
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
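The choice in the last line is how successive panes relate: accumulatingFiredPanes() makes each pane include everything seen so far for the window, while discardingFiredPanes() would emit only the elements that arrived since the previous firing. A plain-Java sketch of the difference for the ride counts (class PaneModeDemo is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: accumulating vs. discarding pane refinements.
public class PaneModeDemo {

    // Given the elements that arrive between firings, return the count
    // each pane would emit.
    public static List<Long> emittedCounts(List<List<String>> arrivalsPerPane,
                                           boolean accumulating) {
        List<Long> emitted = new ArrayList<>();
        long runningTotal = 0;
        for (List<String> batch : arrivalsPerPane) {
            runningTotal += batch.size();
            // accumulatingFiredPanes(): each pane repeats everything so far;
            // discardingFiredPanes(): each pane has only the new elements.
            emitted.add(accumulating ? runningTotal : (long) batch.size());
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<String>> panes = List.of(
                List.of("pickup", "pickup"),   // early pane
                List.of("pickup"),             // on-time pane
                List.of("pickup"));            // late pane
        System.out.println(emittedCounts(panes, true));  // [2, 3, 4]
        System.out.println(emittedCounts(panes, false)); // [2, 1, 1]
    }
}
```

Accumulating mode suits this pipeline because each Datastore write overwrites the previous total; discarding mode would suit a downstream sink that sums deltas itself.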
9. Let’s see it in action...
Pipeline pipeline = Pipeline.create(pipelineOptions);
PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings()
.fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides"))
.withTimestampAttribute("ts"))
.apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class));
PCollection<Map> windowedRows = tableRows.apply("Window into one hour intervals",
Window.<Map>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
PCollection<KV<String, Long>> counts = windowedRows
.apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x ->
x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity()))
.apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId));
pipeline.run();