Lightning talk I gave at the GCP Boston meetup as a quick hands-on intro to Google Dataflow. The example is based on the public Pub/Sub topic described here: https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon
3. ■ Apache Beam
▲ Common framework for batch and stream processing
▲ Abstracts the runner from the processing specification
● Plug-and-play runners… if your features are supported
(https://beam.apache.org/documentation/runners/capability-matrix/)
■ Google Dataflow (v2.0)
▲ Google's implementation of an Apache Beam runner
▲ Manages scaling infrastructure up and down to meet demand
▲ Integrated with Stackdriver Logging
Introducing Apache Beam and Google Dataflow
4. ■ Bounded vs. Unbounded Data
▲ Does the data end?
■ Pipelines
■ Event Time vs. Processing Time
▲ When did the event occur?
▲ When is dataflow processing it?
■ Watermark
▲ How far in event time have we gotten?
Key Concepts
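The event-time vs. processing-time distinction, and what a watermark does with it, can be illustrated without Beam at all. Below is a toy plain-Java model (the class and method names are hypothetical, not Beam API): arrival order stands in for processing time, the watermark is naively "the latest event time seen so far", and anything that arrives behind the watermark is late.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy model of event time, processing time, and a watermark.
public class EventTimeDemo {

    // An event carries the time it occurred (event time); its position in the
    // input list stands in for processing time (when we see it).
    record Event(String id, long eventTime) {}

    // A naive watermark: "we have probably seen everything up to the maximum
    // event time observed so far". Events behind the watermark are late.
    public static List<String> lateEvents(List<Event> arrivals) {
        List<String> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (Event e : arrivals) {
            if (e.eventTime() < watermark) {
                late.add(e.id()); // occurred before where we think we are in event time
            }
            watermark = Math.max(watermark, e.eventTime());
        }
        return late;
    }

    public static void main(String[] args) {
        // "b" arrives after "c" but occurred earlier, so it is late
        // relative to the watermark.
        List<Event> arrivals = List.of(
                new Event("a", 100), new Event("c", 300), new Event("b", 200));
        System.out.println(lateEvents(arrivals)); // prints [b]
    }
}
```

A real watermark is a heuristic estimate maintained by the runner (here, by Dataflow from the Pub/Sub timestamp attribute), not a simple running maximum, but the "how far in event time have we gotten?" question is the same.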
5. PCollection<KV<String, Long>> counts = rows
.apply("extract ride status",
MapElements.into(TypeDescriptor.of(String.class))
.via(x -> x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
** rows is a PCollection<Map> of parsed taxi ride events; each map contains a ride_status field
Building the Pipeline: What is being computed?
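What the two transforms above compute is, per window, just group-and-count. A plain-Java-streams sketch of the same logic (class name CountDemo is hypothetical, not part of the talk's code):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: the MapElements + Count.perElement logic in plain Java.
public class CountDemo {

    // Extract ride_status from each event, then count occurrences of each value.
    public static Map<String, Long> countPerElement(List<Map<String, Object>> rows) {
        return rows.stream()
                .map(x -> x.get("ride_status").toString())   // "extract ride status"
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.counting()));             // "count total rides"
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.of("ride_status", "pickup"),
                Map.of("ride_status", "enroute"),
                Map.of("ride_status", "pickup"));
        System.out.println(countPerElement(rows)); // {pickup=2, enroute=1} (order may vary)
    }
}
```

The difference in Beam is that Count.perElement runs distributed across workers and is scoped by whatever windowing strategy is in effect, which the next slides add.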
6. PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
Building the Pipeline: Where in Event Time?
7. Building the Pipeline: When in Processing Time?
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
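The trigger above produces a sequence of panes for each window: early panes while the watermark is still inside the window, one on-time pane when the watermark passes the end of the window, and late panes for stragglers after that. A toy timeline of those labels (class TriggerDemo is hypothetical, not Beam API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the pane labels one window sees under the trigger above.
public class TriggerDemo {

    // Each firing happens when the watermark has reached some event time.
    // Firings before the watermark passes the end of the window are EARLY,
    // the first firing at/after it is ON_TIME, and later firings are LATE.
    public static List<String> paneLabels(long windowEnd, List<Long> watermarkAtFiring) {
        List<String> labels = new ArrayList<>();
        boolean onTimeFired = false;
        for (long wm : watermarkAtFiring) {
            if (wm < windowEnd) {
                labels.add("EARLY");     // withEarlyFirings(... 5s processing-time delay)
            } else if (!onTimeFired) {
                labels.add("ON_TIME");   // AfterWatermark.pastEndOfWindow()
                onTimeFired = true;
            } else {
                labels.add("LATE");      // withLateFirings(elementCountAtLeast(1))
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Window ends at event time 3600; watermark position at each firing:
        System.out.println(paneLabels(3600, List.of(1200L, 2400L, 3600L, 3700L)));
        // prints [EARLY, EARLY, ON_TIME, LATE]
    }
}
```

In the real pipeline, late firings stop once the watermark passes the window end plus the 30-minute allowed lateness; data arriving after that is dropped.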
8. Building the Pipeline: How do refinements relate?
PCollection<TableRow> windowedRows = tableRows
.apply("Window into one hour intervals",
Window.<TableRow>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
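The choice in the last line is how successive panes relate: accumulatingFiredPanes() makes each pane include everything seen so far for the window, while discardingFiredPanes() would emit only the elements that arrived since the previous firing. A plain-Java sketch of the difference for the ride counts (class PaneModeDemo is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: accumulating vs. discarding pane refinements.
public class PaneModeDemo {

    // Given the elements that arrive between firings, return the count
    // each pane would emit.
    public static List<Long> emittedCounts(List<List<String>> arrivalsPerPane,
                                           boolean accumulating) {
        List<Long> emitted = new ArrayList<>();
        long runningTotal = 0;
        for (List<String> batch : arrivalsPerPane) {
            runningTotal += batch.size();
            // accumulatingFiredPanes(): each pane repeats everything so far;
            // discardingFiredPanes(): each pane has only the new elements.
            emitted.add(accumulating ? runningTotal : (long) batch.size());
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<String>> panes = List.of(
                List.of("pickup", "pickup"),   // early pane
                List.of("pickup"),             // on-time pane
                List.of("pickup"));            // late pane
        System.out.println(emittedCounts(panes, true));  // [2, 3, 4]
        System.out.println(emittedCounts(panes, false)); // [2, 1, 1]
    }
}
```

Accumulating mode suits this pipeline because each Datastore write overwrites the previous total; discarding mode would suit a downstream sink that sums deltas itself.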
9. Let’s see it in action...
Pipeline pipeline = Pipeline.create(pipelineOptions);
PCollection<Map> tableRows = pipeline.apply(PubsubIO.readStrings()
.fromSubscription(String.format("projects/%s/subscriptions/%s", projectId, "geoff-taxirides"))
.withTimestampAttribute("ts"))
.apply("Parse input", ParseJsons.of(Map.class)).setCoder(AvroCoder.of(Map.class));
PCollection<Map> windowedRows = tableRows.apply("Window into one hour intervals",
Window.<Map>into(FixedWindows.of(Duration.standardHours(1)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5)))
.withLateFirings(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.standardMinutes(30))
.accumulatingFiredPanes());
PCollection<KV<String, Long>> counts = windowedRows
.apply("extract ride status", MapElements.into(TypeDescriptor.of(String.class)).via(x ->
x.get("ride_status").toString()))
.apply("count total rides", Count.perElement());
counts.apply("Convert to Datastore Entities", ParDo.of(new CountsToEntity()))
.apply("Write to Data Store", DatastoreIO.v1().write().withProjectId(projectId));
pipeline.run();