Sep 2015
Google Dataflow
introduction
iglushkov@machinezone.com
Google Dataflow Intro

Main concepts of Google Dataflow. Pipelines, Windowing, Triggers, Late Data, etc.

Published in: Software

  1. Google Dataflow introduction, Sep 2015, iglushkov@machinezone.com
  2. What is Google Dataflow
     ❖ Data processing system: batch and streaming
     ❖ Set of SDKs
     ❖ Google Cloud Platform managed services:
        ❖ Google Compute Engine (VMs)
        ❖ Google Cloud Storage (r/w data)
        ❖ BigQuery (r/w data)
  3. Programming Model
     ❖ Pipeline - entire series of computations
     ❖ PCollection - set of data in a pipeline
     ❖ Transform - any data processing operation
     ❖ Pipeline I/O - data source and data sink APIs
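
The four terms on this slide can be illustrated with a small plain-Python model. This is only a sketch of the concepts, not the Dataflow SDK: `read_source`, `to_lengths`, `keep_even`, and `run_pipeline` are hypothetical names invented for the example, and a Python iterable stands in for a PCollection.

```python
from typing import Callable, Iterable, List

# Plain-Python model of the four concepts; NOT the Dataflow SDK.
# A "PCollection" is modeled as any iterable of elements.
Transform = Callable[[Iterable], Iterable]


def read_source(lines: List[str]) -> Iterable[str]:
    """Pipeline I/O (source): feeds elements into the pipeline."""
    return iter(lines)


def to_lengths(pcoll: Iterable[str]) -> Iterable[int]:
    """Transform: produces a new collection, leaving the input untouched."""
    return (len(s) for s in pcoll)


def keep_even(pcoll: Iterable[int]) -> Iterable[int]:
    """Transform: filters elements."""
    return (n for n in pcoll if n % 2 == 0)


def run_pipeline(source: Iterable, transforms: List[Transform]) -> list:
    """Pipeline: the whole series of computations; the sink here is a list."""
    pcoll = source
    for transform in transforms:
        pcoll = transform(pcoll)
    return list(pcoll)


result = run_pipeline(read_source(["ab", "abc", "abcd"]), [to_lengths, keep_even])
print(result)  # [2, 4]
```

Generators are used so that nothing is computed until the sink asks for it, which loosely mirrors the "deferred data" idea mentioned later in the deck.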
  4. Pipeline
     ❖ Data + Transforms
     ❖ Branching + merging
     ❖ Multiple sources
     ❖ Unit testing + Integration testing
     ❖ Pipeline Execution Parameters (local/prod)
     ❖ Defines where data comes from, what it looks like, what to do with it, and where to store it
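
Branching and merging can be pictured as one input feeding two independent transforms whose outputs are then flattened into a single collection. The sketch below uses plain Python lists in place of PCollections; `branch_and_merge` is a hypothetical helper, not an SDK call.

```python
from typing import List


def branch_and_merge(numbers: List[int]) -> List[str]:
    """Feed one input into two branches, then merge the branch outputs."""
    # Branching: both branches read the same input and never mutate it.
    evens = [f"even:{n}" for n in numbers if n % 2 == 0]
    odds = [f"odd:{n}" for n in numbers if n % 2 == 1]
    # Merging: concatenate the branch outputs into one collection.
    return evens + odds


merged = branch_and_merge([1, 2, 3, 4])
print(merged)  # ['even:2', 'even:4', 'odd:1', 'odd:3']
```

Because each branch is a pure function of its input, it can be unit tested on its own before the whole pipeline is tested end to end.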
  5. PCollection
     ❖ Represents data in a pipeline from any source
     ❖ Potentially unbounded (stream)
     ❖ Serializable, immutable, no random access to elements
     ❖ Deferred data (may not have been computed yet)
     ❖ Windowing, triggers
  6. Windowing
     ❖ Window - a logical subdivision of a PCollection
     ❖ Each element is assigned to one or more windows
     ❖ Fixed time windows
     ❖ Sliding time windows
     ❖ Per-session windows
     ❖ Single global window
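
How an element ends up in "one or more windows" is easiest to see with integer timestamps. The following is a minimal sketch in plain Python, assuming event times in seconds and half-open windows `[start, end)`; the function names and parameters (`size`, `period`, `gap`) are chosen for the example and are not SDK identifiers.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in event-time seconds


def fixed_windows(ts: int, size: int) -> List[Window]:
    """Fixed windows: every element falls into exactly one window."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Sliding windows: a new window starts every `period` seconds, so an
    element belongs to every window whose interval contains its timestamp."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Session windows for one key: elements closer together than `gap`
    are merged into the same session."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Falls inside the current session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


print(fixed_windows(125, size=60))               # [(120, 180)]
print(sliding_windows(125, size=60, period=30))  # [(90, 150), (120, 180)]
print(session_windows([0, 10, 50, 55], gap=20))  # [(0, 30), (50, 75)]
```

The single global window needs no function: every element is assigned to the same window regardless of its timestamp.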
  7. Late Data
     ❖ Event time / Processing time
     ❖ No order guarantee
     ❖ No consistent delta between Event and Processing time
     ❖ Watermark
     ❖ Late data
     ❖ Triggers to refine windowing, data reporting time
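
The relationship between event time, the watermark, and late data can be sketched with a toy heuristic. The example below assumes the watermark trails the largest event time seen so far by a fixed `max_skew`; that rule, and the names `classify` and `max_skew`, are assumptions made for illustration and not how the Dataflow service computes its watermark.

```python
from typing import List, Tuple


def classify(arrivals: List[Tuple[str, int]], max_skew: int) -> List[Tuple[str, int, bool]]:
    """Mark each element as on time or late.

    `arrivals` lists (element, event_time) pairs in processing-time order.
    The watermark is a guess that no element older than it will still arrive;
    an element whose event time is behind the watermark on arrival is late.
    """
    watermark = float("-inf")
    out = []
    for element, event_time in arrivals:
        is_late = event_time < watermark
        out.append((element, event_time, is_late))
        # Toy heuristic (assumption): watermark = max event time seen - max_skew.
        watermark = max(watermark, event_time - max_skew)
    return out


arrivals = [("a", 10), ("b", 12), ("c", 30), ("d", 11), ("e", 29)]
for element, event_time, is_late in classify(arrivals, max_skew=5):
    print(element, event_time, "late" if is_late else "on time")
# Only "d" is late: it arrives after "c" pushed the watermark to 25.
```

Element "e" is also out of order, but it is within the allowed skew, so it still counts as on time. This is the gap that triggers are meant to manage.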
  8. Triggers
     ❖ Enough data for the window -> aggregate result: “pane”
     ❖ Help handle late data
     ❖ Time-based triggers
     ❖ Data-driven triggers (e.g. a certain number of elements is enough)
     ❖ Composite triggers: OR, AND - operations on triggers
     ❖ Window accumulation modes: accumulate/discard the previous “panes”
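
The difference between the two accumulation modes shows up as soon as a window fires more than once. Below is a minimal sketch of a data-driven trigger that fires after every `every_n` elements of a single window and emits the sum as a pane; `fire_panes` and its parameters are hypothetical names for this example, not SDK identifiers.

```python
from typing import List


def fire_panes(values: List[int], every_n: int, accumulate: bool) -> List[int]:
    """Emit a sum "pane" each time `every_n` more elements have arrived.

    accumulate=True  -> each pane covers everything seen so far in the window.
    accumulate=False -> each pane covers only elements since the last firing.
    """
    panes: List[int] = []
    buffer: List[int] = []
    for count, value in enumerate(values, start=1):
        buffer.append(value)
        if count % every_n == 0:  # data-driven trigger condition
            panes.append(sum(buffer))
            if not accumulate:
                buffer = []  # discarding mode: forget the fired elements
    return panes


window = [1, 2, 3, 4, 5, 6]
print(fire_panes(window, every_n=2, accumulate=True))   # [3, 10, 21]
print(fire_panes(window, every_n=2, accumulate=False))  # [3, 7, 11]
```

In accumulating mode a consumer can simply keep the latest pane; in discarding mode the consumer has to add the panes together itself. Elements left over after the last firing are not emitted in this sketch.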
  9. Transforms
     ❖ Math, convert format, grouping, filtering, combining
     ❖ [PCollection] -> [PCollection]
     ❖ Core Transforms: ParDo, GroupByKey, Combine, …
     ❖ Functions with business logic to apply: Serializable, Thread-compatible, Idempotent
     ❖ Composite Transforms
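
The three core transforms named on this slide can be approximated in a few lines of plain Python to show what each one does to the data. These are single-machine stand-ins written for illustration, not the SDK's `ParDo`, `GroupByKey`, or `Combine`; `par_do`, `group_by_key`, `combine_per_key`, and `split_words` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def par_do(pcoll: Iterable, fn: Callable) -> list:
    """ParDo-like step: apply `fn` to each element; `fn` may return
    zero, one, or many outputs per input."""
    return [out for element in pcoll for out in fn(element)]


def group_by_key(pcoll: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """GroupByKey-like step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pcoll:
        groups[key].append(value)
    # Sorted only to make the example output deterministic.
    return sorted(groups.items())


def combine_per_key(grouped: Iterable[Tuple[str, List[int]]], combiner: Callable) -> list:
    """Combine-like step: reduce each key's values to a single result."""
    return [(key, combiner(values)) for key, values in grouped]


def split_words(line: str) -> List[Tuple[str, int]]:
    """Business logic: pure and stateless, so re-running it on the same
    input is harmless (the idempotence the slide asks for)."""
    return [(word.lower(), 1) for word in line.split()]


lines = ["a b a", "B c"]
counts = combine_per_key(group_by_key(par_do(lines, split_words)), sum)
print(counts)  # [('a', 2), ('b', 2), ('c', 1)]
```

Chaining the three calls into one function would give a composite transform in the sense of the last bullet: several transforms packaged and reused as one.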
  10. Pipeline I/O
     ❖ Read/Write from/to external sources
     ❖ Text Files in Google Cloud Storage or local FS
     ❖ BigQuery tables
     ❖ Google Cloud PubSub
     ❖ Custom Sources and Sinks
  11. Extra
     ❖ Parallelization, distribution, optimization, scaling
     ❖ Dataflow monitoring UI and CLI
     ❖ Logging
     ❖ Unit testing (locally) of any Fn, and end-to-end testing
     ❖ Introspection toolchain
     ❖ Update toolchain: for code, windowing configs
  12. Questions?
