Data Science | Design | Technology
(November 21, 2017)
https://www.meetup.com/DSDTMTL
Agenda
6:00 - 6:15: Welcome
6:15 - 6:50: Machine Learning through Dataflow
6:50 - 7:30: Building a Streaming Pipeline
7:30 - 8:00: Q&A + Networking
Data Streaming and Machine Learning
with Google Cloud Platform
Maxime Legault-Venne
Software Architect, JDA Labs
Arsho Toubi
Customer Engineer, Google Cloud
Data Streaming and Machine Learning
with Google Cloud Platform
Topic #1
Machine Learning through Google Cloud Platform’s Dataflow
Data Science | Design | Technology
(Max)
Machine Learning Through
Google Cloud Platform’s Dataflow
WHAT’S OUR AIM? Predict how well an item will sell based on past performance of
similar items.
HOW DO WE PROCEED? Machine Learning
1. Train on existing data to learn how item attributes (features) affect item performance
2. Use the trained model to predict how unknown items will perform, based on their attributes (features)
MACHINE LEARNING
Training Process
● Generate features for every item (e.g. color, brand, pattern)
● Shuffle the data
● Split the data into training entries (training set + test set)
● Generate hyperparameters from a predefined search space
● Combine hyperparameters and training entries
● Train a model for each combination
○ Evaluate the out-of-sample error
● Combine the errors for each hyperparameter set
● Rank the hyperparameter sets by error to select the best
● Train on all data with the best hyperparameters and keep the trained model for prediction
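A minimal sketch of this training loop in plain Python. The helpers make_features, train_model, and error are hypothetical placeholders for the team's actual feature generation, model, and error metric, and the split below is a simple K-fold-style split rather than the exact one used:

```python
import itertools
import random

def run_training(items, search_space, n_entries=5):
    """Sketch: shuffle, split into training entries, train one model per
    (hyperparameters, entry) combination, rank by combined out-of-sample
    error, then retrain on all data with the best hyperparameter set."""
    # Generate features for every item (e.g. color, brand, pattern)
    data = [(make_features(item), item.sales) for item in items]   # hypothetical make_features
    random.shuffle(data)

    # Split the data into training entries (training set + test set)
    chunk = len(data) // n_entries
    entries = [(data[:i] + data[i + chunk:], data[i:i + chunk])
               for i in range(0, chunk * n_entries, chunk)]

    # Generate hyperparameter combinations from a predefined search space
    keys = sorted(search_space)
    hyperparams = [dict(zip(keys, values))
                   for values in itertools.product(*(search_space[k] for k in keys))]

    # Train a model for each combination and combine the errors per hyperparameter set
    ranked = []
    for hp in hyperparams:
        errors = [error(train_model(train, hp), test)              # hypothetical train_model / error
                  for train, test in entries]
        ranked.append((sum(errors) / len(errors), hp))

    # Rank the hyperparameter sets, keep the best, and train on all data
    _, best_hp = min(ranked, key=lambda pair: pair[0])
    return train_model(data, best_hp)
```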
TRAINING ENTRIES
Splitting Data
(Diagram: 3 months of data split into training entries — a training set plus a test set — used to get model errors, and the final training data set.)
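A small sketch of a time-based split, under the assumption that records carry a date attribute and that the most recent slice is held out as the test set (the exact split the team used is only shown as a diagram):

```python
from datetime import timedelta

def split_records(records, test_days=30):
    """Hold out the most recent `test_days` of a time-ordered dataset as the
    test set; everything earlier becomes the training set."""
    records = sorted(records, key=lambda r: r.date)
    cutoff = records[-1].date - timedelta(days=test_days)
    training_set = [r for r in records if r.date <= cutoff]
    test_set = [r for r in records if r.date > cutoff]
    return training_set, test_set
```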
WHAT’S THE PROBLEM?
Multiple combinations to train
Ex.: For 700 combinations of hyperparameters and training entries,
training took around 12 hours to complete.
HOW TO SOLVE IT? - Parallelize by running all trainings concurrently
- Need a lot of processing power
- Need a lot of memory
APACHE BEAM - Implementation agnostic & open source
- Java & Python SDKs
- Lets you build pipelines
https://beam.apache.org/documentation/runners/capability-matrix/
PIPELINE BASICS 1. Create streaming or batch pipeline
2. Read data from various sources
○ Files
○ Databases
○ Cloud solutions
○ Code-generated data
○ ...
3. Apply transforms to process data
4. Write or output final pipeline data
5. Run on a specific runner
○ Direct (locally on your machine)
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump
https://beam.apache.org/documentation/pipelines/create-your-pipeline/
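Putting these steps together, a minimal Beam pipeline in the Python SDK; the combinations list and the train_and_score function are illustrative placeholders, not the actual JDA Labs code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative combinations of hyperparameters and training entries
combinations = [({'max_depth': 3}, 'entry-0'), ({'max_depth': 5}, 'entry-1')]

def train_and_score(combination):
    """Hypothetical: train one model for this combination and return its error."""
    hyperparams, entry = combination
    return hyperparams, entry, 0.0   # real training/evaluation goes here

# 1. Create a batch pipeline (5. the runner is chosen via PipelineOptions)
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # 2. Read data from a source -- here, code-generated combinations
     | 'CreateCombinations' >> beam.Create(combinations)
     # 3. Apply a transform that trains and evaluates each combination
     | 'TrainAndScore' >> beam.Map(train_and_score)
     # 4. Write or output the final pipeline data
     | 'WriteResults' >> beam.io.WriteToText('results'))
```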
EXECUTING IN DATAFLOW
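As a hedged sketch, running the same pipeline on the Cloud Dataflow service only changes the pipeline options; the project, region, bucket, and job name below are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder project id
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',  # placeholder staging bucket
    '--job_name=train-700-combinations',
])
```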
RESULTS Testing 700 combinations of hyperparameters and training entries
The previous sequential execution took 12 hours to process
The whole Dataflow job took a bit less than 28 minutes
Topic #2
Building a Dataflow streaming pipeline for sentiment analysis on Twitter
Data Science | Design | Technology
(Arsho)
Streaming pipelines with
Google Cloud Dataflow
Agenda
Trade-offs and challenges in Big Data
Apache Beam SDK
Cloud Dataflow Service
Batch or Stream
Demo/Code
Getting Started Resources
Introduction
The Tense Quadrachotomy of Big Data
(Diagram: Accuracy, Speed, Cost, Control, Complexity, Time to Answer)
Introduction
Google Managed Services Toolbox for Big Data
Fully Managed, No-Ops Services:
● BigQuery - ingest data at 100,000 rows per second
● Dataflow - stream & batch processing, unified and simplified
● Pub/Sub - scalable, flexible, and globally available messaging
Introduction
Google Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized, parallelized data processing pipelines.
Cloud Dataflow SDK
Cloud Dataflow SDK
Dataflow Benefits
❯ Unified programming model for both batch & stream processing
• Independent of the execution back-end, a.k.a. the “runner”
❯ Google-driven & open sourced
• Java 7
• Python 2 (streaming is in Alpha)
Cloud Dataflow SDK - Logical Model

Pipeline {
  Who   => Inputs
  What  => Transforms              <- Aggregations, Filters, Joins, ...
  Where => Windows
  When  => Watermarks + Triggers   <- Completeness
  To    => Outputs                 <- At-once guarantee (modulo completeness thresholds)
}
• A Directed Acyclic Graph of data
processing transformations
• Can be submitted to the Dataflow
Service for optimization and execution
or executed on an alternate runner e.g.
Spark
• May include multiple inputs and multiple
outputs
• May encompass many logical
MapReduce operations
• PCollections flow through the pipeline
Pipeline
Cloud Dataflow SDK
❯ Read from standard Google Cloud
Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by
teaching Dataflow how to read it in
parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON,
XML, Avro formatted data
Cloud Dataflow SDK
Inputs & Outputs
Your
Source/Sink
Here
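A short sketch of the standard sources and sinks, using today's Beam Python SDK names as an assumption (the talk's original SDK examples were Java); the bucket and table are placeholders:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')   # placeholder bucket
     | 'ParseJson'   >> beam.Map(json.loads)
     | 'WriteToBQ'   >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',                                  # placeholder table
           schema='word:STRING,count:INTEGER'))
```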
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
❯ Processes each element of a PCollection
independently using a user-provided
DoFn
❯ Elements are processed in arbitrary
‘bundles’ e.g. “shards”
• startBundle(), processElement() - N
times, finishBundle()
❯ Corresponds to both the Map and Reduce
phases in Hadoop i.e.
ParDo->GBK->ParDo
KeyBySessionId
ParDo (“Parallel Do”)
Cloud Dataflow SDK
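A minimal Python-SDK sketch of the ParDo above; KeyByFirstLetter is an illustrative stand-in for the KeyBySessionId DoFn in the example:

```python
import apache_beam as beam

class KeyByFirstLetter(beam.DoFn):
    """User-provided DoFn applied independently to each element."""
    def process(self, element):
        yield (element[0], element)

with beam.Pipeline() as p:
    keyed = (p
             | beam.Create(['Seahawks', 'NFC', 'Champions', 'Seattle'])
             | 'KeyByFirstLetter' >> beam.ParDo(KeyByFirstLetter()))
```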
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value
pairs and gathers up all values with
the same key
• Corresponds to the shuffle phase in
Hadoop
Cloud Dataflow SDK
GroupByKey
{KV<S, {Seahawks, Seattle, …}>,
KV<N, {NFC, …}>,
KV<C, {Champions, …}>}
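Continuing that sketch, GroupByKey gathers all values with the same key; on a bounded PCollection like this one it needs no windowing:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('S', 'Seahawks'), ('C', 'Champions'),
                    ('S', 'Seattle'), ('N', 'NFC')])
     | beam.GroupByKey()     # -> ('S', ['Seahawks', 'Seattle']), ('C', [...]), ('N', [...])
     | beam.Map(print))
```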
❯ Logically divide up or group the elements of a
PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on an
unbounded PCollection, but can also be used for
bounded PCollections
❯ Window.into() can be called at any point in the
pipeline and will be applied when needed
❯ Can be tied to arrival time or custom event time
❯ Watermarks + Triggers enable robust
completeness
Windows
Cloud Dataflow SDK
(Diagram: Nighttime / Mid-Day / Nighttime)
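A minimal windowing sketch in the Python SDK, assuming illustrative timestamps; fixed hourly windows are shown, but sliding windows and sessions are set up the same way:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('user-1', 1), ('user-2', 1)])
     # Attach an (illustrative) event timestamp to each element
     | beam.Map(lambda kv: window.TimestampedValue(kv, 1511222400))
     # Fixed hourly windows; GroupByKey-based aggregations then run per window
     | beam.WindowInto(window.FixedWindows(60 * 60))
     | beam.CombinePerKey(sum))
```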
(Diagram: Count = Pair With Ones -> GroupByKey -> Sum Values)
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max,
Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Cloud Dataflow SDK
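A sketch of a composite PTransform in the Python SDK that builds the Count pattern from the diagram; CombinePerKey(sum) stands in for the GroupByKey + Sum Values pair:

```python
import apache_beam as beam

class CountPerElement(beam.PTransform):
    """Composite transform: Pair With Ones -> group -> Sum Values."""
    def expand(self, pcoll):
        return (pcoll
                | 'PairWithOnes' >> beam.Map(lambda element: (element, 1))
                | 'SumValues'    >> beam.CombinePerKey(sum))

# Usage: counts = words | CountPerElement()
```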
Run the same code in multiple modes using different runners
❯ Direct Runner
• For local, in-memory execution.
• Great for developing and unit tests
❯ Cloud Dataflow Service Runner
• Runs on the fully-managed Dataflow Service
• Your code runs distributed across GCE instances
❯ Community sourced
• Spark runner @ github.com/cloudera/spark-dataflow - Thanks Josh!
• Flink runner coming soon from dataArtisans
Cloud Dataflow Runners
Cloud Dataflow SDK
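A small sketch of the unit-testing point: the SDK's TestPipeline, assert_that, and equal_to run the pipeline on the local Direct runner; Count.PerElement is one of the built-in utility transforms mentioned above:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
from apache_beam.transforms.combiners import Count

def test_count_per_element():
    with TestPipeline() as p:                 # runs on the local Direct runner
        counts = (p
                  | beam.Create(['a', 'b', 'a'])
                  | Count.PerElement())
        assert_that(counts, equal_to([('a', 2), ('b', 1)]))
```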
Life of a Pipeline
Cloud Dataflow Service
(Diagram: User Code & SDK is deployed and scheduled to the GCP Managed Service — Job Manager and Work Manager — with a Monitoring UI showing progress & logs)
• At-once processing*
• Graph optimization (ref. FlumeJava)
• Worker lifecycle management
• Worker resource scaling
• Worker scaling
• RESTful management API and CLI
• Real-time job monitoring, Cloud Debugger & Cloud Logging integration
• Project-based security with auto wipeout
* no enforcement on external service idempotency; dependent on correctness thresholds
Cloud Dataflow Service Benefits
Cloud Dataflow Service
❯ ParDo fusion
• Producer Consumer
• Sibling
• Intelligent fusion boundaries
❯ Combiner lifting e.g. partial aggregations before reduction
❯ Flatten unzipping
❯ Reshard placement
...
Graph Optimization
Cloud Dataflow Service
Optimizer: ParallelDo Fusion
Cloud Dataflow Service
(Diagram: legend — ParallelDo, GBK = GroupByKey, + = CombineValues; consumer-producer fusion merges a ParallelDo C feeding a ParallelDo D into a single C+D step, and sibling fusion merges sibling ParallelDos C and D reading the same input into C+D)
Deploy -> Schedule & Monitor -> Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
(Diagram: worker pool scales with load — 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS)
Worker Scaling
Cloud Dataflow Service
100 mins. vs. 65 mins.
Worker Optimization
Cloud Dataflow Service
Optimizing Your Time
Cloud Dataflow Service
(Diagram: Typical Data Processing — programming, resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, utilization improvements. Data Processing with Cloud Dataflow — programming, leaving more time to dig into your data.)
Batch or Streaming
● Boundedness
○ Bounded data - finite data set, fixed in schema, is complete regardless of
time, typically at rest in a common durable store
○ Unbounded data - infinite, potentially changing schema, is never complete,
typically not at rest and stored in multiple temporary yet durable stores
● Time to answer
○ Batch processing presents risks of increased cost (under-utilized resources),
increased time to answer, and decreased correctness (late-arriving events)
Considerations
Batch / Streaming
Batch failure mode #1: Time to answer (latency)
Batch / Streaming
There are situations where batch processing of growing datasets breaks down.
The first is latency-sensitive processing: you can't use an hourly or daily batch job to do low-latency fraud, abuse, or anomaly detection.
Batch failure mode #2: Sessions
Batch / Streaming
(Diagram: MapReduce over Tuesday and Wednesday batches, with users Jose, Lisa, Ingo, Asha, Cheryl, and Ari whose sessions span the batch boundary)
The second is sessions: batch processing of individual chunks doesn't account for sessions across batch boundaries. This is a real problem if you cannot afford to miss or duplicate important sessions, or generally need to do any cross-chunk analysis. It also gets worse as you decrease the chunk size.
Streaming Patterns: Element-wise transformations
Batch / Streaming
(Diagram: processing-time axis from 8:00 to 14:00)
A streaming pipeline naturally handles unbounded, infinite collections of data.
Element-wise transformations like filtering can simply be applied as elements flow past.
Streaming Patterns: Aggregating Time-Based Windows
Batch / Streaming
(Diagram: processing-time axis from 8:00 to 14:00)
However, for aggregations that require combining multiple elements together, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently.
The simplest way to do this is just to take whatever elements we see in a fixed time period. But elements often get delayed, so this might mean we're processing a bunch of events where most occurred between 1 and 2pm, but there are still a few stragglers from 9am showing up.
Streaming Patterns: Event-Time Based Windows
Batch / Streaming
(Diagram: input and output elements plotted against event time and processing time, 10:00 to 15:00)
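A hedged sketch of event-time windows that tolerate the 9am stragglers, written against a recent Beam Python SDK (at the time of this talk, Python streaming was still Alpha, so parameter names may differ); events is assumed to be a PCollection of timestamped (key, value) pairs:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

def window_hourly(events):
    """`events`: a PCollection of timestamped (key, value) pairs, e.g. from Pub/Sub."""
    return (events
            | beam.WindowInto(
                window.FixedWindows(60 * 60),                 # hourly event-time windows
                trigger=AfterWatermark(late=AfterCount(1)),   # re-fire the pane per late element
                accumulation_mode=AccumulationMode.ACCUMULATING,
                allowed_lateness=2 * 60 * 60)                 # tolerate stragglers up to 2 hours
            | beam.CombinePerKey(sum))
```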
Demo/Code
Demo Architecture Overview
Demo/Code
a. A Python script on GKE listens for #cloud status updates and pushes them to Pub/Sub
b. Dataflow:
i. Pulls from Pub/Sub
ii. Sends the text of matching mentions to the GCP NLP API
iii. Loads the output into BigQuery
c. Data Studio connects to the BigQuery data source
d. Data Studio generates the visualization
(This demo is not a PROD-ready implementation)
(Diagram: GKE -> Cloud Pub/Sub -> Cloud Dataflow -> BigQuery -> Data Studio, with the NLP API called from Dataflow)
Demo pipeline (Python SDK)
Demo/Code
(Diagram: Pipeline IO (Text) reads from Pub/Sub -> a PTransform runs NLP analysis -> Pipeline IO (Text) writes to BigQuery with a specified schema)
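A hedged sketch of what the demo pipeline could look like in the Beam Python SDK; the subscription, table, and analyze_sentiment helper are placeholders, and the real Cloud Natural Language API call is omitted:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def analyze_sentiment(text):
    """Placeholder: call the Cloud Natural Language API and return its score."""
    score = 0.0   # real NLP API call goes here
    return {'text': text, 'score': score}

options = PipelineOptions(streaming=True)   # streaming pipeline on the chosen runner
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadTweets' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/cloud-tweets')   # placeholder
     | 'Decode'     >> beam.Map(lambda b: b.decode('utf-8'))
     | 'Sentiment'  >> beam.Map(analyze_sentiment)
     | 'WriteToBQ'  >> beam.io.WriteToBigQuery(
           'my-project:tweets.sentiment',                                   # placeholder table
           schema='text:STRING,score:FLOAT'))
```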
Getting Started Resources
❯ cloud.google.com/dataflow
❯ stackoverflow.com/questions/tagged/google-cloud-dataflow
❯ github.com/GoogleCloudPlatform/DataflowJavaSDK
Resources
Getting Started
Thank you
Wrap-up
Data Science | Design | Technology
(JL)
● Last meetup for 2017… Next meetup in January 2018
Data Science | Design | Technology
● 1000 members and counting...
● 2018: More co-presentations. Your meetup, your topics
● Special thanks to speakers, hosts, sponsors, members
Merci / Thank You
@jdalabsmtl
Data Science | Design | Technology
(Check for next DSDT meetup at https://www.meetup.com/DSDTMTL)
