Data Science | Design | Technology
(November 21, 2017)
https://www.meetup.com/DSDTMTL
Agenda
6:00 - 6:15: Welcome
6:15 - 6:50: Machine Learning through Dataflow
6:50 - 7:30: Building a Streaming Pipeline
7:30 - 8:00: Q&A + Networking
Data Streaming and Machine Learning
with Google Cloud Platform
Maxime Legault-Venne
Software Architect, JDA Labs
Arsho Toubi
Customer Engineer, Google Cloud
Data Streaming and Machine Learning
with Google Cloud Platform
Topic #1
Machine Learning through Google Cloud Platform’s Dataflow
Data Science | Design | Technology
(Max)
Machine Learning Through
Google Cloud Platform’s Dataflow
WHAT’S OUR AIM? Predict how well an item will sell based on past performance of
similar items.
HOW DO WE PROCEED? Machine Learning
1. Train on existing data to learn how item attributes (features) affect item performance
2. Use the trained model to predict how unknown items will perform, based on their attributes (features)
MACHINE LEARNING
Training Process
● Generate features for every item (e.g. color, brand, pattern)
● Shuffle the data
● Split the data into training entries (training set + test set)
● Generate hyperparameters from a predefined search space
● Combine hyperparameters and training entries
● Train a model for each combination
○ Evaluate the out-of-sample error
● Combine the errors for each hyperparameter set
● Rank the hyperparameter sets by error to select the best
● Train on all data with the best hyperparameters and keep the trained model for prediction
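A minimal sketch of this training loop in plain Python. The helpers make_features, train_model, and error are hypothetical placeholders for the team's actual feature generation, model, and error metric, and the split below is a simple K-fold-style split rather than the exact one used:

```python
import itertools
import random

def run_training(items, search_space, n_entries=5):
    """Sketch: shuffle, split into training entries, train one model per
    (hyperparameters, entry) combination, rank by combined out-of-sample
    error, then retrain on all data with the best hyperparameter set."""
    # Generate features for every item (e.g. color, brand, pattern)
    data = [(make_features(item), item.sales) for item in items]   # hypothetical make_features
    random.shuffle(data)

    # Split the data into training entries (training set + test set)
    chunk = len(data) // n_entries
    entries = [(data[:i] + data[i + chunk:], data[i:i + chunk])
               for i in range(0, chunk * n_entries, chunk)]

    # Generate hyperparameter combinations from a predefined search space
    keys = sorted(search_space)
    hyperparams = [dict(zip(keys, values))
                   for values in itertools.product(*(search_space[k] for k in keys))]

    # Train a model for each combination and combine the errors per hyperparameter set
    ranked = []
    for hp in hyperparams:
        errors = [error(train_model(train, hp), test)              # hypothetical train_model / error
                  for train, test in entries]
        ranked.append((sum(errors) / len(errors), hp))

    # Rank the hyperparameter sets, keep the best, and train on all data
    _, best_hp = min(ranked, key=lambda pair: pair[0])
    return train_model(data, best_hp)
```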
TRAINING ENTRIES
Splitting Data
(Diagram: 3 months of data split into training entries — a training set plus a test set — used to get model errors, and the final training data set.)
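A small sketch of a time-based split, under the assumption that records carry a date attribute and that the most recent slice is held out as the test set (the exact split the team used is only shown as a diagram):

```python
from datetime import timedelta

def split_records(records, test_days=30):
    """Hold out the most recent `test_days` of a time-ordered dataset as the
    test set; everything earlier becomes the training set."""
    records = sorted(records, key=lambda r: r.date)
    cutoff = records[-1].date - timedelta(days=test_days)
    training_set = [r for r in records if r.date <= cutoff]
    test_set = [r for r in records if r.date > cutoff]
    return training_set, test_set
```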
WHAT’S THE PROBLEM?
Multiple combinations to train
Ex.: For 700 combinations of hyperparameters and training entries,
training took around 12 hours to complete.
HOW TO SOLVE IT? - Parallelize by running all trainings concurrently
- Need a lot of processing power
- Need a lot of memory
APACHE BEAM - Implementation agnostic & open source
- Java & Python SDKs
- Lets you build pipelines
https://beam.apache.org/documentation/runners/capability-matrix/
PIPELINE BASICS 1. Create streaming or batch pipeline
2. Read data from various sources
○ Files
○ Databases
○ Cloud solutions
○ Code-generated data
○ ...
3. Apply transforms to process data
4. Write or output final pipeline data
5. Run on a specific runner
○ Direct (locally on your machine)
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump
https://beam.apache.org/documentation/pipelines/create-your-pipeline/
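Putting these steps together, a minimal Beam pipeline in the Python SDK; the combinations list and the train_and_score function are illustrative placeholders, not the actual JDA Labs code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative combinations of hyperparameters and training entries
combinations = [({'max_depth': 3}, 'entry-0'), ({'max_depth': 5}, 'entry-1')]

def train_and_score(combination):
    """Hypothetical: train one model for this combination and return its error."""
    hyperparams, entry = combination
    return hyperparams, entry, 0.0   # real training/evaluation goes here

# 1. Create a batch pipeline (5. the runner is chosen via PipelineOptions)
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # 2. Read data from a source -- here, code-generated combinations
     | 'CreateCombinations' >> beam.Create(combinations)
     # 3. Apply a transform that trains and evaluates each combination
     | 'TrainAndScore' >> beam.Map(train_and_score)
     # 4. Write or output the final pipeline data
     | 'WriteResults' >> beam.io.WriteToText('results'))
```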
EXECUTING IN DATAFLOW
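As a hedged sketch, running the same pipeline on the Cloud Dataflow service only changes the pipeline options; the project, region, bucket, and job name below are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder project id
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',  # placeholder staging bucket
    '--job_name=train-700-combinations',
])
```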
RESULTS Testing 700 combinations of hyperparameters and training entries
The previous sequential execution took 12 hours to process
The whole Dataflow job took a bit less than 28 minutes
Topic #2
Building a Dataflow streaming pipeline for sentiment analysis on Twitter
Data Science | Design | Technology
(Arsho)
Streaming pipelines with
Google Cloud Dataflow
Agenda
Trade-offs and challenges in Big Data
Apache Beam SDK
Cloud Dataflow Service
Batch or Stream
Demo/Code
Getting Started Resources
Introduction
The Tense Quadrachotomy of Big Data
(Diagram: Accuracy, Speed, Cost, Control, Complexity, Time to Answer)
Introduction
Google Managed Services Toolbox for Big Data
Fully Managed, No-Ops Services:
● BigQuery - ingest data at 100,000 rows per second
● Dataflow - stream & batch processing, unified and simplified
● Pub/Sub - scalable, flexible, and globally available messaging
Introduction
Google Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized, parallelized data processing pipelines.
Cloud Dataflow SDK
Cloud Dataflow SDK
Dataflow Benefits
❯ Unified programming model for both batch & stream processing
• Independent of the execution back-end, a.k.a. the “runner”
❯ Google-driven & open sourced
• Java 7
• Python 2 (streaming is in Alpha)
Cloud Dataflow SDK - Logical Model

Pipeline {
  Who   => Inputs
  What  => Transforms              <- Aggregations, Filters, Joins, ...
  Where => Windows
  When  => Watermarks + Triggers   <- Completeness
  To    => Outputs                 <- At-once guarantee (modulo completeness thresholds)
}
• A Directed Acyclic Graph of data
processing transformations
• Can be submitted to the Dataflow
Service for optimization and execution
or executed on an alternate runner e.g.
Spark
• May include multiple inputs and multiple
outputs
• May encompass many logical
MapReduce operations
• PCollections flow through the pipeline
Pipeline
Cloud Dataflow SDK
❯ Read from standard Google Cloud
Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by
teaching Dataflow how to read it in
parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON,
XML, Avro formatted data
Cloud Dataflow SDK
Inputs & Outputs
Your
Source/Sink
Here
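A short sketch of the standard sources and sinks, using today's Beam Python SDK names as an assumption (the talk's original SDK examples were Java); the bucket and table are placeholders:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')   # placeholder bucket
     | 'ParseJson'   >> beam.Map(json.loads)
     | 'WriteToBQ'   >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',                                  # placeholder table
           schema='word:STRING,count:INTEGER'))
```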
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
❯ Processes each element of a PCollection
independently using a user-provided
DoFn
❯ Elements are processed in arbitrary
‘bundles’ e.g. “shards”
• startBundle(), processElement() - N
times, finishBundle()
❯ Corresponds to both the Map and Reduce
phases in Hadoop i.e.
ParDo->GBK->ParDo
KeyBySessionId
ParDo (“Parallel Do”)
Cloud Dataflow SDK
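A minimal Python-SDK sketch of the ParDo above; KeyByFirstLetter is an illustrative stand-in for the KeyBySessionId DoFn in the example:

```python
import apache_beam as beam

class KeyByFirstLetter(beam.DoFn):
    """User-provided DoFn applied independently to each element."""
    def process(self, element):
        yield (element[0], element)

with beam.Pipeline() as p:
    keyed = (p
             | beam.Create(['Seahawks', 'NFC', 'Champions', 'Seattle'])
             | 'KeyByFirstLetter' >> beam.ParDo(KeyByFirstLetter()))
```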
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, Seahawks>, KV<C, Champions>,
KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value
pairs and gathers up all values with
the same key
• Corresponds to the shuffle phase in
Hadoop
Cloud Dataflow SDK
GroupByKey
{KV<S, {Seahawks, Seattle, …}>,
KV<N, {NFC, …}>,
KV<C, {Champions, …}>}
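Continuing that sketch, GroupByKey gathers all values with the same key; on a bounded PCollection like this one it needs no windowing:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('S', 'Seahawks'), ('C', 'Champions'),
                    ('S', 'Seattle'), ('N', 'NFC')])
     | beam.GroupByKey()     # -> ('S', ['Seahawks', 'Seattle']), ('C', [...]), ('N', [...])
     | beam.Map(print))
```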
❯ Logically divide up or group the elements of a
PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on an
unbounded PCollection, but can also be used for
bounded PCollections
❯ Window.into() can be called at any point in the
pipeline and will be applied when needed
❯ Can be tied to arrival time or custom event time
❯ Watermarks + Triggers enable robust
completeness
Windows
Cloud Dataflow SDK
(Diagram: Nighttime / Mid-Day / Nighttime)
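A minimal windowing sketch in the Python SDK, assuming illustrative timestamps; fixed hourly windows are shown, but sliding windows and sessions are set up the same way:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('user-1', 1), ('user-2', 1)])
     # Attach an (illustrative) event timestamp to each element
     | beam.Map(lambda kv: window.TimestampedValue(kv, 1511222400))
     # Fixed hourly windows; GroupByKey-based aggregations then run per window
     | beam.WindowInto(window.FixedWindows(60 * 60))
     | beam.CombinePerKey(sum))
```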
(Diagram: Count = Pair With Ones -> GroupByKey -> Sum Values)
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max,
Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Cloud Dataflow SDK
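A sketch of a composite PTransform in the Python SDK that builds the Count pattern from the diagram; CombinePerKey(sum) stands in for the GroupByKey + Sum Values pair:

```python
import apache_beam as beam

class CountPerElement(beam.PTransform):
    """Composite transform: Pair With Ones -> group -> Sum Values."""
    def expand(self, pcoll):
        return (pcoll
                | 'PairWithOnes' >> beam.Map(lambda element: (element, 1))
                | 'SumValues'    >> beam.CombinePerKey(sum))

# Usage: counts = words | CountPerElement()
```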
Run the same code in multiple modes using different runners
❯ Direct Runner
• For local, in-memory execution.
• Great for developing and unit tests
❯ Cloud Dataflow Service Runner
• Runs on the fully-managed Dataflow Service
• Your code runs distributed across GCE instances
❯ Community sourced
• Spark runner @ github.com/cloudera/spark-dataflow - Thanks Josh!
• Flink runner coming soon from dataArtisans
Cloud Dataflow Runners
Cloud Dataflow SDK
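A small sketch of the unit-testing point: the SDK's TestPipeline, assert_that, and equal_to run the pipeline on the local Direct runner; Count.PerElement is one of the built-in utility transforms mentioned above:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to
from apache_beam.transforms.combiners import Count

def test_count_per_element():
    with TestPipeline() as p:                 # runs on the local Direct runner
        counts = (p
                  | beam.Create(['a', 'b', 'a'])
                  | Count.PerElement())
        assert_that(counts, equal_to([('a', 2), ('b', 1)]))
```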
Life of a Pipeline
Cloud Dataflow Service
(Diagram: User Code & SDK is deployed and scheduled to the GCP Managed Service — Job Manager and Work Manager — with a Monitoring UI showing progress & logs)
• At-once processing*
• Graph optimization (ref. FlumeJava)
• Worker lifecycle management
• Worker resource scaling
• Worker scaling
• RESTful management API and CLI
• Real-time job monitoring, Cloud Debugger & Cloud Logging integration
• Project-based security with auto wipeout
* no enforcement on external service idempotency; dependent on correctness thresholds
Cloud Dataflow Service Benefits
Cloud Dataflow Service
❯ ParDo fusion
• Producer Consumer
• Sibling
• Intelligent fusion boundaries
❯ Combiner lifting e.g. partial aggregations before reduction
❯ Flatten unzipping
❯ Reshard placement
...
Graph Optimization
Cloud Dataflow Service
Optimizer: ParallelDo Fusion
Cloud Dataflow Service
(Diagram: legend — ParallelDo, GBK = GroupByKey, + = CombineValues; consumer-producer fusion merges a ParallelDo C feeding a ParallelDo D into a single C+D step, and sibling fusion merges sibling ParallelDos C and D reading the same input into C+D)
Deploy -> Schedule & Monitor -> Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
(Diagram: worker pool scales with load — 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS)
Worker Scaling
Cloud Dataflow Service
100 mins. vs. 65 mins.
Worker Optimization
Cloud Dataflow Service
Optimizing Your Time
Cloud Dataflow Service
(Diagram: Typical Data Processing — programming, resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, utilization improvements. Data Processing with Cloud Dataflow — programming, leaving more time to dig into your data.)
Batch or Streaming
● Boundedness
○ Bounded data - finite data set, fixed in schema, is complete regardless of
time, typically at rest in a common durable store
○ Unbounded data - infinite, potentially changing schema, is never complete,
typically not at rest and stored in multiple temporary yet durable stores
● Time to answer
○ Batch processing presents risks of increased cost (under-utilized resources),
increased time to answer, and decreased correctness (late-arriving events)
Considerations
Batch / Streaming
Batch failure mode #1: Time to answer (latency)
Batch / Streaming
There are situations where batch processing of growing datasets breaks down.
The first is latency-sensitive processing: you can't use an hourly or daily batch job to do low-latency fraud, abuse, or anomaly detection.
Batch failure mode #2: Sessions
Batch / Streaming
(Diagram: MapReduce over Tuesday and Wednesday batches, with users Jose, Lisa, Ingo, Asha, Cheryl, and Ari whose sessions span the batch boundary)
The second is sessions: batch processing of individual chunks doesn't account for sessions across batch boundaries. This is a real problem if you cannot afford to miss or duplicate important sessions, or generally need to do any cross-chunk analysis. It also gets worse as you decrease the chunk size.
Streaming Patterns: Element-wise transformations
Batch / Streaming
(Diagram: processing-time axis from 8:00 to 14:00)
A streaming pipeline naturally handles unbounded, infinite collections of data.
Element-wise transformations like filtering can simply be applied as elements flow past.
Streaming Patterns: Aggregating Time-Based Windows
Batch / Streaming
(Diagram: processing-time axis from 8:00 to 14:00)
However, for aggregations that require combining multiple elements together, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently.
The simplest way to do this is just to take whatever elements we see in a fixed time period. But elements often get delayed, so this might mean we're processing a bunch of events where most occurred between 1 and 2pm, but there are still a few stragglers from 9am showing up.
Streaming Patterns: Event-Time Based Windows
Batch / Streaming
(Diagram: input and output elements plotted against event time and processing time, 10:00 to 15:00)
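A hedged sketch of event-time windows that tolerate the 9am stragglers, written against a recent Beam Python SDK (at the time of this talk, Python streaming was still Alpha, so parameter names may differ); events is assumed to be a PCollection of timestamped (key, value) pairs:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

def window_hourly(events):
    """`events`: a PCollection of timestamped (key, value) pairs, e.g. from Pub/Sub."""
    return (events
            | beam.WindowInto(
                window.FixedWindows(60 * 60),                 # hourly event-time windows
                trigger=AfterWatermark(late=AfterCount(1)),   # re-fire the pane per late element
                accumulation_mode=AccumulationMode.ACCUMULATING,
                allowed_lateness=2 * 60 * 60)                 # tolerate stragglers up to 2 hours
            | beam.CombinePerKey(sum))
```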
Demo/Code
Demo Architecture Overview
Demo/Code
a. A Python script on GKE listens for #cloud status updates and pushes them to Pub/Sub
b. Dataflow:
i. Pulls from Pub/Sub
ii. Sends the text of matching mentions to the GCP NLP API
iii. Loads the output into BigQuery
c. Data Studio connects to the BigQuery data source
d. Data Studio generates the visualization
(This demo is not a PROD-ready implementation)
(Diagram: GKE -> Cloud Pub/Sub -> Cloud Dataflow -> BigQuery -> Data Studio, with the NLP API called from Dataflow)
Demo pipeline (Python SDK)
Demo/Code
(Diagram: Pipeline IO (Text) reads from Pub/Sub -> a PTransform runs NLP analysis -> Pipeline IO (Text) writes to BigQuery with a specified schema)
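A hedged sketch of what the demo pipeline could look like in the Beam Python SDK; the subscription, table, and analyze_sentiment helper are placeholders, and the real Cloud Natural Language API call is omitted:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def analyze_sentiment(text):
    """Placeholder: call the Cloud Natural Language API and return its score."""
    score = 0.0   # real NLP API call goes here
    return {'text': text, 'score': score}

options = PipelineOptions(streaming=True)   # streaming pipeline on the chosen runner
with beam.Pipeline(options=options) as p:
    (p
     | 'ReadTweets' >> beam.io.ReadFromPubSub(
           subscription='projects/my-project/subscriptions/cloud-tweets')   # placeholder
     | 'Decode'     >> beam.Map(lambda b: b.decode('utf-8'))
     | 'Sentiment'  >> beam.Map(analyze_sentiment)
     | 'WriteToBQ'  >> beam.io.WriteToBigQuery(
           'my-project:tweets.sentiment',                                   # placeholder table
           schema='text:STRING,score:FLOAT'))
```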
Getting Started Resources
❯ cloud.google.com/dataflow
❯ stackoverflow.com/questions/tagged/google-cloud-dataflow
❯ github.com/GoogleCloudPlatform/DataflowJavaSDK
Resources
Getting Started
Thank you
Wrap-up
Data Science | Design | Technology
(JL)
● Last meetup for 2017… Next meetup in January 2018
Data Science | Design | Technology
● 1000 members and counting...
● 2018: More co-presentations. Your meetup, your topics
● Special thanks to speakers, hosts, sponsors, members
Merci / Thank You
@jdalabsmtl
Data Science | Design | Technology
(Check for next DSDT meetup at https://www.meetup.com/DSDTMTL)
