Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment into a production-ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero - is about the work process you need to follow in order to have a production-ready workflow up and running.
Covering:
* Small - Medium experimentation (R)
* Big data implementation (Spark Mllib /+ pipeline)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
Production-Ready BIG ML Workflows - from zero to hero
1. By
Daniel Marcous
Google, Waze, Data Wizard
dmarcous@gmail/google.com
Big Data Analytics :
Production Ready Flows &
Waze Use Cases
2. Rules
1. Interactive is interesting.
2. If you’ve got something to say, say it!
3. Be open-minded - I’m sure I’ve got something to learn from you, and I hope you’ve got something to learn from me.
3. What’s a Data Wizard you ask?
Gain Actionable Insights!
5. What’s here?
Methodology
Deploying big models to production - step by step
Pitfalls
What to look out for in both methodology and code
Use Cases
Showing off what we actually do in Waze Analytics
Based on tough lessons learned & Google experts recommendations and inputs.
11. Bigger is better
● More processing power
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
○ Some models do not overfit even when optimised until training error reaches 0.
■ RF - more trees
■ ANN - more iterations
● Handle BIG data
○ Tons of training data (if you have it) - no need for sampling on the wrong populations!
○ Millions of features? Easy… (text processing with TF-IDF)
○ Some models (e.g. ANN) can’t do well without training on a lot of data.
13. Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow
○ Harder and more important to tell what’s gold and what’s noise
○ Unbalanced data goes a long way with more records
● Big model != Small model
○ Different parameter settings
○ Different metric readings
■ Different implementations (distributed VS central memory)
■ Different programming language (heuristics)
○ Different populations trained on (sampling)
16. Before you start
● Create example input
○ Raw input
● Create example output
○ Featured input
○ Prediction rows
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
■ Amount of subjects affected
■ Sometimes measured as average precision per K random subjects.
Remember: Desired short-term behaviour does not imply long-term behaviour.
[Diagram: Raw input → Preprocess (parse, clean, join, etc.) → Naive feature matrix → Measure]
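The metrics checklist above can be sketched in plain Python. This is a toy, not the deck's code: the labels are made up and the coverage ratio is hypothetical; in the Spark flow these reads would come from an Evaluator.

```python
# Hypothetical binary labels; in production these come off the scored feature matrix.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

def confusion(y_true, y_pred):
    """2x2 confusion matrix counts for a binary task."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Coverage: share of subjects that received any prediction at all
coverage = 8 / 10  # hypothetical: 8 subjects scored out of 10 affected
```

The same counts feed the per-class view: compute precision/recall once per label instead of only for the positive class.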
17. Preprocess
● Naive feature matrix
○ Parse (Text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
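A minimal sketch of this stage in plain Python, with hypothetical field names (`user_id`, `speed`); the real flow does Text -> RDD[Object] -> DataFrame in Spark, but the steps - clean, add day-of-week/hour features, write a readable CSV - are the same:

```python
import csv
import io
from datetime import datetime

# Hypothetical raw records standing in for parsed production data.
raw = [
    {"user_id": "u1", "ts": "2016-03-01T08:30:00", "speed": 42.0},
    {"user_id": "u2", "ts": "2016-03-01T09:00:00", "speed": -5.0},  # bad record
    {"user_id": "u3", "ts": "2016-03-05T17:15:00", "speed": 55.0},
]

# Clean: remove outliers / bad records
clean = [r for r in raw if 0 <= r["speed"] <= 200]

# Add basic features: day of week / hour
for r in clean:
    ts = datetime.fromisoformat(r["ts"])
    r["day_of_week"] = ts.weekday()
    r["hour"] = ts.hour

# Write a readable CSV you can start working with
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "day_of_week", "hour", "speed"])
writer.writeheader()
writer.writerows({k: r[k] for k in writer.fieldnames} for r in clean)
baseline_csv = buf.getvalue()
```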
22. Visualise - easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before /after joining
○ Amounts of output data
○ Metrics (See “Measure first, optimize second”)
● Different model comparison - what’s best, when and where
● Timeseries Analysis
○ Anomaly detection - Does a metric suddenly drastically change?
○ Impact analysis - Did deploying a model have a significant effect on a metric?
23. Shiny
● Web application framework for R.
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R statistical / modeling power
● Turns R function wrappers into interactive dashboard elements.
○ Generates HTML, CSS, JS behind the scenes so you only write R.
● Get started
● Get inspired
● Shiny @Waze
27. Reduce the problem
● Tradeoff : Time to market VS Loss of accuracy
● Sample data
○ Is random actually what you want?
■ Keep label distributions
■ Keep important features distributions
● Test everything you believe worthy
○ Choose model
○ Choose features (important when you go big)
■ Leave the borderline-significant ones in
○ Test different parameter configurations (you’ll need to validate your choice later)
Remember : This isn’t your production model. You’re only getting a sense of the data for now.
28. Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
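The random partitioning above - combined with the earlier advice to keep label distributions when sampling - amounts to a stratified split. A minimal sketch (the `stratified_split` helper is hypothetical, not from the deck's code; R's `caret::createDataPartition` does the same job):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_frac=0.3, seed=42):
    """Split rows into train/test while keeping the label distribution in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)                 # random partitioning within each class
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

data = [{"label": i % 2, "x": i} for i in range(100)]  # toy 50/50 labels
train, test = stratified_split(data, "label")
```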
29. Getting a feel
Logistic regression and basic variable selection with R.
Logistic regression
Variable significance test
30. Getting a feel
Advanced variable selection with regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
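The mechanism behind regularised selection is L1 shrinkage: the soft-thresholding step pushes small coefficients to exactly zero, and a zeroed coefficient is "not entered into the model". A toy sketch with hypothetical feature names (not the deck's R output):

```python
def soft_threshold(coef, lam):
    """L1 proximal step: shrink toward zero; coefficients within lam drop out entirely."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0

# Hypothetical raw coefficients from an unregularised fit
coefs = {"speed": 2.5, "hour": 0.1, "day_of_week": -0.05}
selected = {name: soft_threshold(c, lam=0.2) for name, c in coefs.items()}
# "hour" and "day_of_week" shrink to 0.0 -> excluded from the model
```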
31. Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted trees model
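RMSE itself is a one-liner; this mirrors the measure in the R screenshot (toy vectors, not the deck's data). "Lower = better" only holds when comparing models on the same data and scale:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error - lower is better (on comparable data)."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # no error
worse = rmse([0.0, 0.0], [3.0, 4.0])
```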
32. Getting a feel
Modeling bigger data with R, using parallelism.
Fit and combine 6 random forest models (10k trees each) in parallel
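The fit-in-parallel-then-combine trick works because random forest predictions are just votes, so independently trained forests merge by majority. A toy Python sketch of the merge step (`train_toy_model` is a stand-in for fitting one forest, not real training code):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_toy_model(seed):
    """Stand-in for fitting one forest partition; returns a predict function."""
    return lambda x: (x + seed) % 2  # hypothetical per-model vote

# Fit 5 independent models in parallel
with ThreadPoolExecutor(max_workers=5) as pool:
    models = list(pool.map(train_toy_model, range(5)))

def ensemble_predict(x):
    """Merge the models by majority vote, the way independent forests combine."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```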
34. Basic moving parts
[Diagram: Data sources 1..N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB + Dashboard, with a Feedback loop and a Conf. of user/model assignments closing the loop]
35. Flow motives
● Only 1 job for preprocessing
○ Used in both training and serving - reduces risk of training on wrong population
○ Should also be used before sampling when experimenting on a smaller scale.
○ When data sources are different for training and serving (RT VS Batch, for example), use interfaces!
● Saving training & scoring feature matrices aside
○ Try new algorithms / parameters on the same data
○ Measure changes on same data as used in production.
36. Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
SparkSQL UDFs
Implement feature generation -
decouples training and serving
Data cleaning work
37. Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
Generate feature matrix
Blackbox from app view
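The decoupling idea above can be sketched as one feature-generation function shared by training and scoring, so both populations see identical preprocessing. The feature names are hypothetical; the deck's real version does this with SparkSQL UDFs:

```python
def generate_features(record, for_training=True):
    """Single feature-generation path used by BOTH training and scoring,
    which removes the risk of training on a differently-preprocessed population."""
    features = {
        "hour": record["hour"],
        "is_weekend": 1 if record["day_of_week"] >= 5 else 0,
    }
    if for_training:
        features["label"] = record["label"]  # the label only exists at training time
    return features

train_row = generate_features({"hour": 8, "day_of_week": 6, "label": 1})
score_row = generate_features({"hour": 8, "day_of_week": 6}, for_training=False)
```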
39. Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing /training doesn’t affect serving model
○ Rerunning only failed parts
● Different logical parts - Different processes (@”Clean code” by Uncle Bob)
○ Easier to read
○ Easier to change code - targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
41. @Test
● Supposed to happen throughout development - if not, now is the time to make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my “production” account?
○ Formats
■ Adapt for variance in non-structured formats such as JSONs
○ Required Latency
42. Set up a baseline.
Start with a neutral launch
43. ● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building feature matrix on last day data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
You are here:
Remember : You are running with a naive model. Everything better than the old model / random is OK.
45. Optimize
What? How?
● Grid search over parameters
● Evaluate metrics
○ Using a Spark predefined Evaluator
○ Using user-defined metrics
● Cross-validate everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / Normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
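The grid-search-plus-cross-validation loop described above can be sketched in plain Python; Spark's CrossValidator does the same dance at scale. Everything here is a toy: `train_eval` predicts the training mean and scores negative MSE, and `num_trees` is a dummy knob standing in for real hyperparameters:

```python
from itertools import product
from statistics import mean

def cross_val_score(train_eval, data, params, k=3):
    """k-fold CV: hold each fold out once, average the metric."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(train_eval(train, held_out, params))
    return mean(scores)

def grid_search(train_eval, data, grid):
    """Try every parameter combination; keep the best cross-validated score."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = cross_val_score(train_eval, data, params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def train_eval(train, test, params):
    """Toy model: predict the training mean; return negative MSE (higher = better)."""
    pred = mean(row["y"] for row in train)
    return -mean((row["y"] - pred) ** 2 for row in test)

data = [{"y": float(i % 5)} for i in range(30)]
best_params, best_score = grid_search(train_eval, data, {"num_trees": [10, 100]})
```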
46. Spark ML
Building an easy-to-use wrapper around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
47. Spark ML
Building a training pipeline with spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
48. Spark ML
Cross-validate, grid search params and evaluate metrics.
Grid search with reference to the ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend and add your own metrics.
49. Spark ML
Score a feature matrix and parse output.
Get probability for the predicted class
(default is a probability vector over all classes)
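Pulling the predicted class's probability out of a per-class probability vector is just an argmax. A plain-Python sketch of that parsing step (hypothetical labels; Spark hands you the vector in a DataFrame column):

```python
def parse_prediction(prob_vector, labels):
    """From a per-class probability vector, return (predicted_label, its probability)."""
    idx = max(range(len(prob_vector)), key=prob_vector.__getitem__)
    return labels[idx], prob_vector[idx]

label, prob = parse_prediction([0.1, 0.7, 0.2], ["low", "medium", "high"])
```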
51. ● Same data, different results
○ Use preprocessed feature matrix (same one used for current model)
● Best testing - production A/B test
○ Use current production model and new model in parallel
● Metrics improvements (Remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / Revert = Update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
52. A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper.
Conf holds a mapping of: model -> user_id/subject list
Score in parallel (inside a map)
Distributed = awesome.
Fancy Scala union for all score files
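A toy version of that wrapper's core: score each user with the model they were assigned to in the conf, then union the per-model outputs. The conf contents and the stand-in scorers here are hypothetical; the real version scores inside a Spark map and unions score files in Scala:

```python
# Conf: mapping of model -> list of assigned user_ids (hypothetical contents)
conf = {
    "model_a": ["u1", "u2"],
    "model_b": ["u3"],
}

# Stand-ins for trained scorers; real ones would be loaded models
models = {
    "model_a": lambda user: 0.8,
    "model_b": lambda user: 0.3,
}

def score_all(conf, models):
    """Score each user with their assigned model, then union the outputs."""
    outputs = []
    for model_name, users in conf.items():
        scorer = models[model_name]
        outputs.extend((u, model_name, scorer(u)) for u in users)
    return outputs  # the union of all per-model score files

predictions = score_all(conf, models)
```

Deploy / revert then becomes a conf change: moving users between lists switches which model serves them.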
54. ● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. new Spark version)
○ New data sources
○ New features
● When you find something interesting - “Go to Work.”
Constant improvement
Remember : Trends and industries change, re-training on new data is not a bad thing.
56. ● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!
Enter Apache Zeppelin
58. Playing with it
Read a parquet file, show statistics, register as a table and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
62. Keep in mind
● Code produced with
○ Apache Spark 1.6 / Scala 2.11.4
● RDD VS Dataframe
○ Enter “Dataset API” (V2.0+)
● mllib VS spark.ml
○ Always use spark.ml if functionality exists
● Algorithmic Richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to desired size
● Dataframe Windows - Buggy
○ Write your own over RDD
● Parameter tuning
○ Spark.sql.partitions
○ Executors
○ Driver VS executor memory
64. Work Process
Step by step for deploying your big ML workflows to production, ready for operations and optimisations.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
74. Server Distribution Optimisation
Calculate the optimal distribution of routing servers according to geographical load.
● Better experience - faster response time
● Saves money - no need for redundant elastic scaling of servers
75. Text Mining - Topic Analysis
Topic 1 - ETA | Topic 2 - Unusual | Topic 3 - Share info | Topic 4 - Reports | Topic 5 - Jams | Topic 6 - Voice
wazers | usual | road | social | still | morgan
eta | traffic | driving | drivers | will | ang
con | stay | info | reporting | update | freeman
zona | today | using | helped | drive | kanan
usando | times | area | nearby | delay | voice
real | clear | realtime | traffic | add | meter
tiempo | slower | sharing | jam | jammed | kan
carretera | accident | soci | drive | near | masuk
76. Text Mining - New Version Impressions
● Text analysis - stemming / stopword detection etc.
● Topic modeling
● Sentiment analysis
Waze V4 update :
● Good - “redesign”, ”smarter”, “cleaner”, “improved”
● Bad - “stuck”
Overall very positive score!