Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment into a production-ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero - is about the work process you need to follow in order to have a production-ready workflow up and running.
Covering:
* Small - Medium experimentation (R)
* Big data implementation (Spark Mllib /+ pipeline)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
Production-Ready BIG ML Workflows - from zero to hero
1. By
Daniel Marcous
Google, Waze, Data Wizard
dmarcous@gmail/google.com
Big Data Analytics :
Production Ready Flows &
Waze Use Cases
2. Rules
1. Interactive is interesting.
2. If you’ve got something to say, say it!
3. Be open-minded - I’m sure I’ve got something to learn from you, and I hope you’ve got something to learn from me.
3. What’s a Data Wizard you ask?
Gain Actionable Insights!
5. What’s here?
Methodology
Deploying big models to production - step by step
Pitfalls
What to look out for in both methodology and code
Use Cases
Showing off what we actually do in Waze Analytics
Based on tough lessons learned & Google experts recommendations and inputs.
11. Bigger is better
● More processing power
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
○ Some models do not overfit even when optimised until training error reaches 0.
■ RF - more trees
■ ANN - more iterations
● Handle BIG data
○ Tons of training data (if you have it) - no need for sampling on the wrong populations!
○ Millions of features? Easy… (text processing with TF-IDF)
○ Some models (e.g. ANN) can’t do well without training on a lot of data.
13. Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow
○ Harder and more important to tell what’s gold and what’s noise
○ Unbalanced data goes a long way with more records
● Big model != Small model
○ Different parameter settings
○ Different metric readings
■ Different implementations (distributed VS central memory)
■ Different programming language (heuristics)
○ Different populations trained on (sampling)
16. Before you start
● Create example input
○ Raw input
● Create example output
○ Featured input
○ Prediction rows
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
■ Amount of subjects affected
■ Sometimes measured as average precision per K random subjects.
Remember: Desired short-term behaviour does not imply long-term behaviour.
[Diagram: Raw input → Preprocess (parse, clean, join, etc.) → Naive feature matrix → Measure]
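The metrics checklist above can be sketched in plain Python. This is a toy, not the deck's code: the labels are made up and the coverage ratio is hypothetical; in the Spark flow these reads would come from an Evaluator.

```python
# Hypothetical binary labels; in production these come off the scored feature matrix.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

def confusion(y_true, y_pred):
    """2x2 confusion matrix counts for a binary task."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Coverage: share of subjects that received any prediction at all
coverage = 8 / 10  # hypothetical: 8 subjects scored out of 10 affected
```

The same counts feed the per-class view: compute precision/recall once per label instead of only for the positive class.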
17. Preprocess
● Naive feature matrix
○ Parse (Text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
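A minimal sketch of this stage in plain Python, with hypothetical field names (`user_id`, `speed`); the real flow does Text -> RDD[Object] -> DataFrame in Spark, but the steps - clean, add day-of-week/hour features, write a readable CSV - are the same:

```python
import csv
import io
from datetime import datetime

# Hypothetical raw records standing in for parsed production data.
raw = [
    {"user_id": "u1", "ts": "2016-03-01T08:30:00", "speed": 42.0},
    {"user_id": "u2", "ts": "2016-03-01T09:00:00", "speed": -5.0},  # bad record
    {"user_id": "u3", "ts": "2016-03-05T17:15:00", "speed": 55.0},
]

# Clean: remove outliers / bad records
clean = [r for r in raw if 0 <= r["speed"] <= 200]

# Add basic features: day of week / hour
for r in clean:
    ts = datetime.fromisoformat(r["ts"])
    r["day_of_week"] = ts.weekday()
    r["hour"] = ts.hour

# Write a readable CSV you can start working with
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "day_of_week", "hour", "speed"])
writer.writeheader()
writer.writerows({k: r[k] for k in writer.fieldnames} for r in clean)
baseline_csv = buf.getvalue()
```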
22. Visualise - easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before /after joining
○ Amounts of output data
○ Metrics (See “Measure first, optimize second”)
● Different model comparison - what’s best, when and where
● Timeseries Analysis
○ Anomaly detection - Does a metric suddenly drastically change?
○ Impact analysis - Did deploying a model have a significant effect on a metric?
23. Shiny
● Web application framework for R.
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R statistical / modeling power
● Turns R function wrappers into interactive dashboard elements.
○ Generates HTML, CSS, JS behind the scenes so you only write R.
● Get started
● Get inspired
● Shiny @Waze
27. Reduce the problem
● Tradeoff : Time to market VS Loss of accuracy
● Sample data
○ Is random actually what you want?
■ Keep label distributions
■ Keep important features distributions
● Test everything you believe worthy
○ Choose model
○ Choose features (important when you go big)
■ Leave the borderline-significant ones in
○ Test different parameter configurations (you’ll need to validate your choice later)
Remember : This isn’t your production model. You’re only getting a sense of the data for now.
28. Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
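The random partitioning above - combined with the earlier advice to keep label distributions when sampling - amounts to a stratified split. A minimal sketch (the `stratified_split` helper is hypothetical, not from the deck's code; R's `caret::createDataPartition` does the same job):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_frac=0.3, seed=42):
    """Split rows into train/test while keeping the label distribution in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)                 # random partitioning within each class
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

data = [{"label": i % 2, "x": i} for i in range(100)]  # toy 50/50 labels
train, test = stratified_split(data, "label")
```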
29. Getting a feel
Logistic regression and basic variable selection with R.
Logistic regression
Variable significance test
30. Getting a feel
Advanced variable selection with regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
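The mechanism behind regularised selection is L1 shrinkage: the soft-thresholding step pushes small coefficients to exactly zero, and a zeroed coefficient is "not entered into the model". A toy sketch with hypothetical feature names (not the deck's R output):

```python
def soft_threshold(coef, lam):
    """L1 proximal step: shrink toward zero; coefficients within lam drop out entirely."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0

# Hypothetical raw coefficients from an unregularised fit
coefs = {"speed": 2.5, "hour": 0.1, "day_of_week": -0.05}
selected = {name: soft_threshold(c, lam=0.2) for name, c in coefs.items()}
# "hour" and "day_of_week" shrink to 0.0 -> excluded from the model
```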
31. Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted trees model
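RMSE itself is a one-liner; this mirrors the measure in the R screenshot (toy vectors, not the deck's data). "Lower = better" only holds when comparing models on the same data and scale:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error - lower is better (on comparable data)."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # no error
worse = rmse([0.0, 0.0], [3.0, 4.0])
```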
32. Getting a feel
Modeling bigger data with R, using parallelism.
Fit and combine 6 random forest models (10k trees each) in parallel
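The fit-in-parallel-then-combine trick works because random forest predictions are just votes, so independently trained forests merge by majority. A toy Python sketch of the merge step (`train_toy_model` is a stand-in for fitting one forest, not real training code):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_toy_model(seed):
    """Stand-in for fitting one forest partition; returns a predict function."""
    return lambda x: (x + seed) % 2  # hypothetical per-model vote

# Fit 5 independent models in parallel
with ThreadPoolExecutor(max_workers=5) as pool:
    models = list(pool.map(train_toy_model, range(5)))

def ensemble_predict(x):
    """Merge the models by majority vote, the way independent forests combine."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```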
34. Basic moving parts
[Diagram: Data sources 1..N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB + Dashboard, with a Feedback loop and a Conf. of user/model assignments closing the loop]
35. Flow motives
● Only 1 job for preprocessing
○ Used in both training and serving - reduces risk of training on wrong population
○ Should also be used before sampling when experimenting on a smaller scale.
○ When data sources are different for training and serving (RT VS Batch, for example), use interfaces!
● Saving training & scoring feature matrices aside
○ Try new algorithms / parameters on the same data
○ Measure changes on same data as used in production.
36. Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
SparkSQL UDFs
Implement feature generation -
decouples training and serving
Data cleaning work
37. Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
Generate feature matrix
Blackbox from app view
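The decoupling idea above can be sketched as one feature-generation function shared by training and scoring, so both populations see identical preprocessing. The feature names are hypothetical; the deck's real version does this with SparkSQL UDFs:

```python
def generate_features(record, for_training=True):
    """Single feature-generation path used by BOTH training and scoring,
    which removes the risk of training on a differently-preprocessed population."""
    features = {
        "hour": record["hour"],
        "is_weekend": 1 if record["day_of_week"] >= 5 else 0,
    }
    if for_training:
        features["label"] = record["label"]  # the label only exists at training time
    return features

train_row = generate_features({"hour": 8, "day_of_week": 6, "label": 1})
score_row = generate_features({"hour": 8, "day_of_week": 6}, for_training=False)
```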
39. Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing /training doesn’t affect serving model
○ Rerunning only failed parts
● Different logical parts - Different processes (@”Clean code” by Uncle Bob)
○ Easier to read
○ Easier to change code - targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
41. @Test
● Supposed to happen throughout development - if not, now is the time to make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my “production” account?
○ Formats
■ Adapt for variance in non-structured formats such as JSONs
○ Required Latency
42. Set up a baseline.
Start with a neutral launch
43. ● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building feature matrix on last day data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
You are here:
Remember : You are running with a naive model. Everything better than the old model / random is OK.
45. Optimize
What? How?
● Grid search over parameters
● Evaluate metrics
○ Using a Spark predefined Evaluator
○ Using user-defined metrics
● Cross-validate everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / Normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
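The grid-search-plus-cross-validation loop described above can be sketched in plain Python; Spark's CrossValidator does the same dance at scale. Everything here is a toy: `train_eval` predicts the training mean and scores negative MSE, and `num_trees` is a dummy knob standing in for real hyperparameters:

```python
from itertools import product
from statistics import mean

def cross_val_score(train_eval, data, params, k=3):
    """k-fold CV: hold each fold out once, average the metric."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(train_eval(train, held_out, params))
    return mean(scores)

def grid_search(train_eval, data, grid):
    """Try every parameter combination; keep the best cross-validated score."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = cross_val_score(train_eval, data, params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def train_eval(train, test, params):
    """Toy model: predict the training mean; return negative MSE (higher = better)."""
    pred = mean(row["y"] for row in train)
    return -mean((row["y"] - pred) ** 2 for row in test)

data = [{"y": float(i % 5)} for i in range(30)]
best_params, best_score = grid_search(train_eval, data, {"num_trees": [10, 100]})
```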
46. Spark ML
Building an easy-to-use wrapper around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
47. Spark ML
Building a training pipeline with spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
48. Spark ML
Cross-validate, grid search params and evaluate metrics.
Grid search with reference to the ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend and add your own metrics.
49. Spark ML
Score a feature matrix and parse output.
Get probability for the predicted class
(default is a probability vector over all classes)
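Pulling the predicted class's probability out of a per-class probability vector is just an argmax. A plain-Python sketch of that parsing step (hypothetical labels; Spark hands you the vector in a DataFrame column):

```python
def parse_prediction(prob_vector, labels):
    """From a per-class probability vector, return (predicted_label, its probability)."""
    idx = max(range(len(prob_vector)), key=prob_vector.__getitem__)
    return labels[idx], prob_vector[idx]

label, prob = parse_prediction([0.1, 0.7, 0.2], ["low", "medium", "high"])
```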
51. ● Same data, different results
○ Use preprocessed feature matrix (same one used for current model)
● Best testing - production A/B test
○ Use current production model and new model in parallel
● Metrics improvements (Remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / Revert = Update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
52. A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper.
Conf holds a mapping of: model -> user_id/subject list
Score in parallel (inside a map)
Distributed = awesome.
Fancy Scala union for all score files
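A toy version of that wrapper's core: score each user with the model they were assigned to in the conf, then union the per-model outputs. The conf contents and the stand-in scorers here are hypothetical; the real version scores inside a Spark map and unions score files in Scala:

```python
# Conf: mapping of model -> list of assigned user_ids (hypothetical contents)
conf = {
    "model_a": ["u1", "u2"],
    "model_b": ["u3"],
}

# Stand-ins for trained scorers; real ones would be loaded models
models = {
    "model_a": lambda user: 0.8,
    "model_b": lambda user: 0.3,
}

def score_all(conf, models):
    """Score each user with their assigned model, then union the outputs."""
    outputs = []
    for model_name, users in conf.items():
        scorer = models[model_name]
        outputs.extend((u, model_name, scorer(u)) for u in users)
    return outputs  # the union of all per-model score files

predictions = score_all(conf, models)
```

Deploy / revert then becomes a conf change: moving users between lists switches which model serves them.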
54. ● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. new Spark version)
○ New data sources
○ New features
● When you find something interesting - “Go to Work.”
Constant improvement
Remember : Trends and industries change, re-training on new data is not a bad thing.
56. ● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!
Enter Apache Zeppelin
58. Playing with it
Read a parquet file, show statistics, register as a table and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
62. Keep in mind
● Code produced with
○ Apache Spark 1.6 / Scala 2.11.4
● RDD VS Dataframe
○ Enter “Dataset API” (V2.0+)
● mllib VS spark.ml
○ Always use spark.ml if functionality exists
● Algorithmic Richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to desired size
● Dataframe Windows - Buggy
○ Write your own over RDD
● Parameter tuning
○ Spark.sql.partitions
○ Executors
○ Driver VS executor memory
64. Work Process
Step by step for deploying your big ML workflows to production, ready for operations and optimisations.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
74. Server Distribution Optimisation
Calculate the optimal distribution of routing servers according to geographical load.
● Better experience - faster response time
● Saves money - no need for redundant elastic scaling of servers
75. Text Mining - Topic Analysis
Topic 1 - ETA | Topic 2 - Unusual | Topic 3 - Share info | Topic 4 - Reports | Topic 5 - Jams | Topic 6 - Voice
wazers | usual | road | social | still | morgan
eta | traffic | driving | drivers | will | ang
con | stay | info | reporting | update | freeman
zona | today | using | helped | drive | kanan
usando | times | area | nearby | delay | voice
real | clear | realtime | traffic | add | meter
tiempo | slower | sharing | jam | jammed | kan
carretera | accident | soci | drive | near | masuk
76. Text Mining - New Version Impressions
● Text analysis - stemming / stopword detection etc.
● Topic modeling
● Sentiment analysis
Waze V4 update :
● Good - “redesign”, ”smarter”, “cleaner”, “improved”
● Bad - “stuck”
Overall very positive score!