By
Daniel Marcous
Google, Waze, Data Wizard
dmarcous@gmail/google.com
Big Data Analytics :
Production Ready Flows &
Waze Use Cases
Rules
1. Interactive is interesting.
2. If you have something to say, say it!
3. Be open minded - I’m sure I have something to
learn from you, and I hope you have something to
learn from me.
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
Methodology
Deploying big models to production - step by step
Pitfalls
What to look out for in both methodology and code
Use Cases
Showing off what we actually do in Waze Analytics
Based on tough lessons learned and on recommendations and input from Google experts.
Why Big Data?
Google in just 1 minute:
● 1000 new devices
● 3M searches
● 100 hours
● 1B activated devices
● 100M GB search content
10+ Years of Tackling Big Data Problems
[Timeline figure, 2002-2015, with rows for Google papers, open source and Google Cloud products.
Labels: GFS, MapReduce, BigTable, Dremel, Flume Java, Millwheel, PubSub, Tensorflow, Apache Beam.
Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable.]
“Google is living a few years in the
future and sending the rest of us
messages”
Doug Cutting, Hadoop Co-Creator
Why Big ML?
Bigger is better
● More processing power
○ Grid search all the parameters you ever
wanted.
○ Cross validate in parallel with no extra
effort.
● Keep training until you hit 0
○ Some models are hard to overfit even when you keep
optimising until the training error reaches 0.
■ RF - more trees
■ ANN - more iterations (check validation error, ANNs can still overfit)
● Handle BIG data
○ Tons of training data (if you have it) - no
need to sample and risk training on the wrong population!
○ Millions of features? Easy… (text
processing with TF-IDF)
○ Some models (ANN) do not perform well without
training on a lot of data.
Challenges
Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow
○ Harder and more important to tell what’s gold and what’s noise
○ Unbalanced data goes a long way with more records
● Big model != Small model
○ Different parameter settings
○ Different metric readings
■ Different implementations (distributed VS central memory)
■ Different programming language (heuristics)
○ Different populations trained on (sampling)
Solution = Workflow
Measure first, optimize
second.
Before you start
● Create example input
○ Raw input
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per class metrics
○ AUC
○ Coverage
■ Amount of subjects affected
■ Sometimes measured as average
precision per K random subjects.
Remember : Desired short term behaviour does not imply desired long term behaviour
Measure
Preprocess
(parse, clean, join, etc.)
● Create example output
○ Featured input
○ Prediction rows
Naive matrix
Preprocess
● Naive feature matrix
○ Parse (Text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
Preprocess
Case Class RDD to DataFrame
RDD[String] to Case Class RDD
String row to object
Preprocess
Parse string to Object with
java.sql types
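The "string row to object" step can be sketched in a few lines. This is a plain-Python stand-in for the Spark case-class parsing shown on the slides, not the talk's code. The record layout (user_id, timestamp, speed_kmh), the DriveEvent name and the 300 km/h cut-off are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


# Hypothetical record layout, for illustration only: "user_id,timestamp,speed_kmh"
@dataclass
class DriveEvent:
    user_id: str
    ts: datetime
    speed_kmh: float
    day_of_week: int  # basic feature, 0 = Monday
    hour: int         # basic feature, 0-23


def parse_row(line: str) -> Optional[DriveEvent]:
    """Parse one raw text row; return None for bad records so they can be dropped."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    user_id, raw_ts, raw_speed = parts
    try:
        ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
        speed = float(raw_speed)
    except ValueError:
        return None
    # Clean: treat missing ids and impossible speeds as bad records
    # (the 300 km/h threshold is an assumption).
    if not user_id or speed < 0 or speed > 300:
        return None
    return DriveEvent(user_id, ts, speed, ts.weekday(), ts.hour)


raw_rows = [
    "u1,2016-03-07 08:15:00,42.5",
    "u2,not-a-date,10",
    "u3,2016-03-08 23:05:00,9999",
]
# Keep only the rows that parsed and passed cleaning.
events = [e for e in map(parse_row, raw_rows) if e is not None]
```

Returning None for bad rows keeps parsing and cleaning in one place, so the same function can feed both the training and the serving flow.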
Metric Generation
Craft useful metrics.
Per class metrics
Confusion matrix by hand
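A minimal sketch of a hand-built confusion matrix with per class precision and recall, in plain Python rather than Spark. The labels ("jam", "clear") and the function names are made up for the example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def per_class_metrics(actual, predicted):
    """Precision and recall for every label, derived from the confusion matrix."""
    cm = confusion_matrix(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(cm[(a, label)] for a in labels if a != label)
        fn = sum(cm[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics


actual = ["jam", "jam", "clear", "clear", "jam", "clear"]
predicted = ["jam", "clear", "clear", "jam", "jam", "clear"]
metrics = per_class_metrics(actual, predicted)
```

Keeping the raw pair counts around makes it easy to add further metrics later without re-reading the predictions.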
Monitor.
Visualise - easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before / after joining
○ Amounts of output data
○ Metrics (See “Measure first, optimize second”)
● Different model comparison - what’s best, when and where
● Timeseries Analysis
○ Anomaly detection - Does a metric suddenly drastically change?
○ Impact analysis - Did deploying a model have a significant effect on the metric?
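One simple way to flag a metric that "suddenly drastically changes" is a rolling z-score over the previous days. The sketch below is only an illustration of that idea; the window size, the threshold and the sample series are assumptions, and a production dashboard would likely use something more robust.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, z_threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is a surprise.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily precision readings with one sudden drop.
daily_precision = [0.81, 0.80, 0.82, 0.81, 0.79, 0.80, 0.82, 0.81, 0.55, 0.80]
suspicious_days = find_anomalies(daily_precision)
```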
● Web application framework for R.
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R statistical / modeling power
● Turns R function wrappers into interactive dashboard elements.
○ Generates HTML, CSS, JS behind the scenes so you only write R.
● Get started
● Get inspired
● Shiny @Waze
Shiny
Dashboard
monitoring
Dashboard should support - picking
different models, comparing metrics.
Pick models
to compare
Statistical tests on distributions
t.test / AUC
Dashboard
monitoring
Dashboard should support -
Timeseries anomaly detection, and
impact analysis (deploying new model)
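For impact analysis, one option is to compare metric readings before and after a deployment with Welch's t statistic, which is the default test behind R's t.test. The sketch below computes only the statistic and the approximate degrees of freedom; turning these into a p-value needs a t distribution and is left out. The sample readings are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n1, n2 = len(before), len(after)
    # Squared standard errors of the two sample means.
    v1, v2 = variance(before) / n1, variance(after) / n2
    t = (mean(after) - mean(before)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation.
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof


# Hypothetical daily metric readings around a model deployment.
before = [0.70, 0.72, 0.71, 0.69, 0.70]
after = [0.75, 0.77, 0.76, 0.74, 0.76]
t_stat, dof = welch_t(before, after)
```

A large positive t suggests the metric rose after the deployment; daily readings are rarely independent, so treat the result as a hint rather than proof.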
Start small and grow.
Reduce the problem
● Tradeoff : Time to market VS Loss of accuracy
● Sample data
○ Is random actually what you want?
■ Keep label distributions
■ Keep important features distributions
● Test everything you believe worthy
○ Choose model
○ Choose features (important when you go big)
■ Leave the borderline significant ones in
○ Test different parameter configurations (you’ll need to validate your choice later)
Remember : This isn’t your production model. You’re only getting a sense of the data for now.
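"Keep label distributions" can be done with a stratified sample: sample the same fraction from every label group instead of from the whole dataset. A minimal sketch follows, with hypothetical rows and a made-up label_of accessor.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=42):
    """Sample roughly `fraction` of the rows from each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        # Keep at least one row per label so rare classes do not vanish.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical dataset: 10% "jam", 90% "clear".
rows = [{"id": i, "label": "jam" if i % 10 == 0 else "clear"} for i in range(1000)]
small = stratified_sample(rows, lambda r: r["label"], fraction=0.1)
```

The same grouping trick works for any important feature whose distribution should survive the sampling.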
Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
Getting a feel
Logistic regression and basic variable
selection with R.
Logistic regression
Variable significance test
Getting a feel
Advanced variable selection with
regularisation techniques in R.
Coefficients - by significance
No coefficient = not entered into the model
Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted
trees model
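For reference, root mean square error is the square root of the mean squared difference between actual and predicted values. A small plain-Python version (the slide computes it in R):

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(sum(squared_errors) / len(actual))


error = rmse([3, 5, 7], [2, 5, 9])
```

The "kinda" on the slide is worth remembering: a lower RMSE is only meaningful when models are compared on the same held-out data.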
Getting a feel
Modeling bigger data with R, using
parallelism.
Fit and combine 6 random forest
models (10k trees each) in parallel
Start with a flow.
Basic moving parts
Data
source 1
Data
source N
Preprocess
Training
Feature matrix
Scoring
Models
1..N
Predictions
1..N
Dashboard
Serving DB
Feedback loop
Conf.
User/Model assignments
Flow motives
● Only 1 job for preprocessing
○ Used in both training and serving - reduces risk of training on wrong population
○ Should also be used before sampling when experimenting on a smaller scale.
○ When data sources are different for training and serving (RT VS Batch for example) use
interfaces!
● Saving training & scoring feature matrices aside
○ Try new algorithms / parameters on the same data
○ Measure changes on same data as used in production.
Reusable flow code
Create a feature generation interface
and some UDFs with Spark. Use later
for both training and scoring purposes
with minor changes.
SparkSQL UDFs
Implement feature generation -
decouples training and serving
Data cleaning work
Reusable flow code
Generate feature matrix
Blackbox from app view
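The idea of one preprocessing path behind an interface can be sketched as below. This is a plain-Python illustration, not the talk's Spark code; EventSource, BatchSource, RealTimeSource and the event fields are hypothetical names.

```python
from abc import ABC, abstractmethod
from datetime import datetime


class EventSource(ABC):
    """Hypothetical interface: anything that yields raw event dicts."""

    @abstractmethod
    def read(self):
        ...


class BatchSource(EventSource):
    """Training side: reads a stored batch of events."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        return list(self.rows)


class RealTimeSource(EventSource):
    """Serving side: drains whatever is currently queued."""

    def __init__(self, queue):
        self.queue = queue

    def read(self):
        drained, self.queue = self.queue, []
        return drained


def generate_features(event):
    """The single place where features are defined."""
    ts = datetime.fromisoformat(event["ts"])
    return {"user_id": event["user_id"], "day_of_week": ts.weekday(), "hour": ts.hour}


def build_feature_matrix(source: EventSource):
    """A black box from the application's point of view."""
    return [generate_features(e) for e in source.read()]


event = {"user_id": "u1", "ts": "2016-03-07T08:15:00"}
training_matrix = build_feature_matrix(BatchSource([event]))
serving_matrix = build_feature_matrix(RealTimeSource([event]))
```

Because both sources pass through generate_features, training and serving cannot drift apart in how a feature is computed.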
Good ML code trumps
performance.
Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing / training doesn’t affect serving model
○ Rerunning only failed parts
● Different logical parts - Different processes (see “Clean Code” by Uncle Bob)
○ Easier to read
○ Easier to change code - targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
Test your infrastructure.
@Test
● Supposed to happen throughout development; if it did not, now is the time to
make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my “production” account?
○ Formats
■ Adapt for variance in non-structured formats such as JSON
○ Required Latency
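A null-rate check is one of the cheapest infrastructure tests to automate. The sketch below assumes rows are dictionaries and that each column has an agreed maximum null rate; the column names and limits are made up.

```python
def null_rates(rows, columns):
    """Fraction of rows in which each column is missing or None."""
    total = len(rows)
    if total == 0:
        return {c: 0.0 for c in columns}
    return {c: sum(1 for r in rows if r.get(c) is None) / total for c in columns}


def check_null_rates(rows, max_rates):
    """Return the columns whose null rate exceeds the allowed maximum."""
    rates = null_rates(rows, list(max_rates))
    return {c: rate for c, rate in rates.items() if rate > max_rates[c]}


# Hypothetical sample of parsed input rows.
rows = [
    {"user_id": "u1", "speed": 42.0},
    {"user_id": "u2", "speed": None},
    {"user_id": "u3"},
    {"user_id": "u4", "speed": 12.5},
]
violations = check_null_rates(rows, {"user_id": 0.0, "speed": 0.25})
```

Running a check like this on every input batch catches silent upstream format changes before they reach training.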
Set up a baseline.
Start with a neutral launch
● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building feature matrix on last day data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
You are here:
Remember : You are running with a naive model. Anything better than the old model / random is OK.
Go to work.
Coffee recommended at this point.
Optimize
What? How?
● Grid search over
parameters
● Evaluate metrics
○ Using a Spark predefined
Evaluator
○ Using user defined metrics
● Cross validate Everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / Normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
Spark ML
Building an easy to use wrapper
around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
Spark ML
Building a training pipeline with
spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
Spark ML
Cross-validate, grid search params and
evaluate metrics.
Grid search with reference to
ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend
and add your own metrics.
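To show the mechanics without a cluster, here is a plain-Python stand-in for grid search with k-fold cross-validation. In spark.ml the same roles are played by ParamGridBuilder and CrossValidator. The toy 1-D nearest-neighbour model, the data and the grid values are assumptions chosen only to keep the example short.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D features)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def cross_val_accuracy(data, k, folds=5):
    """Average accuracy over `folds` interleaved train/test splits."""
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / folds


def grid_search(data, k_values, folds=5):
    """Evaluate every parameter value and return the best one with all scores."""
    results = {k: cross_val_accuracy(data, k, folds) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results


# Toy data: label is 1 when x >= 2.5, plus one noisy point.
data = [(x / 10, 1 if x >= 25 else 0) for x in range(50)]
data[3] = (0.3, 1)
best_k, results = grid_search(data, k_values=[1, 3, 5])
```

The chosen parameter should still be validated on data that took no part in the search, as the deck notes earlier.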
Spark ML
Score a feature matrix and parse
output.
Get probability for predicted class
(default is a probability vector for all classes)
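Given a probability vector over all classes, the predicted class and its probability are simply the position and value of the maximum. A minimal plain-Python sketch (the function name is hypothetical):

```python
def predicted_class_probability(probabilities):
    """Return (class_index, probability) for the most likely class."""
    if not probabilities:
        raise ValueError("empty probability vector")
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, probabilities[index]


best_class, confidence = predicted_class_probability([0.1, 0.7, 0.2])
```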
A/B
Test your changes
● Same data, different results
○ Use preprocessed feature matrix (same one used for current model)
● Best testing - production A/B test
○ Use current production model and new model in parallel
● Metrics improvements (Remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improves preprocessing / modeling)
● Deploy / Revert = Update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
A/B Infrastructures
Setting up a very basic A/B testing
infrastructure built upon our earlier
presented modeling wrapper.
Conf holds a mapping of:
model -> user_id/subject list
Score in parallel (inside a map)
Distributed=awesome.
Fancy scala union for all score files
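A very small A/B setup along these lines could look like the sketch below: a config maps each model to its users, every model scores only its own users, and the per-model outputs are unioned. This is plain Python with threads standing in for distributed scoring; the models, the feature name and the assignments are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical models: each maps a feature dict to a score in [0, 1].
def current_model(features):
    return min(1.0, features["drives_last_week"] / 10)


def candidate_model(features):
    return min(1.0, features["drives_last_week"] / 8)


MODELS = {"current": current_model, "candidate": candidate_model}
# Conf: model -> list of user ids. Deploy / revert = edit this mapping.
ASSIGNMENTS = {"current": ["u1", "u2", "u3"], "candidate": ["u4"]}


def score_model(model_name, feature_matrix, assignments):
    """Score only the users assigned to the given model."""
    users = set(assignments[model_name])
    model = MODELS[model_name]
    return [
        {"user_id": uid, "model": model_name, "score": model(features)}
        for uid, features in feature_matrix.items()
        if uid in users
    ]


def score_all(feature_matrix, assignments=ASSIGNMENTS):
    """Score every model in parallel and union the per-model outputs."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(
            lambda name: score_model(name, feature_matrix, assignments),
            assignments,
        ))
    return [row for part in parts for row in part]


feature_matrix = {
    "u1": {"drives_last_week": 5},
    "u2": {"drives_last_week": 12},
    "u3": {"drives_last_week": 0},
    "u4": {"drives_last_week": 4},
}
scores = score_all(feature_matrix)
```

All models score the same preprocessed feature matrix, so any difference in results comes from the models and not from the data.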
Watch. Iterate.
● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. new Spark version)
○ New data sources
○ New features
● When you find something interesting - “Go to Work.”
Constant improvement
Remember : Trends and industries change, re-training on new data is not a bad thing.
Ad-Hoc statistics
● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7
○ How accurate were we on last month predictions? (join with real data)
● No need to compile anything!
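The ad-hoc questions above mostly boil down to a filter, a count and a join. A plain-Python sketch with made-up prediction rows and actual outcomes follows; in a notebook the same questions would be asked of the stored prediction tables.

```python
# Hypothetical prediction rows and real outcomes, keyed by user id.
predictions = [
    {"user_id": "u1", "probability": 0.90, "predicted": "jam"},
    {"user_id": "u2", "probability": 0.60, "predicted": "clear"},
    {"user_id": "u3", "probability": 0.75, "predicted": "jam"},
    {"user_id": "u4", "probability": 0.80, "predicted": "clear"},
]
actuals = {"u1": "jam", "u2": "jam", "u3": "jam"}


def count_confident(predictions, threshold=0.7):
    """How many predictions had a probability above the threshold?"""
    return sum(1 for p in predictions if p["probability"] > threshold)


def accuracy(predictions, actuals):
    """Join predictions with real outcomes and report the share that matched."""
    joined = [p for p in predictions if p["user_id"] in actuals]
    if not joined:
        return 0.0
    correct = sum(1 for p in joined if p["predicted"] == actuals[p["user_id"]])
    return correct / len(joined)


confident = count_confident(predictions)
last_month_accuracy = accuracy(predictions, actuals)
```

Predictions without a known outcome (u4 here) are left out of the accuracy figure, which mirrors an inner join.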
Enter Apache Zeppelin
Playing with it
Setting up Zeppelin to use our jars.
Playing with it
Read a Parquet file, show statistics,
register it as a table and run SparkSQL on
it.
Parquet - already has a schema inside
For usage in SparkSQL
Playing with it
Using spark-csv by Databricks.
CSV to DataFrame by
Databricks
Using user compiled code inside a
notebook.
Playing with it
Bring your own code
Technological pitfalls
Keep in mind
● Code produced with
○ Apache Spark 1.6 / Scala 2.11.4
● RDD VS Dataframe
○ Enter “Dataset API” (V2.0+)
● mllib VS spark.ml
○ Always use spark.ml if
functionality exists
● Algorithmic Richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to desired size
● Dataframe Windows - Buggy
○ Write your own over RDD
● Parameter tuning
○ spark.sql.shuffle.partitions
○ Executors
○ Driver VS executor memory
Putting it all together
Work Process
Step by step for deploying your big ML
workflows to production, ready for
operations and optimisations.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
Code:
https://github.com/dmarcous/BigMLFlow/
Slides:
http://www.slideshare.net/DanielMarcous/productionready-big-ml-workflows-from-zero-to-hero
Use Cases
What does Waze do with all its data?
Trending Locations / Day of Week Breakdown
Opening Hours Inference
Optimising - Ad clicks / Time from drive start
Time to Content (US) - Day of week / Category
Irregular Events / Anomaly Detection
Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large
numbers of users.
Dangerous Places - Clustering
Find most dangerous areas / streets, using custom developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)
Parking Places Detection
Parking entrance
Parking lot
Street parking
Server Distribution Optimisation
Calculate the optimal routing servers distribution according to geographical load.
● Better experience - faster response time
● Saves money - no need for redundant elastic scaling of servers
Text Mining - Topic Analysis
Topic 1 - ETA | Topic 2 - Unusual | Topic 3 - Share info | Topic 4 - Reports | Topic 5 - Jams | Topic 6 - Voice
wazers        | usual             | road                 | social            | still          | morgan
eta           | traffic           | driving              | drivers           | will           | ang
con           | stay              | info                 | reporting         | update         | freeman
zona          | today             | using                | helped            | drive          | kanan
usando        | times             | area                 | nearby            | delay          | voice
real          | clear             | realtime             | traffic           | add            | meter
tiempo        | slower            | sharing              | jam               | jammed         | kan
carretera     | accident          | soci                 | drive             | near           | masuk
Text Mining - New Version Impressions
● Text analysis - stemming / stopword detection etc.
● Topic modeling
● Sentiment analysis
Waze V4 update :
● Good - “redesign”, ”smarter”, “cleaner”, “improved”
● Bad - “stuck”
Overall very positive score!
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Daniel Marcous
dmarcous@google.com
dmarcous@gmail.com

More Related Content

What's hot

Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processingYogi Devendra Vyavahare
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentationPrzemysław Pastuszka
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeGlobant
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studyViet-Trung TRAN
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero Lars Albertsson
 
2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)Albert Wong
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataStavros Kontopoulos
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesVasia Kalavri
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaObjectRocket
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data SuccessLars Albertsson
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisYuanyuan Tian
 

What's hot (20)

Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Introduction to Real-time data processing
Introduction to Real-time data processingIntroduction to Real-time data processing
Introduction to Real-time data processing
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
The Big Bad Data
The Big Bad DataThe Big Bad Data
The Big Bad Data
 
Towards Data Operations
Towards Data OperationsTowards Data Operations
Towards Data Operations
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
 
Sistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta LakeSistema de recomendación entiempo real usando Delta Lake
Sistema de recomendación entiempo real usando Delta Lake
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case study
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero
 
2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)2013 DATA @ NFLX (Tableau User Group)
2013 DATA @ NFLX (Tableau User Group)
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data Success
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 

Viewers also liked

Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleDaniel Marcous
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Daniel Marcous
 
Waze Partnership
Waze PartnershipWaze Partnership
Waze PartnershipEd Blayney
 
How data is renewing and reshaping rio de janeiro
How data is renewing and reshaping rio de janeiroHow data is renewing and reshaping rio de janeiro
How data is renewing and reshaping rio de janeiroPablo Cerdeira
 
Intelligent Transportation Systems for a Smart City
Intelligent Transportation Systems for a Smart City Intelligent Transportation Systems for a Smart City
Intelligent Transportation Systems for a Smart City Charles Mok
 
Social Commerce Summit
Social Commerce SummitSocial Commerce Summit
Social Commerce SummitAndy Ellwood
 
Waze Overview
Waze OverviewWaze Overview
Waze OverviewDee E
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
 
Resistance in technology change
Resistance in technology changeResistance in technology change
Resistance in technology changeMahendra Gehlot
 
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]HUB INSTITUTE
 
CONTRTATO. CASO EJEMPLAR
CONTRTATO. CASO EJEMPLARCONTRTATO. CASO EJEMPLAR
CONTRTATO. CASO EJEMPLARhector bolivar
 
Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loopJosh Baer
 

Viewers also liked (17)

Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @Google
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)
 
Waze Partnership
Waze PartnershipWaze Partnership
Waze Partnership
 
How data is renewing and reshaping rio de janeiro
How data is renewing and reshaping rio de janeiroHow data is renewing and reshaping rio de janeiro
How data is renewing and reshaping rio de janeiro
 
WAZE
WAZE WAZE
WAZE
 
Intelligent Transportation Systems for a Smart City
Intelligent Transportation Systems for a Smart City Intelligent Transportation Systems for a Smart City
Intelligent Transportation Systems for a Smart City
 
Google Waze Ads
Google Waze AdsGoogle Waze Ads
Google Waze Ads
 
Social Commerce Summit
Social Commerce SummitSocial Commerce Summit
Social Commerce Summit
 
Waze Overview
Waze OverviewWaze Overview
Waze Overview
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Resistance in technology change
Resistance in technology changeResistance in technology change
Resistance in technology change
 
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]
Waze - An introduction to Location Based Marketing on Waze [HUBFORUM]
 
CONTRTATO. CASO EJEMPLAR
CONTRTATO. CASO EJEMPLARCONTRTATO. CASO EJEMPLAR
CONTRTATO. CASO EJEMPLAR
 
business model of waze
business model of wazebusiness model of waze
business model of waze
 
Conceptos de neurosis, líbido y trauma
Conceptos de neurosis, líbido y traumaConceptos de neurosis, líbido y trauma
Conceptos de neurosis, líbido y trauma
 
Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loop
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 

Similar to Production-Ready BIG ML Workflows - from zero to hero

End to end MLworkflows
End to end MLworkflowsEnd to end MLworkflows
End to end MLworkflowsAdam Gibson
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning SystemsXavier Amatriain
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15MLconf
 
10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConf10 more lessons learned from building Machine Learning systems - MLConf
10 more lessons learned from building Machine Learning systems - MLConfXavier Amatriain
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systemsXavier Amatriain
 
Predicting Tweet Sentiment
Predicting Tweet SentimentPredicting Tweet Sentiment
Predicting Tweet SentimentLucinda Linde
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...
Volodymyr Lyubinets. One startup's journey of building ML pipelines for text ...Lviv Startup Club
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scaleOwen Zhang
 
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...
Enterprise PHP Architecture through Design Patterns and Modularization (Midwe...Aaron Saray
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanycOpen Analytics
 
BIG Data Science: A Path Forward
BIG Data Science:  A Path ForwardBIG Data Science:  A Path Forward
BIG Data Science: A Path ForwardDan Mallinger
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfvitm11
 

Similar to Production-Ready BIG ML Workflows - from zero to hero (20)

A Kaggle Talk
A Kaggle TalkA Kaggle Talk
A Kaggle Talk
 
End to end MLworkflows
End to end MLworkflowsEnd to end MLworkflows
End to end MLworkflows
 
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
Strata 2016 -  Lessons Learned from building real-life Machine Learning SystemsStrata 2016 -  Lessons Learned from building real-life Machine Learning Systems
Strata 2016 - Lessons Learned from building real-life Machine Learning Systems
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15Xavier Amatriain, VP of Engineering, Quora at MLconf SF - 11/13/15
Production-Ready BIG ML Workflows - from zero to hero

  • 1. By Daniel Marcous Google, Waze, Data Wizard dmarcous@gmail/google.com Big Data Analytics : Production Ready Flows & Waze Use Cases
  • 2. Rules 1. Interactive is interesting. 2. If you've got something to say, say it! 3. Be open minded - I’m sure I've got something to learn from you, and I hope you've got something to learn from me.
  • 3. What’s a Data Wizard you ask? Gain Actionable Insights!
  • 5. What’s here? Methodology Deploying big models to production - step by step Pitfalls What to look out for in both methodology and code Use Cases Showing off what we actually do in Waze Analytics Based on tough lessons learned & Google experts recommendations and inputs.
  • 7. Google in just 1 minute: 1000 new devices 3M Searches 100 Hours 1B Activated Devices 100M GB Search Content
  • 8. 10+ Years of Tackling Big Data Problems [Timeline figure, 2002-2015. Row labels: Google Papers, Open Source, Google Cloud Products. Items shown: GFS, MapReduce, FlumeJava, MillWheel, BigTable, Dremel, PubSub, Apache Beam, TensorFlow, BigQuery, Pub/Sub, Dataflow, Bigtable]
  • 9. “Google is living a few years in the future and sending the rest of us messages” Doug Cutting, Hadoop Co-Creator
  • 11. Bigger is better ● More processing power ○ Grid search all the parameters you ever wanted. ○ Cross validate in parallel with no extra effort. ● Keep training until you hit 0 ○ Some models cannot overfit when optimising until training error is 0. ■ RF - more trees ■ ANN - more iterations ● Handle BIG data ○ Tons of training data (if you have it) - no need for sampling on wrong populations! ○ Millions of features? Easy… (text processing with TF-IDF) ○ Some models (ANN) can’t do well without training on a lot of data.
  • 13. Bigger is harder ● Skill gap - Big data engineer (Scala/Java) VS Researcher/PhD (Python/R) ● Curse of dimensionality ○ Some algorithms require exponential time/memory as dimensions grow ○ Harder and more important to tell what’s gold and what’s noise ○ Unbalanced data goes a long way with more records ● Big model != Small model ○ Different parameter settings ○ Different metric readings ■ Different implementations (distributed VS central memory) ■ Different programming language (heuristics) ○ Different populations trained on (sampling)
  • 16. Before you start ● Create example input ○ Raw input ● Create example output ○ Featured input ○ Prediction rows ● Set up your metrics ○ Derived from business needs ○ Confusion matrix ○ Precision / recall ■ Per class metrics ○ AUC ○ Coverage ■ Amount of subjects affected ■ Sometimes measured as average precision per K random subjects. Remember : Desired short term behaviour does not imply long term behaviour. [Diagram labels: Measure; Preprocess (parse, clean, join, etc.); Naive Matrix]
  • 17. Preprocess ● Naive feature matrix ○ Parse (Text -> RDD[Object] -> DataFrame) ○ Clean (remove outliers / bad records) ○ Join ○ Remove non-features ● Get real data ● Create a baseline dataset for training ○ Add some basic features ■ Day of week / hour / etc. ○ Write a READABLE CSV that you can start and work with.
  • 18. Preprocess Case Class RDD to DataFrame RDD[String] to Case Class RDD String row to object
  • 19. Preprocess Parse string to Object with java.sql types
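The parsing step named on the two slides above (string row to typed object) can be sketched in plain Python. This is only an illustration, not the deck's Scala code: the `Drive` record, its three fields and the comma-separated layout are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Drive:
    """Hypothetical typed record; field names are illustrative only."""
    user_id: int
    started_at: datetime
    distance_km: float


def parse_row(line):
    """Parse one 'user_id,timestamp,distance' string into a Drive."""
    user_id, started_at, distance_km = line.strip().split(",")
    return Drive(
        int(user_id),
        datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S"),
        float(distance_km),
    )


def parse_rows(lines):
    """Parse every line, skipping rows that fail type conversion."""
    parsed = []
    for line in lines:
        try:
            parsed.append(parse_row(line))
        except ValueError:
            # Wrong column count or unparsable value: treat as a bad record.
            continue
    return parsed


drives = parse_rows(["42,2016-03-07 08:30:00,12.5", "garbage"])
```

Dropping bad rows silently is a choice made for brevity; in a real flow you would probably count them, since the null and error rates are themselves worth monitoring.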
  • 20. Metric Generation Craft useful metrics. Per class metrics Confusion matrix by hand
  • 22. Visualise - easiest way to measure quickly ● Set up your dashboard ○ Amounts of input data ■ Before /after joining ○ Amounts of output data ○ Metrics (See “Measure first, optimize second”) ● Different model comparison - what’s best, when and where ● Timeseries Analysis ○ Anomaly detection - Does a metric suddenly drastically change? ○ Impact analysis - Did deploying a model have a significant effect on metric change?
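As a rough sketch of what "confusion matrix by hand" and per class metrics can look like, the following counts (actual, predicted) pairs and derives precision and recall per label. The label names and sample data are invented for the example; the original slide used Spark, which is not reproduced here.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall)} derived from the pair counts."""
    cm = confusion_matrix(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        true_positives = cm[(label, label)]
        predicted_as_label = sum(cm[(a, label)] for a in labels)
        actually_label = sum(cm[(label, p)] for p in labels)
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Made-up labels, purely for illustration.
actual = ["jam", "jam", "clear", "clear", "clear", "accident"]
predicted = ["jam", "clear", "clear", "clear", "jam", "accident"]
metrics = per_class_metrics(actual, predicted)
```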
  • 23. Shiny ● Web application framework for R. ○ Introduces user interaction to analysis ○ Combines ad-hoc testing with R statistical / modeling power ● Turns R function wrappers to interactive dashboard elements. ○ Generates HTML, CSS, JS behind the scenes so you only write R. ● Get started ● Get inspired ● Shiny @Waze
  • 24. Dashboard monitoring Dashboard should support - picking different models, comparing metrics. Pick models to compare Statistical tests on distributions t.test / AUC
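One way to compare two models on the same labelled data, as the dashboard slide above suggests, is AUC. A minimal pairwise sketch follows; it is quadratic in the number of rows, so it only suits small samples, and the scores are invented for the example.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Assumes binary labels 0/1 with at least
    one example of each class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
auc_model_a = auc([0.9, 0.8, 0.3, 0.1], labels)  # ranks every positive first
auc_model_b = auc([0.9, 0.2, 0.3, 0.1], labels)  # misranks one pair
```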
  • 25. Dashboard monitoring Dashboard should support - Timeseries anomaly detection, and impact analysis (deploying new model)
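The time series anomaly detection mentioned above could be approximated with a trailing-window z-score. This is a deliberately simple stand-in, not the method used at Waze; the window size, threshold and metric values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical daily precision readings with one sudden drop.
daily_precision = [0.81, 0.80, 0.82, 0.81, 0.79, 0.80, 0.82, 0.81, 0.55, 0.80]
anomalies = find_anomalies(daily_precision)
```

Note that the outlier itself enters the following windows and inflates their standard deviation, so a sustained shift is only flagged on its first day with this approach.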
  • 27. Reduce the problem ● Tradeoff : Time to market VS Loss of accuracy ● Sample data ○ Is random actually what you want? ■ Keep label distributions ■ Keep important features distributions ● Test everything you believe worthy ○ Choose model ○ Choose features (important when you go big) ■ Leave the borderline significant ones in ○ Test different parameter configurations (you’ll need to validate your choice later) Remember : This isn’t your production model. You’re only getting a sense of the data for now.
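"Keep label distributions" when sampling amounts to stratified sampling. A small sketch under assumed names (`rows`, `label_of`, a 10% fraction) is shown below; a fixed seed keeps the sample reproducible.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=42):
    """Sample the same fraction from every label group so the label
    distribution of the sample matches the full data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        # Keep at least one row per label so rare classes never vanish.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical data: 10% of rows are clicks.
rows = [{"id": i, "label": "click" if i % 10 == 0 else "no_click"} for i in range(1000)]
sample = stratified_sample(rows, lambda r: r["label"], fraction=0.1)
```

The same idea extends to important features: stratify on a (label, feature bucket) key instead of the label alone.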
  • 28. Getting a feel Exploring a dataset with R. Dividing data to training and testing. Random partitioning
  • 29. Getting a feel Logistic regression and basic variable selection with R. Logistic regression Variable significance test
  • 30. Getting a feel Advanced variable selection with regularisation techniques in R. Intercepts - by significance No intercept = not entered to model
  • 31. Getting a feel Trying modeling techniques in R. Root mean square error Lower = better (~ kinda) Fit a gradient boosted trees model
  • 32. Getting a feel Modeling bigger data with R, using parallelism. Fit and combine 6 random forest models (10k trees each) in parallel
  • 33. Start with a flow.
  • 34. Basic moving parts Data source 1 Data source N Preprocess Training Feature matrix Scoring Models 1..N Predictions 1..N Dashboard Serving DB Feedback loop Conf. User/Model assignments
  • 35. Flow motives ● Only 1 job for preprocessing ○ Used in both training and serving - reduces risk of training on wrong population ○ Should also be used before sampling when experimenting on a smaller scale. ○ When data sources are different for training and serving (RT VS Batch for example) use interfaces! ● Saving training & scoring feature matrices aside ○ Try new algorithms / parameters on the same data ○ Measure changes on same data as used in production.
  • 36. Reusable flow code Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes. SparkSQL UDFs Implement feature generation - decouples training and serving Data cleaning work
  • 37. Reusable flow code Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes. Generate feature matrix Blackbox from app view
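The idea behind the two "Reusable flow code" slides, one feature generation routine shared by training and scoring, can be sketched without Spark. The record fields (`timestamp`, `speed_kmh`, `label`) are hypothetical; the point is only that both entry points call the same function.

```python
from datetime import datetime


def generate_features(record):
    """Single feature-generation routine shared by training and scoring."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {
        "day_of_week": ts.weekday(),  # 0 = Monday
        "hour": ts.hour,
        "speed_kmh": float(record["speed_kmh"]),
    }


def build_training_matrix(records):
    """Training needs the label alongside the features."""
    return [(generate_features(r), r["label"]) for r in records]


def build_scoring_matrix(records):
    """Scoring sees no label but goes through the exact same features."""
    return [generate_features(r) for r in records]


record = {"timestamp": "2016-03-07 08:30:00", "speed_kmh": "42.5", "label": 1}
training_matrix = build_training_matrix([record])
scoring_matrix = build_scoring_matrix([record])
```

Because the application only calls the two `build_*` functions, the feature logic stays a black box from its point of view and cannot drift between training and serving.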
  • 38. Good ML code trumps performance.
  • 39. Why so many parts you ask? ● Scaling ● Fault tolerance ○ Failed preprocessing /training doesn’t affect serving model ○ Rerunning only failed parts ● Different logical parts - Different processes (@”Clean code” by Uncle Bob) ○ Easier to read ○ Easier to change code - targeted changes only affect their specific process ○ One input, one output (almost…) ● Easier to tweak and deploy changes
  • 41. @Test ● Supposed to happen throughout development, if not - now is the time to make sure you have it! ○ Data read correctly ■ Null rates? ○ Features calculated correctly ■ Does my complex join / function / logic return what it should? ○ Access ■ Can I access all the data sources from my “production” account? ○ Formats ■ Adapt for variance in non-structured formats such as JSONs ○ Required Latency
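A null-rate check of the kind listed above might look like the following. The column names and the allowed rates are assumptions for illustration; in practice the thresholds would come from what you observed on known-good data.

```python
def null_rates(rows, columns):
    """Fraction of rows in which each column is missing or None."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def check_null_rates(rows, max_rates):
    """Return only the columns whose null rate exceeds the allowed maximum."""
    rates = null_rates(rows, max_rates.keys())
    return {col: rate for col, rate in rates.items() if rate > max_rates[col]}


# Hypothetical input rows; speed_kmh is missing in half of them.
rows = [
    {"user_id": 1, "speed_kmh": 42.0},
    {"user_id": 2, "speed_kmh": None},
    {"user_id": 3, "speed_kmh": 57.5},
    {"user_id": 4},
]
violations = check_null_rates(rows, {"user_id": 0.0, "speed_kmh": 0.25})
```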
  • 42. Set up a baseline. Start with a neutral launch
  • 43. ● Take a snapshot of your metric reads: ○ The ones you chose earlier in the process as important to you ■ Confusion matrix ■ Weighted average % classified correctly ■ % subject coverage ● Latency ○ Building feature matrix on last day data takes X minutes ○ Training takes X hours ○ Serving predictions on Y records takes X seconds You are here: Remember : You are running with a naive model. Everything better than the old model / random is OK.
  • 44. Go to work. Coffee recommended at this point.
  • 45. Optimize What? How? ● Grid search over parameters ● Evaluate metrics ○ Using a Spark predefined Evaluator ○ Using user defined metrics ● Cross validate Everything ● Tweak preprocessing (mainly features) ○ Feature engineering ○ Feature transformers ■ Discretize / Normalise ○ Feature selectors ○ In Apache Spark 1.6 ● Tweak training ○ Different models ○ Different model parameters
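Grid search combined with cross-validation, as listed above, reduces to two nested loops: one over parameter combinations and one over folds. The sketch below is library-free and uses a toy threshold "model" whose `fit` ignores the training data, purely to keep the example short; `fit`, `score` and the grid are all hypothetical stand-ins for a real estimator and evaluator.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        start = i * fold
        end = n if i == k - 1 else (i + 1) * fold
        yield list(range(0, start)) + list(range(end, n)), list(range(start, end))


def grid_search(data, param_grid, fit, score, k=3):
    """Return the parameter combination with the best mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(data), k):
            model = fit([data[i] for i in train_idx], **params)
            fold_scores.append(score(model, [data[i] for i in test_idx]))
        average = sum(fold_scores) / len(fold_scores)
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score


# Toy problem: label is 1 when x >= 5; the "model" is just a threshold rule.
data = [(x, int(x >= 5)) for x in range(12)]


def fit(train, threshold):
    return lambda x: int(x >= threshold)


def score(model, test):
    return sum(model(x) == y for x, y in test) / len(test)


best_params, best_score = grid_search(data, {"threshold": [3, 5, 7]}, fit, score)
```

Each (parameters, fold) pair is independent of the others, which is what makes this step easy to parallelise on a cluster.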
  • 46. Spark ML Building an easy to use wrapper around training and serving. Build model pipeline, train, evaluate Not necessarily a random split
  • 47. Spark ML Building a training pipeline with spark.ml. Create dummy variables Required response label format The ML model itself Labels back to readable format Assembled training pipeline
  • 48. Spark ML Cross-validate, grid search params and evaluate metrics. Grid search with reference to ML model stage (RF) Metrics to evaluate Yes, you can definitely extend and add your own metrics.
  • 49. Spark ML Score a feature matrix and parse output. Get probability for predicted class (default is a probability vector for all classes)
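Extracting the probability of the predicted class from a per-class probability vector is an argmax followed by a lookup. A minimal sketch, with made-up labels and probabilities:

```python
def predicted_class_with_probability(probabilities, labels):
    """Pick the most likely class and return it with its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


# Hypothetical probability vector over three classes.
label, probability = predicted_class_with_probability(
    [0.1, 0.7, 0.2], ["jam", "clear", "accident"]
)
```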
  • 51. Compare to baseline ● Same data, different results ○ Use preprocessed feature matrix (same one used for current model) ● Best testing - production A/B test ○ Use current production model and new model in parallel ● Metrics improvements (Remember your dashboard?) ○ Time series analysis of metrics ○ Compare metrics over different code versions (improves preprocessing / modeling) ● Deploy / Revert = Update user assignments ○ Based on new metrics / feedback loop if possible
  • 52. A/B Infrastructures Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper. Conf holds a mapping of: model -> user_id/subject list Score in parallel (inside a map) Distributed=awesome. Fancy Scala union for all score files
  • 54. Constant improvement ● Respond to anomalies (alerts) on metric reads ● Try out new stuff ○ Tech versions (e.g. new Spark version) ○ New data sources ○ New features ● When you find something interesting - “Go to Work.” Remember : Trends and industries change, re-training on new data is not a bad thing.
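The A/B set-up described above (a configuration mapping each model to its users, scoring each group in parallel, then a union of the outputs) can be sketched with a thread pool. Model names, user ids, features and scoring functions are all invented here; the original used Spark and Scala.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration: which model serves which users.
ASSIGNMENTS = {
    "current_model": [1, 2, 3],
    "candidate_model": [4, 5],
}

# Stand-in models: each maps a feature dict to a score.
MODELS = {
    "current_model": lambda features: 0.5,
    "candidate_model": lambda features: min(1.0, features["drives_last_week"] / 10),
}

# Stand-in feature matrix keyed by user id.
FEATURES = {user_id: {"drives_last_week": user_id * 2} for user_id in range(1, 6)}


def score_group(model_name):
    """Score only the users assigned to this model."""
    model = MODELS[model_name]
    return [
        {"user_id": uid, "model": model_name, "score": model(FEATURES[uid])}
        for uid in ASSIGNMENTS[model_name]
    ]


# Score every model's group in parallel; map() preserves input order.
with ThreadPoolExecutor() as pool:
    score_files = list(pool.map(score_group, ASSIGNMENTS))

# Union of all per-model outputs into one prediction set.
all_scores = [row for part in score_files for row in part]
```

Deploying or reverting a model then means editing `ASSIGNMENTS` only; no scoring code changes.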
  • 56. Enter Apache Zeppelin ● If you wrote your code right, you can easily reuse it in a notebook! ● Answer ad-hoc questions ○ How many predictions did you output last month? ○ How many new users had a prediction with probability > 0.7? ○ How accurate were we on last month's predictions? (join with real data) ● No need to compile anything!
  • 57. Playing with it Setting up Zeppelin to use our jars.
  • 58. Playing with it Read a parquet file , show statistics, register as table and run SparkSQL on it. Parquet - already has a schema inside For usage in SparkSQL
  • 59. Playing with it Using spark-csv by Databricks. CSV to DataFrame by Databricks
  • 60. Playing with it Using user compiled code inside a notebook. Bring your own code
  • 62. Keep in mind ● Code produced with ○ Apache Spark 1.6 / Scala 2.11.4 ● RDD VS DataFrame ○ Enter “Dataset API” (V2.0+) ● mllib VS spark.ml ○ Always use spark.ml if functionality exists ● Algorithmic Richness ● Using Parquet ○ Intermediate outputs ● Unbalanced partitions ○ Stuck on reduce ○ Stragglers ● Output size ○ Coalesce to desired size ● DataFrame Windows - Buggy ○ Write your own over RDD ● Parameter tuning ○ spark.sql.shuffle.partitions ○ Executors ○ Driver VS executor memory
  • 63. Putting it all together
  • 64. Work Process Step by step for deploying your big ML workflows to production, ready for operations and optimisations. 1. Measure first, optimize second. a. Define metrics. b. Preprocess data (using examples) c. Monitor. (dashboard setup) 2. Start small and grow. 3. Start with a flow. a. Good ML code trumps performance. b. Test your infrastructure. 4. Set up a baseline. 5. Go to work. a. Optimize. b. A/B. i. Test new flow in parallel to existing flow. ii. Update user assignments. 6. Watch. Iterate. (see 5.)
  • 66. Use Cases What does Waze do with all its data?
  • 67. Trending Locations / Day of Week Breakdown
  • 69. Optimising - Ad clicks / Time from drive start
  • 70. Time to Content (US) - Day of week / Category
  • 71. Irregular Events / Anomaly Detection Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large numbers of users.
  • 72. Dangerous Places - Clustering Find most dangerous areas / streets, using custom developed clustering algorithms ● Alert authorities / users ● Compare & share with 3rd parties (NYPD)
  • 73. Parking Places Detection Parking entrance Parking lot Street parking
  • 74. Server Distribution Optimisation Calculate the optimal routing servers distribution according to geographical load. ● Better experience - faster response time ● Saves money - no need for redundant elastic scaling of servers
  • 75. Text Mining - Topic Analysis Topic 1 - ETA; Topic 2 - Unusual; Topic 3 - Share info; Topic 4 - Reports; Topic 5 - Jams; Topic 6 - Voice. [Word-cloud terms: wazers, usual, road, social, still, morgan, eta, traffic, driving, drivers, will, ang, con, stay, info, reporting, update, freeman, zona, today, using, helped, drive, kanan, usando, times, area, nearby, delay, voice, real, clear, realtime, traffic, add, meter, tiempo, slower, sharing, jam, jammed, kan, carretera, accident, soci, drive, near, masuk]
  • 76. Text Mining - New Version Impressions ● Text analysis - stemming / stopword detection etc. ● Topic modeling ● Sentiment analysis Waze V4 update : ● Good - “redesign”, ”smarter”, “cleaner”, “improved” ● Bad - “stuck” Overall very positive score!
  • 77. Text Mining - Store Sentiments
  • 78. Text Mining - Sentiment by Time & Place