Google Confidential and Proprietary
Ad Click Prediction:
a View from the Trenches
Presented by Brendan McMahan
H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young,
Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov,
Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson,
Tom Boulos, Jeremy Kubica
KDD 2013
Web Search
Advertising
Google AdWords ads: targeted ads shown in response to a specific user query on a search engine.
Many billions of dollars a year are spent on web advertising.
Ad Click Prediction
Web search ads align incentives:
● The search engine only gets paid
if the user clicks the ad
● So, the search engine only wants
to show relevant ads
● and user only wants to see
relevant ads.
Predicting the probability of a click
for a specific ad in response to a
specific query is the key ingredient!
System Overview (similar to Downpour SGD)
Training massive models on massive data with
minimum resources:
● billions of unique features (model coefficients)
● billions of predictions per day serving live traffic
● billions of training examples
The FTRL-Proximal Online Learning Algorithm
"Like Stochastic Gradient Descent, but much sparser models."
● Online algorithms: simple, scalable, rich theory
● FTRL-Proximal is equivalent to Online (Stochastic) Gradient
Descent when no regularization is used [1]
● Implements regularization in a way similar to RDA [2], but gives
better accuracy in our experiments. The key is re-expressing
gradient descent implicitly as:
● With a few tricks, about as easy to implement as gradient descent
(a few lines of code, pseudo-code in the paper)
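The formula following "implicitly as:" did not survive extraction. Per the paper, unconstrained online gradient descent can be rewritten in the equivalent implicit form

```latex
\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \Big( \mathbf{g}_{1:t} \cdot \mathbf{w}
  + \tfrac{1}{2} \sum_{s=1}^{t} \sigma_s \, \lVert \mathbf{w} - \mathbf{w}_s \rVert_2^2 \Big)
```

where g_{1:t} is the sum of the gradients so far and the σ_s are chosen so that σ_{1:t} = 1/η_t. FTRL-Proximal simply adds a λ₁‖w‖₁ term inside the argmin, which is what produces sparse models.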
1 Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. McMahan, AISTATS 2011.
2 Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Lin Xiao, JMLR 2010.
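The "few lines of code" claim is easy to check. Below is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the pseudo-code in the paper; the class and method names are mine, and the hyperparameter names (alpha, beta, l1, l2) follow the paper's notation.

```python
import math

class FTRLProximal:
    """Sketch of per-coordinate FTRL-Proximal with L1/L2 regularization
    (Algorithm 1 of the KDD 2013 paper), for logistic regression."""

    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate lazily-stored state
        self.n = {}  # per-coordinate sum of squared gradients

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps rarely-useful coefficients at exactly zero
        sign = -1.0 if z < 0.0 else 1.0
        denom = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        """x: sparse example as a dict feature -> value; returns P(click)."""
        dot = sum(v * self.weight(i) for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-dot))

    def update(self, x, y):
        """y in {0, 1}. The log-loss gradient for coordinate i is (p - y) * x_i."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v
            n_old = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            # RHS uses the pre-update weight, as the algorithm requires
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_old + g * g
```

Coefficients for features whose accumulated z stays inside the L1 threshold need not be stored at all, which is where the memory savings over plain SGD come from.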
Per-Coordinate Learning Rates: Intuition
Consider predicting for:
● English queries in the US (1,000,000 training examples)
● Icelandic queries in Iceland (100 training examples)
Imagine the features are disjoint.
If we trained separate SGD models for each, we would use a very
different learning rate for each: generally ~ 1/sqrt(# examples)
Train a single model:
concatenate the feature vectors, interleave training examples
There is no good learning rate choice! SGD will either:
● never converge for Iceland (with a low learning rate)
● or oscillate wildly for the US (with a large learning rate)
There is a middle ground, but it is still suboptimal.
Per-Coordinate Learning Rates: Implementation
● Theory indicates the learning rate
● Requires storing one extra statistic per coordinate
○ But we have a trick for reducing this if you are training multiple similar
models
● Huge accuracy improvement:
○ Improved AUC by 11.2% versus a global learning rate baseline
○ (In our setting a 1% improvement is large)
References:
Adaptive Bound Optimization for Online Convex Optimization. McMahan and Streeter, COLT 2010.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Duchi, Hazan, and Singer, JMLR 2011.
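The rate formula on this slide did not survive extraction; per the references above it is η(t,i) = α / (β + √(Σ g²(s,i) for s ≤ t)). A minimal sketch (alpha and beta per the paper's notation):

```python
import math

def per_coordinate_rate(grad_sq_sum, alpha=0.1, beta=1.0):
    """Per-coordinate learning rate: eta = alpha / (beta + sqrt(sum of g^2)).
    Frequently-updated coordinates (large accumulated gradient) decay fast;
    rarely-seen coordinates keep a large rate."""
    return alpha / (beta + math.sqrt(grad_sq_sum))
```

This is exactly why a single model can serve both the US and Iceland: each coordinate gets its own effective rate, so there is no single global rate to compromise on.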
Techniques for Saving Memory
● L1 regularization
● Randomized rounding
● Probabilistic feature inclusion
● Grouping similar models
During training, we track some statistics for features not included in the
final model (zero coefficients due to L1 regularization).
[Diagram: the techniques above fall under two columns, "Saving Training Memory" and "Saving Memory at Serving."]
FTRL-Proximal with L1 Regularization
Tradeoff frontier for a small (10^6 examples) dataset
[Figure: sparsity vs. accuracy tradeoff frontier (axes: smaller models / better accuracy).
Each line varies the L1 parameter for different size/accuracy tradeoffs.]
FTRL-Proximal with L1 Regularization
Full-scale experiment
● Baseline algorithm: Only include features that occur at least K
times.
○ Tuned the L1 parameter for FTRL-Proximal to provide a good
sparsity/accuracy tradeoff.
○ Then tuned K to get the same accuracy with the baseline.
● Baseline model has about 3x as many non-zero entries
compared to FTRL-Proximal with L1 regularization
Probabilistic Feature Inclusion
● Long-tailed distributions produce many features that only occur a
handful of times. These generally aren't useful for prediction.
● So, how do you tell these rare features from common ones without
tracking any state? Two probabilistic techniques:
○ Poisson inclusion: When we see a feature not in the model,
add it with probability p.
○ Bloom-filter inclusion: use a rolling set of counting Bloom filters
to count occurrences (include after K, but sometimes includes a
feature that has occurred fewer than K times).
In some of our models, half of the unique
features occur only once in a training set
of billions of examples.
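Poisson inclusion is simple enough to sketch. This is a hypothetical illustration (the function and parameter names are mine), assuming the model is a dict from feature to coefficient:

```python
import random

def poisson_include(model, feature, p=0.1, rng=random):
    """Poisson inclusion: a feature not yet in the model is added with
    probability p; once added it is always trained. A feature seen n times
    enters with probability 1 - (1-p)**n, so one-off features rarely make it
    in, while common features are included after ~1/p sightings on average."""
    if feature in model:
        return True
    if rng.random() < p:
        model[feature] = 0.0  # initialize a new coefficient
        return True
    return False  # skip the update for this feature entirely
```

The Bloom-filter variant instead counts occurrences in a rolling set of counting Bloom filters and includes a feature once its (possibly overestimated) count reaches K.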
Storing Coefficients with Fewer Bits
The learning rate tells us how accurately we might know the
coefficient.
● If training on one example might change the coefficient by Δ, you
don't know the value with more precision than Δ.
● So, don't waste memory storing a full-precision floating-point
value! (32-64 bits)
● Adaptive schemes are possible
● But in practice, storing a 16-bit fixed point representation is easy
and works very well.
But, you need to be careful about accumulating errors. The trick:
randomized rounding
Large-Scale Learning with Less RAM via Randomization. Golovin, Sculley, McMahan, Young. ICML 2013.
Training with Randomized Rounding
Compute the usual update
(Gradient Descent or
FTRL-Proximal) at full
precision, then randomly
project to an adjacent
value expressible in the
fixed-point encoding.
Use unbiased rounding:
E[rounded coefficient]
= unrounded coefficient
No accuracy loss with a 16-bit fixed-point representation
compared to 32- or 64-bit floating-point values (and 50-75% less memory)
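The rounding step above can be sketched as follows; the q2.13 grid has 13 fractional bits, so its resolution is ε = 2⁻¹³ (the function name is mine):

```python
import math
import random

def randomized_round_q2_13(w, rng=random):
    """Unbiased randomized rounding to a q2.13 fixed-point grid
    (13 fractional bits, resolution eps = 2**-13). Round up with
    probability equal to the fractional position between the two
    adjacent grid points, so E[rounded value] == w."""
    eps = 2.0 ** -13
    scaled = w / eps
    lo = math.floor(scaled)
    frac = scaled - lo          # distance past the lower grid point, in [0, 1)
    if rng.random() < frac:
        lo += 1                 # round up with probability frac
    return lo * eps
```

Because the rounding is unbiased, errors from repeated updates average out instead of accumulating, which is why the 16-bit representation loses no accuracy.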
Training Many Similar Models: Individually
Training 4 similar models, each model trained separately
with coefficients stored in a hash table.
Each model uses, say, 4+4+2 = 10 bytes per coefficient
[Diagram: four separate hash tables, one per model. Each entry stores a
Feature Key (e.g., a 32-64 bit hash), a Gradient-Squared Sum, and that
model's Coefficient (q2.13).]
Why many similar models? We may want to try different feature
variations, learning rates, regularization strengths, ...
Training Many Similar Models: Grouped
Data for multiple similar models stored in a single hash table
Only 2 bytes per coefficient per model (plus ~12 bytes of shared overhead per feature)
[Diagram: a single shared hash table. Each entry stores a Feature Key
(e.g., a 32-64 bit hash), a # Positive Examples count, a # Negative
Examples count, and one Coefficient (q2.13) for each of Models 0-3.]
Common data is stored only once, so its cost (memory, CPU,
and network) is amortized. The only per-model cost is
storing that model's coefficient, which
can be small (say, 16-bit q2.13 fixed point).
Rather than using exact per-coordinate learning rates based on each
model's gradients, we use a (good enough) approximation based on
the # of observed positive and negative examples.
Computing Learning Rates with Counts
The previous trick requires using the same learning-rate statistic for all
grouped models. Approximate the theory-recommended learning-rate
schedule
using only counts:
N is the number of Negative examples
P is the number of Positive examples
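The schedule itself did not survive extraction. Per the paper, if positive examples arrive with probability p = P/(N+P), the expected sum of squared logistic-loss gradients is approximately P·(1−p)² + N·p² = P·N/(N+P), giving η ≈ α / (β + √(PN/(N+P))). A sketch (the function name is mine):

```python
import math

def approx_rate_from_counts(P, N, alpha=0.1, beta=1.0):
    """Approximate per-coordinate learning rate using only the counts of
    positive (P) and negative (N) examples. With p = P/(N+P), the
    gradient-squared sum is about P*(1-p)**2 + N*p**2 = P*N/(N+P)."""
    grad_sq = (P * N / (N + P)) if (N + P) > 0 else 0.0
    return alpha / (beta + math.sqrt(grad_sq))
```

Since P and N are already stored once per feature in the grouped table, no per-model gradient statistics are needed at all.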
High-Dimensional Data Visualization for Model Accuracy
We don't just care about aggregate accuracy!
● Progressive validation: compute prediction on each example before
training on it
● Look at relative changes in accuracy between models
● Data visualization to quickly spot outliers or problem areas
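Progressive validation is a one-loop pattern. This sketch assumes a model object with predict and update methods (a hypothetical interface, matching the FTRL sketch earlier only by convention):

```python
import math

def progressive_validation(model, stream):
    """Score each example *before* training on it, so every training
    example also serves as held-out test data. Returns average log-loss."""
    total, n = 0.0, 0
    for x, y in stream:
        p = model.predict(x)                           # predict first ...
        total += -math.log(p if y == 1 else 1.0 - p)   # ... record the loss ...
        model.update(x, y)                             # ... then train on it
        n += 1
    return total / n
```

This gives an unbiased online accuracy estimate with no separate validation set, which matters when every logged example is valuable training data.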
Automated Feature Management System
Many raw signals: words-in-user-query, country-of-origin, etc.
Lots of engineers working on signals, multiple different learning
platforms / teams consuming them.
A metadata index to manage all these signals:
● what systems are using what signals
● what signals are available to different systems
● status and version information / deprecation
● whitelists for production readiness / test status
Used for automatic testing, alerting, notification, cleanup of unused
signals, etc.
Also in the paper …
● Fast approximate training with a single value structure
● Unbiased training on a biased subsample of training data
● Assessing model accuracy with progressive validation
● Cheap confidence estimates
● Approaches for calibrating predictions
Techniques that didn't work as well as expected
● Aggressive feature hashing
● Randomized feature dropout
● Averaging models trained on different subsets of the features
● Feature vector normalization
Thanks for listening!
And thanks to all my co-authors:
Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady,
Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat
Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom
Boulos, Jeremy Kubica

Chemistry 4th semester series (krishna).pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 

Ad Click Prediction: a View from the Trenches (KDD 2013 talk)

  • 1. Ad Click Prediction: a View from the Trenches
    Presented by Brendan McMahan
    H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica
    KDD 2013
    Google Confidential and Proprietary
  • 2. Web Search Advertising
    Google AdWords: targeted ads shown in response to a specific user query on a search engine.
    Many billions of dollars a year are spent on web advertising.
  • 3. Ad Click Prediction
    Web search ads align incentives:
    ● The search engine only gets paid if the user clicks the ad
    ● So, the search engine only wants to show relevant ads
    ● And the user only wants to see relevant ads
    Predicting the probability of a click for a specific ad in response to a specific query is the key ingredient!
  • 4. System Overview
    Similar to Downpour SGD.
    Training massive models on massive data with minimal resources:
    ● billions of unique features (model coefficients)
    ● billions of predictions per day serving live traffic
    ● billions of training examples
  • 5. The FTRL-Proximal Online Learning Algorithm
    "Like Stochastic Gradient Descent, but much sparser models."
    ● Online algorithms: simple, scalable, rich theory
    ● FTRL-Proximal is equivalent to Online (Stochastic) Gradient Descent when no regularization is used [1]
    ● Implements regularization in a way similar to RDA [2], but gives better accuracy in our experiments. The key is re-expressing gradient descent implicitly as an equivalent optimization over all past gradients.
    ● With a few tricks, about as easy to implement as gradient descent (a few lines of code; pseudo-code in the paper)
    [1] Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. McMahan, AISTATS 2011.
    [2] Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Lin Xiao, JMLR 2010.
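The implicit re-expression this slide refers to appeared as an equation image; reconstructed here from Eq. (1) of the paper, where g_{1:t} is the sum of loss gradients through round t:

```latex
% FTRL-Proximal update. The sigma_s are chosen so that
% sigma_{1:t} = 1/eta_t, the gradient-descent learning-rate schedule.
w_{t+1} = \arg\min_{w} \Big( g_{1:t} \cdot w
        + \tfrac{1}{2} \sum_{s=1}^{t} \sigma_s \lVert w - w_s \rVert_2^2
        + \lambda_1 \lVert w \rVert_1 \Big)
% With lambda_1 = 0, setting the gradient of the objective to zero
% recovers exactly w_{t+1} = w_t - eta_t * g_t, i.e., ordinary
% (online) gradient descent; with lambda_1 > 0 the argmin has a
% closed form that sets many coordinates exactly to zero.
```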
  • 6. Per-Coordinate Learning Rates: Intuition
    Consider predicting for:
    ● English queries in the US (1,000,000 training examples)
    ● Icelandic queries in Iceland (100 training examples)
    Imagine the features are disjoint. If we trained a separate SGD model for each, we would use a very different learning rate for each: generally ~ 1/sqrt(# examples).
    Now train a single model: concatenate the feature vectors and interleave the training examples. There is no good global learning rate choice! SGD will either:
    ● never converge for Iceland (with a low learning rate)
    ● or oscillate wildly for the US (with a large learning rate)
    There is a middle ground, but it is still suboptimal.
  • 7. Per-Coordinate Learning Rates: Implementation
    ● Theory indicates the learning rate for each coordinate should decay with that coordinate's own accumulated squared gradients
    ● Requires storing one extra statistic per coordinate
    ○ But we have a trick for reducing this if you are training multiple similar models
    ● Huge accuracy improvement:
    ○ Improved AUC by 11.2% versus a global learning rate baseline
    ○ (In our setting a 1% improvement is large)
    References:
    Adaptive Bound Optimization for Online Convex Optimization. McMahan and Streeter, COLT 2010.
    Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Duchi, Hazan, and Singer, JMLR 2011.
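The learning-rate schedule on this slide was an equation image; a minimal sketch of the idea in Python, using the paper's schedule eta_{t,i} = alpha / (beta + sqrt(sum of squared gradients on coordinate i)). The class name and the binary-feature representation are illustrative, and alpha/beta values are placeholders, not tuned settings:

```python
import math

class PerCoordinateSGD:
    """Online logistic regression with per-coordinate learning rates."""

    def __init__(self, alpha=0.5, beta=1.0):
        self.alpha, self.beta = alpha, beta
        self.w = {}   # coefficient per feature id
        self.g2 = {}  # accumulated squared gradient per feature id

    def predict(self, x):
        # x: sparse example given as an iterable of feature ids (values 1.0)
        z = sum(self.w.get(i, 0.0) for i in x)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        p = self.predict(x)
        g = p - y  # gradient of log-loss w.r.t. the linear score
        for i in x:
            self.g2[i] = self.g2.get(i, 0.0) + g * g
            # Per-coordinate rate: frequent features get small steps,
            # rare features keep learning at a larger rate.
            eta = self.alpha / (self.beta + math.sqrt(self.g2[i]))
            self.w[i] = self.w.get(i, 0.0) - eta * g
```

Because each feature tracks its own gradient statistic, the US features and the Iceland features from the previous slide effectively get their own step sizes within one model.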
  • 8. Techniques for Saving Memory
    Saving Training Memory:
    ● Probabilistic feature inclusion
    ● Grouping similar models
    ● Randomized rounding
    Saving Memory at Serving:
    ● L1 regularization
    ● Randomized rounding
    During training, we track some statistics for features not included in the final model (zero coefficients due to L1 regularization).
  • 9. FTRL-Proximal with L1 Regularization
    Tradeoff frontier for a small (10^6 examples) dataset.
    (Figure: each line varies the L1 parameter for different size/accuracy tradeoffs; axes are smaller models vs. better accuracy.)
  • 10. FTRL-Proximal with L1 Regularization: Full-Scale Experiment
    ● Baseline algorithm: only include features that occur at least K times.
    ○ Tuned the L1 parameter for FTRL-Proximal to provide a good sparsity/accuracy tradeoff.
    ○ Then tuned K to get the same accuracy with the baseline.
    ● The baseline model has about 3x as many non-zero entries as FTRL-Proximal with L1 regularization.
  • 11. Probabilistic Feature Inclusion
    ● Long-tailed distributions produce many features that occur only a handful of times. These generally aren't useful for prediction. In some of our models, half of the unique features occur only once in a training set of billions of examples.
    ● So, how do you tell these rare features from common ones without tracking any state? Two probabilistic techniques:
    ○ Poisson inclusion: when we see a feature not in the model, add it with probability p.
    ○ Bloom-filter inclusion: use a rolling set of counting Bloom filters to count occurrences, and include a feature after K of them (though a feature that has occurred fewer than K times is sometimes included, since counting Bloom filters can over-count).
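Poisson inclusion can be sketched in a few lines. The function name and signature below are illustrative, not from the paper; the point is that no per-feature state is kept for excluded features, yet a feature seen n times enters the model with probability 1 - (1 - p)^n:

```python
import random

def poisson_include(model, feature, p=0.01, rng=random.random):
    """Admit a previously unseen feature with probability p per occurrence.

    `model` is a dict mapping included feature ids to coefficients.
    Returns True if this occurrence should be trained on.
    """
    if feature in model:
        return True            # already included: train on it as usual
    if rng() < p:
        model[feature] = 0.0   # admit it, initializing the coefficient
        return True
    return False               # skip this occurrence entirely
```

A feature occurring millions of times is admitted almost immediately, while a feature occurring once is admitted with probability only p, so in expectation the tail of single-occurrence features costs (tail size) x p memory instead of (tail size).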
  • 12. Storing Coefficients with Fewer Bits
    The learning rate tells us how accurately we might know the coefficient:
    ● If training on one example might change the coefficient by Δ, you don't know the value with more precision than Δ.
    ● So, don't waste memory storing a full-precision floating-point value (32-64 bits)!
    ● Adaptive schemes are possible, but in practice storing a 16-bit fixed-point representation is easy and works very well.
    But you need to be careful about accumulating errors. The trick: randomized rounding.
    Large-Scale Learning with Less RAM via Randomization. Golovin, Sculley, McMahan, Young. ICML 2013.
  • 13. Training with Randomized Rounding
    Compute the usual update (Gradient Descent or FTRL-Proximal) at full precision, then randomly project to an adjacent value expressible in the fixed-point encoding.
    Use unbiased rounding: E[rounded coefficient] = unrounded coefficient.
    No accuracy loss with a 16-bit fixed-point representation compared to 32- or 64-bit floating-point values (and 50-75% less memory).
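Unbiased randomized rounding to the q2.13 grid used on these slides (13 fractional bits, so a resolution of 2^-13) can be sketched as follows. This is a minimal illustration; overflow handling for the 2 integer bits is omitted:

```python
import math
import random

def round_q2_13(w, rng=random.random):
    """Randomly round w to an adjacent multiple of 2**-13, unbiasedly.

    The fractional position of w between the two neighboring grid
    values is used as the probability of rounding up, which makes
    E[rounded value] == w exactly.
    """
    scaled = w * (1 << 13)
    lo = math.floor(scaled)
    frac = scaled - lo                     # in [0, 1): chance of rounding up
    q = lo + (1 if rng() < frac else 0)    # one of the two adjacent grid points
    return q / (1 << 13)
```

Because each rounding error has zero mean, errors from millions of training updates tend to cancel rather than accumulate, which is why the 16-bit representation loses no accuracy.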
  • 14. Training Many Similar Models: Individually
    Why many similar models? We may want to try different feature variations, learning rates, regularization strengths, ...
    Training 4 similar models individually, each model trained separately with its coefficients stored in its own hash table. Each entry holds a feature key (e.g., a 32-64 bit hash), a gradient-squared sum, and that model's coefficient (q2.13), so each model uses say 4 + 4 + 2 = 10 bytes per coefficient.
  • 15. Training Many Similar Models: Grouped
    Data for multiple similar models is stored in a single hash table: each entry holds the feature key (e.g., a 32-64 bit hash), the # of positive examples, the # of negative examples, and one coefficient (q2.13) per model.
    Common data is stored only once, so its cost (memory, CPU, and network) is amortized across models. The only per-model cost is storing the model's coefficient, which can be small (say a 16-bit q2.13 fixed-point value): only 2 bytes per coefficient (plus 12 bytes of shared overhead) per model.
    Rather than using exact per-coordinate learning rates based on each model's gradients, we use a (good enough) approximation based on the # of observed positive and negative examples.
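A sketch of one grouped hash-table entry for 4 model variants, following the slide's layout: a shared feature key and positive/negative counts, plus one 16-bit q2.13 coefficient per variant. The exact field widths beyond what the slide states, and the function names, are assumptions for illustration:

```python
import struct

# Little-endian layout: 64-bit key, two 32-bit counts, four 16-bit
# signed fixed-point coefficients (q2.13: 13 fractional bits).
ENTRY_FORMAT = "<QII4h"

def pack_entry(key, n_pos, n_neg, coeffs):
    """Encode 4 coefficients to q2.13 and pack the shared entry."""
    fixed = [int(round(c * (1 << 13))) for c in coeffs]
    return struct.pack(ENTRY_FORMAT, key, n_pos, n_neg, *fixed)

def unpack_entry(blob):
    key, n_pos, n_neg, *fixed = struct.unpack(ENTRY_FORMAT, blob)
    return key, n_pos, n_neg, [q / (1 << 13) for q in fixed]
```

Adding a fifth model variant grows each entry by only 2 bytes, since the key and the counts are already paid for.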
  • 16. Computing Learning Rates with Counts
    The previous trick requires using the same learning-rate statistic for all grouped models. Approximate the theory-recommended learning-rate schedule using only counts, where N is the number of negative examples and P is the number of positive examples.
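The count-based approximation on this slide appeared as an equation image; a sketch of the derivation and the resulting rate in Python (alpha and beta are the usual learning-rate tuning parameters; the values here are placeholders):

```python
import math

def count_based_learning_rate(n_pos, n_neg, alpha=0.1, beta=1.0):
    """Approximate the per-coordinate learning rate from counts alone.

    For logistic loss, if p = P/(N+P) is the empirical CTR, the gradient
    magnitude is p on negative examples and 1-p on positive ones, so

        sum of squared gradients ~= P*(1-p)**2 + N*p**2 = P*N / (N + P).

    That lets all grouped models share one statistic: the (P, N) counts.
    """
    if n_pos + n_neg == 0:
        return alpha / beta
    approx_grad_sq = n_pos * n_neg / (n_pos + n_neg)
    return alpha / (beta + math.sqrt(approx_grad_sq))
```

The algebra in the docstring collapses because P*N**2/(N+P)**2 + N*P**2/(N+P)**2 = P*N*(N+P)/(N+P)**2 = P*N/(N+P).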
  • 17. High-Dimensional Data Visualization for Model Accuracy
    We don't just care about aggregate accuracy!
    ● Progressive validation: compute the prediction on each example before training on it
    ● Look at relative changes in accuracy between models
    ● Use data visualization to quickly spot outliers or problem areas
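Progressive validation is simple to sketch: score each example with the current model before training on it, so every example serves as held-out data exactly once. The function name is illustrative, and `model` is assumed to expose `predict(x)` and `update(x, y)` methods:

```python
import math

def progressive_log_loss(model, stream):
    """Average log-loss over a stream, predicting before each update."""
    total, n = 0.0, 0
    for x, y in stream:
        # Clamp to avoid log(0) on confident predictions.
        p = min(max(model.predict(x), 1e-15), 1 - 1e-15)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
        model.update(x, y)  # only train after the prediction is recorded
    return total / max(n, 1)
```

This gives an online accuracy estimate with no separate holdout set, which is what makes per-slice accuracy comparisons between many model variants cheap.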
  • 18. High-Dimensional Data Visualization for Model Accuracy (figure)
  • 19. Automated Feature Management System
    Many raw signals: words-in-user-query, country-of-origin, etc. Lots of engineers work on signals, and multiple different learning platforms / teams consume them.
    A metadata index manages all these signals:
    ● what systems are using what signals
    ● what signals are available to different systems
    ● status and version information / deprecation
    ● whitelists for production readiness / test status
    Used for automatic testing, alerting, notification, cleanup of unused signals, etc.
  • 20. Also in the paper …
    ● Fast approximate training with a single value structure
    ● Unbiased training on a biased subsample of training data
    ● Assessing model accuracy with progressive validation
    ● Cheap confidence estimates
    ● Approaches for calibrating predictions
    Techniques that didn't work as well as expected:
    ● Aggressive feature hashing
    ● Randomized feature dropout
    ● Averaging models trained on different subsets of the features
    ● Feature vector normalization
  • 21. Thanks for listening!
    And thanks to all my co-authors: Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica