Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15

Many Shades of Scale: Big Learning Beyond Big Data
Misha Bilenko
Principal Researcher
Microsoft Azure Machine Learning
ML ♥ More Data
What we see in production
[Banko and Brill, 2001]
What we [used to] learn in school
[Mooney, 1996]
Is training on more examples all there is to it?
Big Learning ≠ Learning(BigData)
• Big data: size → distributing storage and processing
• Big learning: scale bottlenecks in training and prediction
• Classic bottlenecks: bytes and cycles
Large datasets → distribute training on larger hardware (FPGAs, GPUs, cores, clusters)
• Other scaling dimensions
Features, Components/People
Learning from Counts with DRACuLa
Distributed Robust Algorithm for Count-based Learning
joint work with Chris Meek (MSR)
Wenhan Wang, Pete Luferenko (Azure ML)
Scaling to many Features
Learning with relational data
p(click | ad, context, user)
adid = 1010054353
adText = K2 ski sale!
adURL= www.k2.com/sale
Userid = 0xb49129827048dd9b
IP = 131.107.65.14
Query = powder skis
QCategories = {skiing, outdoor gear}
#users ~ 10^9, #queries ~ 10^9+, #ads ~ 10^7, #(ad × query) ~ 10^10+
• Information retrieval
• Advertising, recommending, search: item, page/query, user
• Transaction classification
• Payment fraud: transaction, product, user
• Email spam: message, sender, recipient
• Intrusion detection: session, system, user
• IoT: device, location
Learning with relational data
𝑝(𝑐𝑙𝑖𝑐𝑘|𝑢𝑠𝑒𝑟,𝑐𝑜𝑛𝑡𝑒𝑥𝑡,𝑎𝑑)
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid 0xb49129827048dd9b
IP 131.107.65.14
query powder skis
qCategories {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: ~10^5+ predictions/sec/node
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing
• What everyone should use in industry: learning with counts
• Formalization and generalization
Standard approach 1: binary (one-hot, indicator)
Attributes are mapped to indices based on lookup tables
- Not scalable: cannot support high-cardinality attributes
- Not efficient: large value-index dictionary must be retained
- Not flexible: only linear learners are practical
- Not adaptive: doesn’t support drift in attribute values
0010000..00 0..01000000 00000..001 0..00001000
#userIPs #ads #queries #queries x #ads
idx_u(131.107.65.14)  idx_q(powder skis)  idx_a(k2.com)  idx(powder skis, k2.com)
Standard approach 1+: feature hashing
Attributes are mapped to indices via hashing: h(x_i) = hash(x_i) mod m
• Collisions are rare; dot products unbiased
+ Scalable: no mapping tables
+ Efficient: low cost, preserves sparsity
- Not flexible: only linear learners are practical
± Adaptive: new values ok, no temporal effects
0000010..0000010000..0000010...000001000
h(powder skis + k2.com)
h(powder skis)
h(k2.com)
h(131.107.65.14)
m ~ 10^7
[Moody ‘89, Tarjan-Skadron ‘05, Weinberger+ ’08]
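The hashing step above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: `zlib.crc32` stands in for the unspecified hash function, and the function name `hash_features`, the `name=value` token format, and the sign bit drawn from `zlib.adler32` (following the signed-hashing idea in Weinberger et al.) are all assumptions made for the example.

```python
import zlib


def hash_features(attributes, m=10**7):
    """Map {attribute name: value} to a sparse {index: weight} vector of width m."""
    vec = {}
    for name, value in attributes.items():
        token = f"{name}={value}".encode("utf-8")
        idx = zlib.crc32(token) % m  # h(x_i) = hash(x_i) mod m
        # Assumed sign hash: a second, independent hash decides +1 or -1.
        sign = 1.0 if zlib.adler32(token) & 1 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec


example = {
    "ip": "131.107.65.14",
    "query": "powder skis",
    "ad": "k2.com",
    "query_x_ad": "powder skis+k2.com",
}
x = hash_features(example)
```

No lookup table is kept, so a previously unseen value simply lands in some bucket; the price is that the representation stays high-dimensional and sparse, which is why the slide marks it as practical mainly for linear learners.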
𝜙(𝑥)
Learning with counts
• Features are per-label counts [+odds] [+backoff]
𝝓 = [N+ N- log(N+)-log(N-) IsRest]
• log(N+) - log(N-) = log(p(+)/p(−)): log-odds/Naïve Bayes estimate
• N+, N-: indicators of confidence of the naïve estimate
• IsFromRest: indicator of back-off vs. “real count”
131.107.65.14 → Counts(131.107.65.14)
k2.com → Counts(k2.com)
powder skis → Counts(powder skis)
powder skis, k2.com → Counts(powder skis, k2.com)
IP 𝑵+ 𝑵−
173.194.33.9 46964 993424
87.250.251.11 31 843
131.107.65.14 12 430
… … …
REST 745623 13964931
𝝓(𝑪𝒐𝒖𝒏𝒕𝒔 (𝑰𝑷)) 𝝓(𝑪𝒐𝒖𝒏𝒕𝒔 (𝒂𝒅)) 𝝓(𝑪𝒐𝒖𝒏𝒕𝒔 (𝒒𝒖𝒆𝒓𝒚)) 𝝓(𝑪𝒐𝒖𝒏𝒕𝒔 (𝒒𝒖𝒆𝒓𝒚, 𝒂𝒅))
Learning with counts
• Features are per-label counts [+odds] [+backoff]
𝝓 = [N+ N- log(N+)-log(N-) IsRest]
+ Scalable: “head” in memory + tail in backoff; or: count-min sketch
+ Efficient: low cost, low dimensionality
+ Flexible: low dimensionality works well with non-linear learners
+ Adaptive: new values easily added, back-off for infrequent values, temporal counts
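As a rough sketch of the featurization 𝝓 = [N+ N- log(N+)-log(N-) IsRest], the snippet below looks one attribute value up in a count table and falls back to the REST row when the value is not in the in-memory head. The table contents are the IP numbers shown on the slide; the function name `count_features` and the add-one smoothing inside the logarithms are assumptions for illustration, not details from the talk.

```python
import math

# Toy "head" of the IP count table: value -> (N+, N-), numbers from the slide.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
IP_REST = (745623, 13964931)  # back-off row for every value not in the head


def count_features(value, table, rest):
    """Return [N+, N-, log(N+) - log(N-), IsRest] for one attribute value."""
    is_rest = value not in table
    n_pos, n_neg = rest if is_rest else table[value]
    # Add-one smoothing is an assumption here; it keeps the logs finite.
    log_odds = math.log(n_pos + 1) - math.log(n_neg + 1)
    return [n_pos, n_neg, log_odds, 1.0 if is_rest else 0.0]


phi_seen = count_features("131.107.65.14", IP_COUNTS, IP_REST)
phi_unseen = count_features("10.0.0.1", IP_COUNTS, IP_REST)
```

Each attribute (IP, ad, query, query × ad) contributes one such four-element block, so the full feature vector stays small regardless of how many distinct values the attribute takes.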
Backoff is a pain. Count-Min Sketches to the Rescue!
[Cormode-Muthukrishnan ‘04]
Intuition: correct for collisions by using multiple hashes
M: d × w matrix of counters, one row per hash function h_j
Count: for each hash function h_j: M[j][h_j(i)]++ (update time: O(d))
Featurize: min_j M[j][h_j(i)] (estimation time: O(d))
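The count and featurize steps above translate almost directly into code. This is a minimal sketch under stated assumptions: the class name `CountMinSketch`, the default sizes, and the use of salted `hashlib.blake2b` digests as the d hash functions are illustrative choices, not taken from the slides.

```python
import hashlib


class CountMinSketch:
    """d x w matrix of counters; the estimate is the minimum over the d rows."""

    def __init__(self, depth=4, width=1000):
        self.d, self.w = depth, width
        self.M = [[0] * width for _ in range(depth)]

    def _h(self, j, item):
        # Row j uses its own salt, giving d different hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=j.to_bytes(2, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, item, amount=1):
        for j in range(self.d):  # update time O(d)
            self.M[j][self._h(j, item)] += amount

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.M[j][self._h(j, item)] for j in range(self.d))


clicks = CountMinSketch()
for _ in range(12):
    clicks.add("131.107.65.14")
```

For count features one would keep one sketch per label (for example one for N+ and one for N-); the estimate never undercounts, and a never-seen value returns a small number rather than requiring an explicit REST row.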
Learning from counts: aggregation
Aggregate Count(y, bin(x)) for different bin(x)
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing, hierarchical (shrinkage)
IP 𝑵+ 𝑵−
173.194.33.9 46964 993424
87.250.251.11 31 843
131.253.13.32 12 430
… … …
REST 745623 13964931
query 𝑵+ 𝑵−
facebook 281912 7957321
dozen roses 32791 640964
… … …
REST 6321789 43477252
Query × AdId 𝑵+ 𝑵−
facebook, ad1 54546 978964
facebook, ad2 232343 8431467
dozen roses, ad3 12973 430982
… … …
REST 4419312 52754683
[Figure: timeline with T_now; counting]
IP[2] 𝑵+ 𝑵−
173.194.*.* 46964 993424
87.250.*.* 6341 91356
131.253.*.* 75126 430826
… … …
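The aggregation step can be sketched as a single pass that applies several bin functions to each logged example and increments per-label counters, followed by folding infrequent bins into a REST row. Everything named here (`aggregate`, `tail_bin`, the field names `ip`, `query`, `ad`, `click`, and the toy log) is hypothetical; on a cluster the same logic would be expressed as a map step emitting (bin, label) pairs and a reduce step summing them.

```python
from collections import defaultdict


def bin_ip(ex):
    return ex["ip"]


def bin_ip_prefix2(ex):
    # Projection onto the first two octets, as in the IP[2] table.
    return ".".join(ex["ip"].split(".")[:2]) + ".*.*"


def bin_query_ad(ex):
    return (ex["query"], ex["ad"])


BINS = {"IP": bin_ip, "IP[2]": bin_ip_prefix2, "Query x AdId": bin_query_ad}


def aggregate(examples, bins=BINS):
    """Return {table name: {bin value: [N+, N-]}}."""
    tables = {name: defaultdict(lambda: [0, 0]) for name in bins}
    for ex in examples:
        for name, bin_fn in bins.items():
            tables[name][bin_fn(ex)][0 if ex["click"] else 1] += 1
    return tables


def tail_bin(table, min_total):
    """Keep frequent bins; fold the rest into one REST row ("tail bin" back-off)."""
    head, rest = {}, [0, 0]
    for key, (n_pos, n_neg) in table.items():
        if n_pos + n_neg >= min_total:
            head[key] = (n_pos, n_neg)
        else:
            rest[0] += n_pos
            rest[1] += n_neg
    return head, tuple(rest)


log = [
    {"ip": "173.194.33.9", "query": "facebook", "ad": "ad1", "click": True},
    {"ip": "173.194.33.9", "query": "facebook", "ad": "ad1", "click": False},
    {"ip": "173.194.40.1", "query": "dozen roses", "ad": "ad3", "click": False},
    {"ip": "87.250.251.11", "query": "facebook", "ad": "ad2", "click": False},
]
tables = aggregate(log)
ip_head, ip_rest = tail_bin(tables["IP"], min_total=2)
```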
Learning from counts: combiner training
[Figure: the count tables above (IP, query, Query × AdId) are built by counting up to T_now; the aggregated features (N+, N−, ln N+ − ln N−, IsBackoff, …) are combined with the original numeric features to train the predictor]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Prediction with counts
URL × Country 𝑵+ 𝑵−
url1, US 54546 978964
url2, CA 232343 8431467
url3, FR 12973 430982
… … …
REST 4419312 52754683
[Figure: timeline marking T_train and T_now; counting → aggregated features (N+, N−, ln N+ − ln N−, IsBackoff, …) together with the original numeric features]
• Counts are updated continuously
• Combiner re-training infrequent
Where did it come from?
Li et al. 2010
Pavlov et al. 2009
Lee et al. 1998
Yeh and Patt, 1991
Hillard et al. 2011
• De-facto standard in online advertising industry
• Rediscovered by everyone who really cares about accuracy
Do we need to separate counting and training?
• Can we use the same data for both counting and featurization?
• Bad idea: leakage = count features contain labels → overfitting
• Combiner dedicates capacity to decoding example’s label from features
• Can we hold out each example’s label during train-set featurization?
• Bad idea: leakage and bias
• Illustration: two examples, same feature values, different labels (click and non-click)
• Different representations are inconsistent and allow decoding the label
[Figure: Counting → Train predictor]
Example ID   Label   N+[a]        N-[a]
1            +       N+[a] − 1    N-[a]
2            -       N+[a]        N-[a] − 1
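The two-example illustration can be made concrete with a few lines of code. The helper names `leave_one_out_counts` and `decode_label` are hypothetical, and the counts 12 and 430 are borrowed from the IP table earlier in the deck purely as sample numbers.

```python
def leave_one_out_counts(n_pos, n_neg, label):
    """Counts for value a with this example's own label held out."""
    return (n_pos - 1, n_neg) if label == "+" else (n_pos, n_neg - 1)


def decode_label(features, n_pos):
    """A trivial 'combiner' that reads the label back out of the held-out counts."""
    return "+" if features[0] == n_pos - 1 else "-"


N_POS, N_NEG = 12, 430
click = leave_one_out_counts(N_POS, N_NEG, "+")      # (11, 430)
non_click = leave_one_out_counts(N_POS, N_NEG, "-")  # (12, 429)
```

Both examples share the same attribute value, yet their feature vectors differ in exactly the way that reveals the label, so a sufficiently flexible learner can spend its capacity on this decoding instead of on generalizable structure.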
Solution via Differential privacy
• What is leakage? Revealing information about any individual label
• Formally: count table cT is ε-leakage-proof if same features for ∀𝑥, 𝑇, 𝑇′ = 𝑇(𝑥𝑖, 𝑦𝑖)
• Theorem: adding noise sampled from Laplace(k/𝜖) makes counts 𝜖-leakage-proof
• Typically 1 ≤ 𝑘 ≤ 100
• Concretely: N+ = N+ + LaplaceRand(0,10k); N- = N- + LaplaceRand(0,10k)
• In practice: LaplaceRand(0,1) sufficient
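A small sketch of the noise step, using only the standard library. The names `laplace_rand` and `noisy_counts` are illustrative; the sampler relies on the fact that the difference of two independent Exp(1) draws is Laplace(0, 1), and the scale is left as a parameter rather than fixed to any value from the slide.

```python
import random


def laplace_rand(loc, scale, rng):
    """Draw from Laplace(loc, scale) as a scaled difference of two Exp(1) draws."""
    return loc + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_counts(n_pos, n_neg, scale, rng):
    """Perturb both per-label counts independently before featurization."""
    return (
        n_pos + laplace_rand(0.0, scale, rng),
        n_neg + laplace_rand(0.0, scale, rng),
    )


rng = random.Random(0)  # fixed seed only to make the example repeatable
n_pos_noisy, n_neg_noisy = noisy_counts(12, 430, scale=1.0, rng=rng)
```

The noise is added when featurizing the training set; because it is zero-mean, frequent values are barely affected, while values with very small counts are blurred enough that a single label no longer stands out.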
Learning from counts: why it works
• State-of-the-art accuracy
• Easy to implement on standard clusters
• Monitorable and debuggable
• Temporal changes easy to monitor
• Easy emergency recovery (bot attacks, etc.)
• Error debugging (which feature to blame)
• Modular (vs. monolithic)
• Components: learners and count features
• People: multiple feature/learner authors
Big Learning: Pipelines and Teams
Ravi: text features in R
Jim: matrix projections
Vera: sweeping boosted trees
Steph: count features on Hadoop
How to scale up Machine Learning to
Parallel and Distributed Data Scientists?
AzureML
• Cloud-hosted, graphical environment
for creating, training, evaluating, sharing, and deploying
machine learning models
• Supports versioning and collaboration
• Dozens of ML algorithms, extensible via R and Python
API | ML Studio
Learning with Counts in Azure ML
Criteo 1TB dataset
Counting: an hour on HDInsight Hadoop cluster
Training: minutes in AzureML Studio
Deployment: one click to RRS service
Maximizing Utilization: Keeping it Asynchronous
• Macro-level: concurrently executing pipelines
• Micro-level: asynchronous optimization (with overwriting updates)
• Hogwild SGD [Recht-Re], Downpour SGD [Google Brain]
• Parameter Server [Smola et al.]
• GraphLab [Guestrin et al.]
• SA-SDCA [Tran, Hosseini, Xiao, Finley, B.]
Semi-Asynchronous SDCA:
state-of-the-art linear learning
• SDCA: Stochastic Dual Coordinate Ascent [Shalev-Shwartz & Zhang]
• Plot: SGD marries SVM and they have a beautiful baby
• Algorithm: for each example: update example’s 𝛼𝑖, then re-estimate weights
• Let’s make it asynchronous, Hogwild-style!
• Problem: primal and dual diverge
• Solution: separate thread for primal-dual synchronization
• Taking it out-of-memory: block pseudo-random data loading
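SGD update:
$w^{t+1} \leftarrow w^{t} - \gamma_t \left( \lambda w^{t} - y_i \, \phi_i'(w^{t} \cdot x_i) \, x_i \right)$
SDCA update:
$\alpha_i^{t} \leftarrow \alpha_i^{t-1} + \Delta\alpha_i$
$w^{t} \leftarrow w^{t-1} + \frac{\Delta\alpha_i}{\lambda n} x_i$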
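To make the SDCA update concrete, here is a plain sequential sketch for one specific loss. It assumes the squared loss, minimizing (1/n) Σ ½(w·x_i − y_i)² + (λ/2)‖w‖², for which the dual step has the closed form Δα_i = (y_i − w·x_i − α_i) / (1 + ‖x_i‖²/(λn)). The function name `sdca_ridge` and the toy data are hypothetical, and this is not the semi-asynchronous variant from the talk: there is a single thread, so primal and dual stay in sync by construction.

```python
import random


def sdca_ridge(X, y, lam, epochs=200, seed=0):
    """Sequential SDCA for L2-regularized squared loss; returns (w, alpha)."""
    n, dim = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * dim
    rng = random.Random(seed)
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one random pass over the examples
            xi = X[i]
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            sq_norm = sum(xj * xj for xj in xi)
            # Closed-form coordinate step for the squared loss (assumed loss).
            delta = (y[i] - pred - alpha[i]) / (1.0 + sq_norm / (lam * n))
            alpha[i] += delta  # update the example's dual variable
            for j in range(dim):
                w[j] += delta / (lam * n) * xi[j]  # keep w in sync with alpha
    return w, alpha


X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.0, 2.5]
w, alpha = sdca_ridge(X, y, lam=1.0)
```

In one dimension the minimizer is available in closed form, w = Σ x_i y_i / (Σ x_i² + λn), which gives a quick sanity check on the result. The invariant w = (1/(λn)) Σ α_i x_i is exactly what lock-free updates can violate, which is the primal-dual divergence the slide refers to.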
Keeping it asynchronous: it pays off
In closing: Big Learning = Streetfighting
• Big features are resource-hungry: learning with counts, projections…
• Make them distributed and easy to compute/monitor
• Big learners are resource-hungry
• Parallelize them (preferably asynchronously)
• Big pipelines are resource-hungry: authored by many humans
• Run them in a collaborative cloud environment
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab23 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec15 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman38 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10345 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays33 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
"Node.js Development in 2024: trends and tools", Nikita Galkin by Fwdays
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin
Fwdays17 views

Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15

  • 1. Many Shades of Scale: Big Learning Beyond Big Data Misha Bilenko Principal Researcher Microsoft Azure Machine Learning
  • 2. ML ♥ More Data What we see in production [Banko and Brill, 2001] What we [used to] learn in school [Mooney, 1996]
  • 3. ML ♥ More Data What we see in production [Banko and Brill, 2001] Is training on more examples all there is to it?
  • 4. Big Learning ≠ Learning(BigData) • Big data: size → distributing storage and processing • Big learning: scale bottlenecks in training and prediction • Classic bottlenecks: bytes and cycles Large datasets → distribute training on larger hardware (FPGAs, GPUs, cores, clusters) • Other scaling dimensions Features Components/People
  • 5. Learning from Counts with DRACuLa (Distributed Robust Algorithm for Count-based Learning). Joint work with Chris Meek (MSR), Wenhan Wang, Pete Luferenko (Azure ML). Scaling to many Features
  • 6. Learning with relational data: $p(\text{click} \mid \text{ad}, \text{context}, \text{user})$
      adid = 1010054353
      adText = K2 ski sale!
      adURL = www.k2.com/sale
      Userid = 0xb49129827048dd9b
      IP = 131.107.65.14
      Query = powder skis
      QCategories = {skiing, outdoor gear}
      Cardinalities: #users ~ $10^{9}$, #queries ~ $10^{9}$+, #ads ~ $10^{7}$, #(ad × query) ~ $10^{10}$+
      • Information retrieval
        • Advertising, recommending, search: item, page/query, user
      • Transaction classification
        • Payment fraud: transaction, product, user
        • Email spam: message, sender, recipient
        • Intrusion detection: session, system, user
        • IoT: device, location
  • 7. Learning with relational data: $p(\text{click} \mid \text{user}, \text{context}, \text{ad})$
      adid: 1010054353
      adText: Fall ski sale!
      adURL: www.k2.com/sale
      userid: 0xb49129827048dd9b
      IP: 131.107.65.14
      query: powder skis
      qCategories: {skiing, outdoor gear}
      • Problem: representing high-cardinality attributes as features
        • Scalable: to billions of attribute values
        • Efficient: ~$10^{5}$+ predictions/sec/node
        • Flexible: for a variety of downstream learners
        • Adaptive: to distribution change
      • Standard approaches: binary features, hashing
      • What everyone should use in industry: learning with counts
        • Formalization and generalization
  • 8. Standard approach 1: binary (one-hot, indicator)
      Attributes are mapped to indices based on lookup tables
      - Not scalable: cannot support high-cardinality attributes
      - Not efficient: large value-index dictionary must be retained
      - Not flexible: only linear learners are practical
      - Not adaptive: doesn't support drift in attribute values
      [Figure: concatenated indicator blocks 0010000..00 | 0..01000000 | 00000..001 | 0..00001000, with block widths #userIPs, #ads, #queries, #queries × #ads; the set bits are $idx_u(131.107.65.14)$, $idx_q(\text{powder skis})$, $idx_a(\text{k2.com})$, $idx(\text{powder skis}, \text{k2.com})$]
  • 9. Standard approach 1+: feature hashing [Moody '89, Tarjan-Skadron '05, Weinberger+ '08]
      Attributes are mapped to indices via hashing: $h(x_i) = \mathrm{hash}(x_i) \bmod m$, with $m \sim 10^{7}$
      • Collisions are rare; dot products unbiased
      + Scalable: no mapping tables
      + Efficient: low cost, preserves sparsity
      - Not flexible: only linear learners are practical
      ± Adaptive: new values ok, no temporal effects
      [Figure: a single sparse vector $\phi(x)$ = 0000010..0000010000..0000010...000001000 whose set bits are $h(\text{powder skis} + \text{k2.com})$, $h(\text{powder skis})$, $h(\text{k2.com})$, $h(131.107.65.14)$]
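The hashing rule on slide 9 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: `zlib.crc32` stands in for the unspecified hash function, the `name=value` key format and the function name `hash_features` are made up, and the signed-hash trick that makes dot products unbiased is left out for brevity.

```python
import zlib

M = 10**7  # number of hash buckets; the slide gives m ~ 10^7


def hash_features(attrs, m=M):
    """Map attribute name/value pairs to a sparse {index: value} vector.

    Each pair is hashed to an index h(x_i) = hash(x_i) mod m, so no
    value-to-index lookup table has to be stored.
    """
    vec = {}
    for name, value in attrs.items():
        key = f"{name}={value}".encode("utf-8")
        idx = zlib.crc32(key) % m  # crc32 is an assumed stand-in hash
        vec[idx] = vec.get(idx, 0.0) + 1.0  # colliding keys simply add up
    return vec


# Attribute values taken from the slide's running example.
example = {
    "ip": "131.107.65.14",
    "query": "powder skis",
    "ad": "k2.com",
    "query_x_ad": "powder skis|k2.com",
}
phi = hash_features(example)
```

The result has at most four non-zero entries out of `M`, which is the sparsity the slide refers to; the vector is still only practical for linear learners because of its dimensionality.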
  • 10. Learning with counts
      • Features are per-label counts [+odds] [+backoff]: $\phi = [N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \text{IsRest}]$
      • $\log N^{+} - \log N^{-} = \log \frac{p(+)}{p(-)}$: log-odds/Naïve Bayes estimate
      • $N^{+}$, $N^{-}$: indicators of confidence of the naïve estimate
      • IsFromRest: indicator of back-off vs. "real count"
      [Figure: each attribute value is looked up in its count table and featurized: 131.107.65.14 → Counts(131.107.65.14) → φ(Counts(IP)); k2.com → Counts(k2.com) → φ(Counts(ad)); powder skis → Counts(powder skis) → φ(Counts(query)); (powder skis, k2.com) → Counts(powder skis, k2.com) → φ(Counts(query, ad))]
      IP               N+        N-
      173.194.33.9     46964     993424
      87.250.251.11    31        843
      131.107.65.14    12        430
      …                …         …
      REST             745623    13964931
  • 11. Learning with counts
      • Features are per-label counts [+odds] [+backoff]: $\phi = [N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \text{IsRest}]$
      + Scalable: "head" in memory + tail in backoff; or: count-min sketch
      + Efficient: low cost, low dimensionality
      + Flexible: low dimensionality works well with non-linear learners
      + Adaptive: new values easily added, back-off for infrequent values, temporal counts
      Feature vector: φ(Counts(user)), φ(Counts(ad)), φ(Counts(query)), φ(C(query × ad))
      [Figure and IP count table: same as slide 10]
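As a rough illustration of the featurization on slides 10 and 11, the sketch below looks a value up in a count table and falls back to the REST row when the value is absent. The counts are copied from the slide's IP table; the names `IP_COUNTS`, `IP_REST` and `count_features` are hypothetical, and the code assumes both counts are positive (a real system would need smoothing before taking logs).

```python
import math

# (N+, N-) per IP, copied from the slide's example table.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
IP_REST = (745623, 13964931)  # back-off bin for all other values


def count_features(value, table, rest):
    """Return phi = [N+, N-, log(N+) - log(N-), IsRest] for one attribute."""
    is_rest = value not in table
    n_pos, n_neg = rest if is_rest else table[value]
    # Assumes n_pos > 0 and n_neg > 0; math.log raises ValueError otherwise.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, 1.0 if is_rest else 0.0]


phi_known = count_features("131.107.65.14", IP_COUNTS, IP_REST)
phi_unseen = count_features("10.0.0.1", IP_COUNTS, IP_REST)
```

Each attribute contributes only four numbers regardless of its cardinality, which is why the concatenated vector stays small enough for non-linear learners.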
  • 12. Backoff is a pain. Count-Min Sketches to the Rescue! [Cormode-Muthukrishnan '04]
      Intuition: correct for collisions by using multiple hashes
      Sketch: matrix $M$ of size $d \times w$ (one row per hash function)
      Count: for each hash function $h_j$, $M[j][h_j(i)]$ ++ (update time: $O(d)$)
      Featurize: $\min_j M[j][h_j(i)]$ (estimation time: $O(d)$)
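A minimal count-min sketch following the update and estimate rules on slide 12 might look like this. The class name, the default sizes, and the use of salted SHA-256 as the family of hash functions are assumptions made for the illustration; the slide does not say which hashes are used.

```python
import hashlib


class CountMinSketch:
    """d x w matrix of counters; each row uses its own hash function."""

    def __init__(self, depth=4, width=1000):
        self.d = depth
        self.w = width
        self.M = [[0] * width for _ in range(depth)]

    def _index(self, j, item):
        # Row-specific hash h_j(item): SHA-256 salted with the row number
        # (an assumed choice; any family of independent hashes would do).
        digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.w

    def update(self, item, count=1):
        """Count: M[j][h_j(item)] += count for every row j, O(d)."""
        for j in range(self.d):
            self.M[j][self._index(j, item)] += count

    def estimate(self, item):
        """Featurize: min over rows of M[j][h_j(item)], O(d)."""
        return min(self.M[j][self._index(j, item)] for j in range(self.d))


cms = CountMinSketch()
for query in ["facebook", "facebook", "dozen roses", "facebook"]:
    cms.update(query)
```

Collisions can only inflate a counter, so the estimate never falls below the true count; taking the minimum across rows keeps the overestimate small. For per-label counts one would keep one sketch per label.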
  • 13. Learning from counts: aggregation
      Aggregate $Count(y, bin(x))$ for different $bin(x)$
      • Standard MapReduce
      • Bin function: any projection
      • Backoff options: "tail bin", hashing, hierarchical (shrinkage)
      [Figure: timeline ending at T_now, labeled "Counting"]
      IP               N+        N-
      173.194.33.9     46964     993424
      87.250.251.11    31        843
      131.253.13.32    12        430
      …                …         …
      REST             745623    13964931

      query            N+        N-
      facebook         281912    7957321
      dozen roses      32791     640964
      …                …         …
      REST             6321789   43477252

      Query × AdId       N+        N-
      facebook, ad1      54546     978964
      facebook, ad2      232343    8431467
      dozen roses, ad3   12973     430982
      …                  …         …
      REST               4419312   52754683

      IP[2]            N+        N-
      173.194.*.*      46964     993424
      87.250.*.*       6341      91356
      131.253.*.*      75126     430826
      …                …         …
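The aggregation step on slide 13 amounts to a group-by over a bin function. The single-process sketch below shows the idea with two bins, the raw IP and its two-octet prefix (the IP[2] table); in production this would be a MapReduce job as the slide says. The function names and the four toy log records are made up for illustration.

```python
from collections import defaultdict


def ip_prefix(ip, parts=2):
    """Bin function: keep the first `parts` octets, e.g. 173.194.*.*"""
    octets = ip.split(".")
    return ".".join(octets[:parts] + ["*"] * (4 - parts))


def aggregate(examples, bin_fn):
    """Aggregate Count(y, bin(x)): returns {bin: [N+, N-]}."""
    table = defaultdict(lambda: [0, 0])
    for x, clicked in examples:
        table[bin_fn(x)][0 if clicked else 1] += 1
    return dict(table)


# Toy (ip, clicked) records, invented for this example.
examples = [
    ("173.194.33.9", True),
    ("173.194.33.9", False),
    ("173.194.1.1", False),
    ("87.250.251.11", True),
]
by_ip = aggregate(examples, lambda ip: ip)
by_prefix = aggregate(examples, ip_prefix)
```

Coarser bins such as the prefix pool evidence across rare values, which is one way to implement the hierarchical back-off option listed on the slide.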
  • 14. Learning from counts: combiner training
      Train non-linear model on count-based features
      • Counts, transforms, lookup properties
      • Additional features can be injected
      [Figure: timeline ending at T_now with "Counting" and "Train predictor" phases; the predictor's inputs are the aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) plus the original numeric features]
      [Count tables for IP, query, and Query × AdId: same as slide 13]
  • 15. Prediction with counts
      • Counts are updated continuously
      • Combiner re-training infrequent
      [Figure: timeline marked T_train and T_now, with "Counting" continuing over time; the predictor's inputs are the aggregated features ($N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, …) plus the original numeric features]
      [Count tables for IP and query: same as slide 13]
      URL × Country    N+        N-
      url1, US         54546     978964
      url2, CA         232343    8431467
      url3, FR         12973     430982
      …                …         …
      REST             4419312   52754683
  • 16. Where did it come from? Li et al. 2010; Pavlov et al. 2009; Lee et al. 1998; Yeh and Patt, 1991; Hillard et al. 2011 • De-facto standard in online advertising industry • Rediscovered by everyone who really cares about accuracy
  • 17. Do we need to separate counting and training?
      • Can we use the same data for both counting and featurization?
        • Bad idea: leakage = count features contain labels → overfitting
        • Combiner dedicates capacity to decoding example's label from features
      • Can we hold out each example's label during train-set featurization?
        • Bad idea: leakage and bias
        • Illustration: two examples, same feature values, different labels (click and non-click)
        • Different representations are inconsistent and allow decoding the label
      Example ID   Label   N+[a]           N-[a]
      1            +       $N_a^{+} - 1$   $N_a^{-}$
      2            -       $N_a^{+}$       $N_a^{-} - 1$
      [Figure: "Counting" phase followed by "Train predictor" phase]
  • 18. Solution via Differential privacy
      • What is leakage? Revealing information about any individual label
      • Formally: count table $c_T$ is $\epsilon$-leakage-proof if it yields the same features for $\forall x, T, T' = T \setminus (x_i, y_i)$
      • Theorem: adding noise sampled from Laplace($k/\epsilon$) makes counts $\epsilon$-leakage-proof
        • Typically $1 \le k \le 100$
      • Concretely: N+ = N+ + LaplaceRand(0, 10k), N- = N- + LaplaceRand(0, 10k)
      • In practice: LaplaceRand(0, 1) sufficient
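The noise step on slide 18 can be sketched with the standard library alone. The slide's `LaplaceRand` is not defined there, so the `(location, scale)` signature used below is an assumption, as are the function names; the sampler uses the fact that the difference of two independent exponential variables with mean `scale` is Laplace-distributed with that scale.

```python
import random


def laplace_rand(loc, scale, rng=random):
    """Draw one sample from Laplace(loc, scale).

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution centred at 0 with that scale.
    """
    return loc + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(n_pos, n_neg, scale=1.0, rng=random):
    """Perturb both counts before featurization (slide: N = N + LaplaceRand)."""
    return (n_pos + laplace_rand(0.0, scale, rng),
            n_neg + laplace_rand(0.0, scale, rng))


rng = random.Random(0)  # fixed seed so the example is repeatable
n_pos, n_neg = noisy_counts(12, 430, scale=1.0, rng=rng)
```

With small true counts the perturbed values can drop to zero or below, so they would need clamping or smoothing before a log-odds feature is computed from them.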
  • 19. Learning from counts: why it works • State-of-the-art accuracy • Easy to implement on standard clusters • Monitorable and debuggable • Temporal changes easy to monitor • Easy emergency recovery (bot attacks, etc.) • Error debugging (which feature to blame) • Modular (vs. monolithic) • Components: learners and count features • People: multiple feature/learner authors
  • 20. Big Learning: Pipelines and Teams Ravi: text features in R Jim: matrix projections Vera: sweeping boosted trees Steph: count features on Hadoop How to scale up Machine Learning to Parallel and Distributed Data Scientists?
  • 21. AzureML • Cloud-hosted, graphical environment for creating, training, evaluating, sharing, and deploying machine learning models • Supports versioning and collaboration • Dozens of ML algorithms, extensible via R and Python
  • 23. Learning with Counts in Azure ML • Criteo 1TB dataset • Counting: an hour on HDInsight Hadoop cluster • Training: minutes in AzureML Studio • Deployment: one click to RRS service
  • 24. Maximizing Utilization: Keeping it Asynchronous • Macro-level: concurrently executing pipelines • Micro-level: asynchronous optimization (with overwriting updates) • Hogwild SGD [Recht-Re], Downpour SGD [Google Brain] • Parameter Server [Smola et al.] • GraphLab [Guestrin et al.] • SA-SDCA [Tran, Hosseini, Xiao, Finley, B.]
  • 25. Semi-Asynchronous SDCA: state-of-the-art linear learning
      • SDCA: Stochastic Dual Coordinate Ascent [Shalev-Shwartz & Zhang]
        • Plot: SGD marries SVM and they have a beautiful baby
        • Algorithm: for each example, update the example's $\alpha_i$, then re-estimate the weights
      • Let's make it asynchronous, Hogwild-style!
        • Problem: primal and dual diverge
        • Solution: separate thread for primal-dual synchronization
      • Taking it out-of-memory: block pseudo-random data loading
      SGD update: $w^{t+1} \leftarrow w^{t} - \gamma_t \left( \lambda w^{t} - y_i \phi_i'(w^{t} \cdot x_i)\, x_i \right)$
      SDCA update: $\alpha_i^{t} \leftarrow \alpha_i^{t-1} + \Delta\alpha_i$, then $w^{t} \leftarrow w^{t-1} + \frac{\Delta\alpha_i}{\lambda n} x_i$
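To make the SDCA update on slide 25 concrete, here is a plain sequential sketch, not the semi-asynchronous variant from the talk. The slide does not fix a loss function; the code assumes the hinge loss (SVM), for which the coordinate step has a well-known closed form, and all names and the toy data are made up for the example.

```python
import random


def sdca_hinge(X, y, lam=0.1, epochs=20, seed=0):
    """Sequential SDCA for an L2-regularized linear SVM (hinge loss).

    Maintains dual variables alpha_i and the primal weights
    w = (1 / (lam * n)) * sum_i alpha_i * x_i.
    """
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            xi, yi = X[i], y[i]
            sq_norm = sum(v * v for v in xi)
            if sq_norm == 0.0:
                continue  # an all-zero example cannot change w
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Closed-form coordinate maximizer for the hinge loss,
            # clipped so that alpha_i * y_i stays in [0, 1].
            proj = (1.0 - margin) * lam * n / sq_norm + alpha[i] * yi
            delta = yi * max(0.0, min(1.0, proj)) - alpha[i]
            alpha[i] += delta                        # alpha_i <- alpha_i + delta
            step = delta / (lam * n)
            w = [wj + step * xj for wj, xj in zip(w, xi)]  # w <- w + delta/(lam n) x_i
    return w, alpha


# Tiny linearly separable toy problem, invented for illustration.
X = [[1.0, 0.0], [2.0, 0.5], [-1.0, 0.0], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w, alpha = sdca_hinge(X, y)
```

Because every step changes `alpha[i]` and `w` together, the two stay consistent here; the primal-dual divergence mentioned on the slide arises once several threads apply such updates without locking.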
  • 27. In closing: Big Learning = Streetfighting • Big features are resource-hungry: learning with counts, projections… • Make them distributed and easy to compute/monitor • Big learners are resource-hungry • Parallelize them (preferably asynchronously) • Big pipelines are resource-hungry: authored by many humans • Run them in a collaborative cloud environment