Text-based online advertising: L1 Regularization for Feature Selection

1. Ilya Trofimov, Yandex
   trofim@yandex-team.ru
   Yandex School of Data Analysis conference, Machine Learning and Very Large Data Sets 2013
2. [Slide image: a search engine results page showing the user's query, the ads, and the organic results]
3. • Advertisers select keywords describing their product or service;
   • An ad is eligible to appear on the search engine results page if the ad's keyword is a subset of the user's query
   • Example: keyword = “digital camera”
   • Possible queries:
     • “buy digital camera”
     • “cheap digital camera”
     • “digital camera samsung”
     • “digital camera magazine”
4. • An advertiser is charged each time their ad is clicked by a user;
   • Advertisers report their bids;
   • Advertisers are selected via the Generalized Second-Price Auction;
   • Revenue of Yandex ≈ Σ_i P(click_i) · bid_i (a toy computation follows below)
   • The goal is to find P(click | x), where x is the vector of all available input features
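As a hedged illustration (the function name and the data are hypothetical, not Yandex's auction code), the expected-revenue objective above can be computed like this:

```python
# Toy sketch: expected revenue as the sum of P(click_i) * bid_i over shown ads.
# `ads` is a hypothetical list of (predicted click probability, bid) pairs.
def expected_revenue(ads):
    return sum(p_click * bid for p_click, bid in ads)

# Two ads with predicted CTRs 0.05 and 0.02 and bids 30 and 50:
print(expected_revenue([(0.05, 30.0), (0.02, 50.0)]))  # 0.05*30 + 0.02*50 = 2.5
```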
5. • The most important input features are the historical click-through rates (CTRs)
   • Examples of input features (a counting sketch follows below):
     • CTR(ad) = clicks(ad) / views(ad)
     • CTR(web site) = clicks(web site) / views(web site)
     • …
   • Text relevance of the query and the ad's text
   • User behavior features
   • There are 54 real-valued features in total
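A minimal counting sketch of the historical CTR features, assuming a log of (ad_id, clicked) events; real systems typically smooth these counters, which the slide does not cover:

```python
from collections import defaultdict

views = defaultdict(int)   # impressions per ad
clicks = defaultdict(int)  # clicks per ad

log = [("ad1", True), ("ad1", False), ("ad1", False), ("ad2", True)]
for ad_id, clicked in log:
    views[ad_id] += 1
    clicks[ad_id] += int(clicked)

def ctr(ad_id):
    """Historical CTR(ad) = clicks(ad) / views(ad)."""
    return clicks[ad_id] / views[ad_id] if views[ad_id] else 0.0

print(ctr("ad1"))  # 1/3 ≈ 0.333
```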
6. • Query: “cheap digital camera”; matched keyword: “digital camera”; residual of the query: “cheap”
   • We selected 3.4·10^6 binary text-based features (a construction sketch follows below):
     • x^1_k = 1 if word_k ∈ keyword, 0 otherwise
     • x^2_k = 1 if word_k ∈ residual of the query, 0 otherwise
     • x_km = 1 if word_k ∈ query & word_m ∈ residual of the query, 0 otherwise
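A sketch of the three binary feature families on the slide's own example (query “cheap digital camera”, keyword “digital camera”); the dictionary names are illustrative, not from the talk:

```python
query = "cheap digital camera".split()
keyword = "digital camera".split()
# Residual of the query: query words not covered by the keyword -> ["cheap"]
residual = [w for w in query if w not in keyword]

# x^1_k = 1 if word_k is in the keyword
keyword_feats = {("kw", w): 1 for w in keyword}
# x^2_k = 1 if word_k is in the residual of the query
residual_feats = {("res", w): 1 for w in residual}
# x_km = 1 if word_k is in the query and word_m is in the residual of the query
pair_feats = {("pair", k, m): 1 for k in query for m in residual}

print(keyword_feats, residual_feats, pair_feats, sep="\n")
```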
7. • The state-of-the-art solution for the click prediction problem is to use a composition of boosted decision trees:
     P(click | x) = 1 / (1 + exp(−Σ_{i=1..n} f(x, a_i))), where f(x, a_i) is a decision tree
   • Works well for < 1000 real-valued features on big datasets (> 1 million examples)
   • The problem: we want to use millions of binary features (a stand-in sketch follows below)
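MatrixNet itself is proprietary, so as a stand-in, here is a minimal sketch of the same model shape using scikit-learn's gradient boosting on synthetic data; all names and numbers below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 54))  # 54 real-valued features, as in the talk
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# subsample < 1.0 gives stochastic boosting in the sense of Friedman (2002)
gbm = GradientBoostingClassifier(n_estimators=100, subsample=0.5)
gbm.fit(X, y)

# decision_function returns the raw additive score sum_i f(x, a_i);
# the click probability is the logistic sigmoid of that score.
scores = gbm.decision_function(X)
p_click = 1.0 / (1.0 + np.exp(-scores))
```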
8. • The mixed model is a composition of the decision trees and the logistic regression, which are fitted sequentially (a sketch of step 2 follows below):
     P(click | x) = 1 / (1 + exp(−(Σ_{i=1..n} f(x, a_i) + Σ_{j=1..m} β_j z_j)))
     with the L1 penalty λ Σ_{j=1..m} |β_j|
     1. Fit f(x, a_i) by means of boosting;
     2. Fit β_j as a logistic regression with L1 regularization
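A minimal sketch of step 2, assuming the tree score is kept fixed as an offset: L1-regularized logistic regression solved by proximal gradient descent (soft-thresholding). This is only one way to solve the problem; the talk itself used BBR and Vowpal Wabbit (next slide). All names and constants are illustrative:

```python
import numpy as np

def fit_l1_logreg_with_offset(Z, y, offset, lam=0.1, lr=0.1, n_iter=500):
    """Fit beta in sigmoid(offset + Z @ beta) with penalty lam * ||beta||_1."""
    n, m = Z.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(offset + Z @ beta)))  # predicted P(click)
        grad = Z.T @ (p - y) / n                        # gradient of mean log-loss
        beta -= lr * grad
        # Soft-thresholding: the proximal operator of the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# Synthetic demo: Z = sparse binary features, offset = fixed boosted-tree scores
rng = np.random.default_rng(1)
Z = (rng.random((500, 20)) < 0.1).astype(float)
offset = rng.normal(size=500)
y = (offset + Z[:, 0] - Z[:, 1] + rng.normal(size=500) > 0).astype(float)
beta = fit_l1_logreg_with_offset(Z, y, offset)
print((beta != 0).sum(), "non-zero coefficients")
```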
9. • For fitting the composition of decision trees we used MatrixNet
   • MatrixNet is a proprietary machine learning algorithm, a modification of the Gradient Boosting Machine (GBM) with stochastic boosting (Friedman, 2002), (Gulin, 2010, in Russian)
   • The training set was randomly sampled from a one-week log of user search sessions
   • Training set: 3·10^6 examples
   • 54 real-valued features
10. Solvers for L1-regularized logistic regression:
   1. Cyclic coordinate descent; implemented in BBR (Genkin et al., 2007), http://www.bayesianregression.org/
   2. Online learning via truncated gradient; implemented in Vowpal Wabbit (Langford et al., 2009), https://github.com/JohnLangford/vowpal_wabbit (a simplified sketch follows below)
   3. Reducing L1 regularization to L2 regularization (the η-trick) (Jenatton et al., 2009); Vowpal Wabbit can then be used for solving the L2-regularized logistic regression
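A simplified sketch of method 2, the truncated gradient of Langford et al. (2009): here the truncation is applied on every step (K = 1) with an unbounded threshold, in which case it reduces to soft-thresholding the weights after each SGD update. Variable names are illustrative:

```python
import numpy as np

def truncate(w, alpha):
    """Shrink each weight toward zero by alpha, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

def sgd_truncated_gradient(stream, dim, eta=0.1, g=0.01):
    w = np.zeros(dim)
    for x, y in stream:                      # y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-(w @ x)))   # logistic prediction
        w -= eta * (p - y) * x               # SGD step on log-loss
        w = truncate(w, eta * g)             # L1-style truncation
    return w

rng = np.random.default_rng(2)
data = [(x, int(x[0] > 0)) for x in rng.normal(size=(200, 5))]
w = sgd_truncated_gradient(data, dim=5)
print(w)
```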
11. • The datasets were randomly sampled from a one-week log of user search sessions
   • Training set: 67·10^6 examples
   • Test set: 5·10^6 examples
   • 3.4·10^6 unique binary features
   • Only features with non-zero values in more than 10 training examples were kept (a filtering sketch follows below)
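A small sketch of the frequency filter in the last bullet, assuming each training example is represented by the set of its non-zero binary feature names (the names are hypothetical):

```python
from collections import Counter

def frequent_features(examples, min_count=10):
    """Keep features that are non-zero in more than `min_count` examples."""
    counts = Counter()
    for feats in examples:
        counts.update(feats)
    return {f for f, c in counts.items() if c > min_count}

examples = [{"kw:camera", "res:cheap"}, {"kw:camera"}] * 8  # toy corpus
print(frequent_features(examples))  # {'kw:camera'}: seen 16 times (> 10)
```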
12. [Chart: ΔauPRC, % (scale 0.00 to 3.50) versus the number of non-zero coefficients, comparing BBR L1, batch LBFGS L2, VW L1 (1 epoch), VW L1 (8 epochs), and the η-trick]
13. • We selected the model with 2966 non-zero features
   • BBR with λ1 = 100
14. Words from the residual of a query which increase the probability of a click (translated to English):

   Word        β
   gold        +0.52
   necessary   +0.32
   market      +0.23
   used        +0.20
   effective   +0.19
15. Words from the residual of a query which decrease the probability of a click (translated to English):

   Word      β
   vacancy   -0.40
   review    -0.34
   site      -0.33
   size      -0.15
   which     -0.14
16. References
   • J. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Technical Report, Dept. of Statistics, Stanford University, 1999.
   • A. Gulin. Matrixnet. Technical report, http://www.ashmanov.com/arc/searchconf2010/08gulin-searchconf2010.ppt, 2010 (in Russian).
   • A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics, 49(3):291–304, Aug. 2007.
   • J. Langford, L. Li, and T. Zhang. Sparse Online Learning via Truncated Gradient. Journal of Machine Learning Research, 10:777–801, 2009.
   • R. Jenatton, G. Obozinski, and F. Bach. Structured Sparse Principal Component Analysis. 2009.
