3. Advertisers select keywords describing their
product or service;
An ad is eligible to appear on the search engine
results page if the ad's keyword is a subset of
the user's query (a sketch of this rule follows
the examples below)
Example: keyword = “digital camera”
Possible queries:
“buy digital camera”
“cheap digital camera”
“digital camera samsung”
“digital camera magazine”
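A minimal Python sketch of this word-level eligibility rule
(illustrative only; the function name and word-level matching
are assumptions, not Yandex's actual matcher):

def is_eligible(keyword, query):
    # eligible when every keyword word occurs among the query words
    return set(keyword.lower().split()) <= set(query.lower().split())

print(is_eligible("digital camera", "buy digital camera"))  # True
print(is_eligible("digital camera", "cheap camera"))        # False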
4. The advertiser is charged each time a user
clicks their ad;
Advertisers report their bids;
Advertisers are selected via the Generalized
Second-Price Auction;
Revenue of Yandex ≈ ∑_i P(click_i) · bid_i
The goal is to find P(click | x), where x is a vector of
all available input features
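A toy numeric illustration of this revenue approximation, with
hypothetical probabilities and bids (under GSP the actual charge
depends on the next bid, hence the ≈):

p_click = [0.04, 0.02, 0.01]  # predicted click probabilities (hypothetical)
bids    = [1.50, 0.80, 0.30]  # advertiser bids (hypothetical)
expected_revenue = sum(p * b for p, b in zip(p_click, bids))
print(expected_revenue)  # ≈ 0.079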
5. The most important input features are the
historical click-through rates (CTR)
Example of input features:
CTR(ad) = clicks(ad) / views(ad)
CTR(web site) = clicks(web site) / views(web site)
….
Text relevance of query and ad’s text
User behavior features
There are 54 real-valued features in total
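A minimal sketch of how such historical CTR features can be computed
from an impression log (illustrative only; the log format and names
are assumptions, not Yandex's pipeline):

from collections import Counter

views, clicks = Counter(), Counter()
log = [("ad_1", 1), ("ad_1", 0), ("ad_1", 0), ("ad_2", 1)]  # hypothetical (ad, clicked) log
for ad_id, clicked in log:
    views[ad_id] += 1
    clicks[ad_id] += clicked

def ctr(ad_id):
    # CTR(ad) = clicks(ad) / views(ad)
    return clicks[ad_id] / views[ad_id] if views[ad_id] else 0.0

print(ctr("ad_1"))  # 0.333...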
6. Query: “cheap digital camera”
We selected 3.4×10^6 binary text-based features:
x_{1,k} = 1 if word_k ∈ keyword, otherwise x_{1,k} = 0
x_{2,k} = 1 if word_k ∈ residual of query, otherwise x_{2,k} = 0
x_{k,m} = 1 if (word_k ∈ query) & (word_m ∈ residual of query),
otherwise x_{k,m} = 0
(here keyword = "digital camera", residual of query = "cheap")
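A sketch of these three feature families at the word level
(illustrative only; the function and the string-keyed feature
names are assumptions):

def binary_text_features(query, keyword):
    q_words = query.lower().split()
    k_words = keyword.lower().split()
    residual = [w for w in q_words if w not in k_words]
    feats = set()
    feats.update(("kw", w) for w in k_words)      # x_{1,k}
    feats.update(("res", w) for w in residual)    # x_{2,k}
    feats.update(("pair", wq, wr)                 # x_{k,m}
                 for wq in q_words for wr in residual)
    return feats

print(binary_text_features("cheap digital camera", "digital camera"))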
7. The state-of-the-art solution for the click prediction problem
is to use a composition of boosted decision trees:
P(click | x) = 1 / (1 + exp(−∑_{i=1}^{n} f(a_i, x)))
where each f(a_i, x) is a decision tree
Works well for < 1000 real-valued features on big datasets
(> 1 million examples)
The problem: we want to use millions of binary features
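A minimal sketch of this prediction form (the stump lambdas and
feature names are hypothetical stand-ins for fitted trees):

import numpy as np

def p_click(x, trees):
    # P(click | x) = 1 / (1 + exp(-sum_i f(a_i, x)))
    score = sum(tree(x) for tree in trees)
    return 1.0 / (1.0 + np.exp(-score))

trees = [lambda x: 0.8 if x["ctr_ad"] > 0.05 else -0.3,
         lambda x: 0.4 if x["ctr_site"] > 0.02 else -0.1]
print(p_click({"ctr_ad": 0.07, "ctr_site": 0.01}, trees))  # sigmoid(0.7) ≈ 0.67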
8. The mixed model is a composition of the
decision trees and the logistic regression,
which are fitted sequentially:
P(click | x) = 1 / (1 + exp(−(∑_{i=1}^{n} f(a_i, x) + ∑_{j=1}^{m} β_j z_j)))
where z_j are the binary features and β_j their coefficients
1. Fit the trees f(a_i, x) by means of the boosting;
2. Fit the coefficients β_j as a logistic regression with
L1-regularization (penalty ∑_{j=1}^{m} |β_j|)
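A sketch of step 2 under the assumption that the fixed tree score
enters as an additive offset; proximal gradient (ISTA) is used here
only for illustration and is not one of the solvers listed later:

import numpy as np

def fit_l1_logreg_with_offset(Z, y, offset, lam=0.01, lr=0.1, n_iter=1000):
    # Minimize mean logistic loss of (offset + Z @ beta) + lam * ||beta||_1,
    # keeping 'offset' (the boosted-tree score) fixed.
    n, m = Z.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(offset + Z @ beta)))
        beta = beta - lr * (Z.T @ (p - y) / n)
        # soft-thresholding is the proximal step for the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# hypothetical tiny example: 4 examples, 3 binary features
Z = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
offset = np.array([0.2, -0.1, 0.3, -0.2])  # tree scores, kept fixed
print(fit_l1_logreg_with_offset(Z, y, offset))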
9. For fitting the composition of decision trees we
used MatrixNet
MatrixNet is a proprietary machine learning
algorithm which is a modification of the Gradient
Boosting Machine (GBM) with stochastic boosting
(Friedman, 2002), (Gulin, 2010) (in Russian)
The training set was randomly sampled from
a one-week log of user search sessions
Training set: 3×10^6 examples,
54 real-valued features
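MatrixNet itself is proprietary; as a rough, publicly available
stand-in, stochastic gradient boosting (Friedman, 2002) can be run
with scikit-learn's GradientBoostingClassifier using subsample < 1
(a sketch on synthetic data, not MatrixNet and not Yandex's setup):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 54))  # 54 real-valued features, synthetic
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10000) > 0).astype(int)

# subsample < 1.0 gives the stochastic variant of gradient boosting
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1, subsample=0.5)
gbm.fit(X, y)
print(gbm.predict_proba(X[:5])[:, 1])

The raw score gbm.decision_function(X) could then play the role of
the fixed tree term in the mixed model above.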
10. 1. Cyclic coordinate descent
Implemented in BBR (Genkin et al., 2007)
http://www.bayesianregression.org/
2. Online learning via truncated gradient
(a sketch of one update follows this list)
Implemented in Vowpal Wabbit (Langford et al., 2009)
https://github.com/JohnLangford/vowpal_wabbit
3. Reducing L1-regularization to L2-regularization
(η-trick)
(Jenatton et al., 2009)
Vowpal Wabbit can be used for solving L2-regularized
logistic regression
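A sketch of a single truncated-gradient update for L1-regularized
logistic regression, assuming truncation is applied at every step
(K = 1 in the notation of Langford et al., 2009); the function and
parameter names are illustrative:

import numpy as np

def truncated_gradient_step(w, x, y, eta=0.1, g=0.01, theta=1.0):
    # gradient step on the logistic loss
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    w = w - eta * (p - y) * x
    # truncate small weights towards zero (the L1-inducing step)
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - eta * g, 0.0)
    return w

w = np.zeros(5)
stream = [(np.array([1., 0., 1., 0., 0.]), 1),
          (np.array([0., 1., 0., 1., 0.]), 0)]
for x, y in stream * 100:
    w = truncated_gradient_step(w, x, y)
print(w)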
11. The datasets were randomly sampled from
a one-week log of user search sessions
Training set: 67×10^6 examples
Test set: 5×10^6 examples
3.4×10^6 unique binary features
Only features with non-zero values in more than
10 training examples were kept
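A sketch of this pruning rule (the data layout, one list of active
feature names per example, is an assumption):

from collections import Counter

def prune_rare_features(examples, min_count=10):
    # keep only features that are non-zero in more than min_count examples
    counts = Counter(f for feats in examples for f in feats)
    kept = {f for f, c in counts.items() if c > min_count}
    return [[f for f in feats if f in kept] for feats in examples]

examples = [["kw:digital", "res:cheap"], ["kw:digital"]] * 20  # hypothetical
print(prune_rare_features(examples)[:2])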
13. We selected the model with 2966 non-zero
features (BBR with regularization parameter 1/100)
14. Words from the residual of a query which
increase the probability of click
(translated to English):
Word β
gold +0.52
necessary +0.32
market +0.23
used +0.20
effective +0.19
15. Words from the residual of a query which
decrease the probability of click
(translated to English):
Word β
vacancy -0.40
review -0.34
site -0.33
size -0.15
which -0.14
16. J. Friedman. Greedy function approximation: A gradient
boosting machine. Technical report, Dept. of Statistics,
Stanford University, 1999.
A. Gulin. MatrixNet. Technical report,
http://www.ashmanov.com/arc/searchconf2010/08gulin-
searchconf2010.ppt, 2010 (in Russian).
A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale
Bayesian Logistic Regression for Text Categorization.
Technometrics, 49(3):291–304, Aug. 2007.
J. Langford, L. Li, and T. Zhang. Sparse Online Learning via
Truncated Gradient. Journal of Machine Learning
Research, 10:777–801, 2009.
R. Jenatton, G. Obozinski, and F. Bach. Structured Sparse
Principal Component Analysis, 2009.