3. Advertisers select keywords describing their
product or service;
An ad is eligible to appear on the search engine
results page if the ad's keyword is a subset of
the user's query (a sketch of this rule follows
the examples below)
Example: keyword = “digital camera”
Possible queries:
“buy digital camera”
“cheap digital camera”
“digital camera samsung”
“digital camera magazine”
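A minimal Python sketch of this word-level eligibility rule
(illustrative only; the function name and word-level matching
are assumptions, not Yandex's actual matcher):

def is_eligible(keyword, query):
    # eligible when every keyword word occurs among the query words
    return set(keyword.lower().split()) <= set(query.lower().split())

print(is_eligible("digital camera", "buy digital camera"))  # True
print(is_eligible("digital camera", "cheap camera"))        # False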
4. The advertiser is charged each time a user
clicks their ad;
Advertisers report their bids;
Advertisers are selected via the Generalized
Second-Price Auction;
Revenue of Yandex ≈ ∑_i P(click_i) · bid_i
The goal is to find P(click | x), where x is a vector of
all available input features
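A toy numeric illustration of this revenue approximation, with
hypothetical probabilities and bids (under GSP the actual charge
depends on the next bid, hence the ≈):

p_click = [0.04, 0.02, 0.01]  # predicted click probabilities (hypothetical)
bids    = [1.50, 0.80, 0.30]  # advertiser bids (hypothetical)
expected_revenue = sum(p * b for p, b in zip(p_click, bids))
print(expected_revenue)  # ≈ 0.079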
5. The most important input features are the
historical click-through rates (CTR)
Example of input features:
CTR(ad) = clicks(ad) / views(ad)
CTR(web site) = clicks(web site) / views(web site)
….
Text relevance of query and ad’s text
User behavior features
There are 54 real-valued features in total
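A minimal sketch of how such historical CTR features can be computed
from an impression log (illustrative only; the log format and names
are assumptions, not Yandex's pipeline):

from collections import Counter

views, clicks = Counter(), Counter()
log = [("ad_1", 1), ("ad_1", 0), ("ad_1", 0), ("ad_2", 1)]  # hypothetical (ad, clicked) log
for ad_id, clicked in log:
    views[ad_id] += 1
    clicks[ad_id] += clicked

def ctr(ad_id):
    # CTR(ad) = clicks(ad) / views(ad)
    return clicks[ad_id] / views[ad_id] if views[ad_id] else 0.0

print(ctr("ad_1"))  # 0.333...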
6. Query: “cheap digital camera”
We selected 3.4×10^6 binary text-based features:
x_{1,k} = 1 if word_k ∈ keyword, otherwise x_{1,k} = 0
x_{2,k} = 1 if word_k ∈ residual of query, otherwise x_{2,k} = 0
x_{k,m} = 1 if (word_k ∈ query) & (word_m ∈ residual of query),
otherwise x_{k,m} = 0
(here keyword = "digital camera", residual of query = "cheap")
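A sketch of these three feature families at the word level
(illustrative only; the function and the string-keyed feature
names are assumptions):

def binary_text_features(query, keyword):
    q_words = query.lower().split()
    k_words = keyword.lower().split()
    residual = [w for w in q_words if w not in k_words]
    feats = set()
    feats.update(("kw", w) for w in k_words)      # x_{1,k}
    feats.update(("res", w) for w in residual)    # x_{2,k}
    feats.update(("pair", wq, wr)                 # x_{k,m}
                 for wq in q_words for wr in residual)
    return feats

print(binary_text_features("cheap digital camera", "digital camera"))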
7. The state-of-the-art solution for the click prediction problem
is to use a composition of boosted decision trees:
P(click | x) = 1 / (1 + exp(−∑_{i=1}^{n} f(a_i, x)))
where each f(a_i, x) is a decision tree
Works well for < 1000 real-valued features on big datasets
(> 1 million examples)
The problem: we want to use millions of binary features
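A minimal sketch of this prediction form (the stump lambdas and
feature names are hypothetical stand-ins for fitted trees):

import numpy as np

def p_click(x, trees):
    # P(click | x) = 1 / (1 + exp(-sum_i f(a_i, x)))
    score = sum(tree(x) for tree in trees)
    return 1.0 / (1.0 + np.exp(-score))

trees = [lambda x: 0.8 if x["ctr_ad"] > 0.05 else -0.3,
         lambda x: 0.4 if x["ctr_site"] > 0.02 else -0.1]
print(p_click({"ctr_ad": 0.07, "ctr_site": 0.01}, trees))  # sigmoid(0.7) ≈ 0.67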
8. The mixed model is a composition of the
decision trees and the logistic regression,
which are fitted sequentially:
P(click | x) = 1 / (1 + exp(−(∑_{i=1}^{n} f(a_i, x) + ∑_{j=1}^{m} β_j z_j)))
where z_j are the binary features and β_j their coefficients
1. Fit the trees f(a_i, x) by means of the boosting;
2. Fit the coefficients β_j as a logistic regression with
L1-regularization (penalty ∑_{j=1}^{m} |β_j|)
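A sketch of step 2 under the assumption that the fixed tree score
enters as an additive offset; proximal gradient (ISTA) is used here
only for illustration and is not one of the solvers listed later:

import numpy as np

def fit_l1_logreg_with_offset(Z, y, offset, lam=0.01, lr=0.1, n_iter=1000):
    # Minimize mean logistic loss of (offset + Z @ beta) + lam * ||beta||_1,
    # keeping 'offset' (the boosted-tree score) fixed.
    n, m = Z.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(offset + Z @ beta)))
        beta = beta - lr * (Z.T @ (p - y) / n)
        # soft-thresholding is the proximal step for the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# hypothetical tiny example: 4 examples, 3 binary features
Z = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
offset = np.array([0.2, -0.1, 0.3, -0.2])  # tree scores, kept fixed
print(fit_l1_logreg_with_offset(Z, y, offset))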
9. For fitting the composition of decision trees we
used MatrixNet
MatrixNet is a proprietary machine learning
algorithm which is a modification of the Gradient
Boosting Machine (GBM) with stochastic boosting
(Friedman, 2002), (Gulin, 2010) (in Russian)
The training set was randomly sampled from
a one-week log of user search sessions
Training set: 3×10^6 examples,
54 real-valued features
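MatrixNet itself is proprietary; as a rough, publicly available
stand-in, stochastic gradient boosting (Friedman, 2002) can be run
with scikit-learn's GradientBoostingClassifier using subsample < 1
(a sketch on synthetic data, not MatrixNet and not Yandex's setup):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 54))  # 54 real-valued features, synthetic
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10000) > 0).astype(int)

# subsample < 1.0 gives the stochastic variant of gradient boosting
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1, subsample=0.5)
gbm.fit(X, y)
print(gbm.predict_proba(X[:5])[:, 1])

The raw score gbm.decision_function(X) could then play the role of
the fixed tree term in the mixed model above.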
10. 1. Cyclic coordinate descent
Implemented in BBR (Genkin et al., 2007)
http://www.bayesianregression.org/
2. Online learning via truncated gradient
(a sketch of one update follows this list)
Implemented in Vowpal Wabbit (Langford et al., 2009)
https://github.com/JohnLangford/vowpal_wabbit
3. Reducing L1-regularization to L2-regularization
(η-trick)
(Jenatton et al., 2009)
Vowpal Wabbit can be used for solving L2-regularized
logistic regression
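A sketch of a single truncated-gradient update for L1-regularized
logistic regression, assuming truncation is applied at every step
(K = 1 in the notation of Langford et al., 2009); the function and
parameter names are illustrative:

import numpy as np

def truncated_gradient_step(w, x, y, eta=0.1, g=0.01, theta=1.0):
    # gradient step on the logistic loss
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    w = w - eta * (p - y) * x
    # truncate small weights towards zero (the L1-inducing step)
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - eta * g, 0.0)
    return w

w = np.zeros(5)
stream = [(np.array([1., 0., 1., 0., 0.]), 1),
          (np.array([0., 1., 0., 1., 0.]), 0)]
for x, y in stream * 100:
    w = truncated_gradient_step(w, x, y)
print(w)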
11. The datasets were randomly sampled from
a one-week log of user search sessions
Training set: 67×10^6 examples
Test set: 5×10^6 examples
3.4×10^6 unique binary features
Only features with non-zero values in more than
10 training examples were kept
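A sketch of this pruning rule (the data layout, one list of active
feature names per example, is an assumption):

from collections import Counter

def prune_rare_features(examples, min_count=10):
    # keep only features that are non-zero in more than min_count examples
    counts = Counter(f for feats in examples for f in feats)
    kept = {f for f, c in counts.items() if c > min_count}
    return [[f for f in feats if f in kept] for feats in examples]

examples = [["kw:digital", "res:cheap"], ["kw:digital"]] * 20  # hypothetical
print(prune_rare_features(examples)[:2])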
13. We selected the model with 2966 non-zero
features (BBR with regularization parameter 1/100)
14. Words from the residual of a query which
increase the probability of click
(translated to English):
Word β
gold +0.52
necessary +0.32
market +0.23
used +0.20
effective +0.19
15. Words from the residual of a query which
decrease the probability of click
(translated to English):
Word β
vacancy -0.40
review -0.34
site -0.33
size -0.15
which -0.14
16. J. Friedman. Greedy function approximation: A gradient
boosting machine. Technical report, Dept. of Statistics,
Stanford University, 1999.
A. Gulin. MatrixNet. Technical report,
http://www.ashmanov.com/arc/searchconf2010/08gulin-
searchconf2010.ppt, 2010 (in Russian).
A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale
Bayesian Logistic Regression for Text Categorization.
Technometrics, 49(3):291–304, Aug. 2007.
J. Langford, L. Li, and T. Zhang. Sparse Online Learning via
Truncated Gradient. Journal of Machine Learning
Research, 10:777–801, 2009.
R. Jenatton, G. Obozinski, and F. Bach. Structured Sparse
Principal Component Analysis, 2009.