Yandex School of Data Analysis conference, Machine Learning and Very Large Data Sets 2013

Text-based online advertising: L1 Regularization for Feature Selection


Transcript

  • 1. Ilya Trofimov, Yandex, trofim@yandex-team.ru. Yandex School of Data Analysis conference: Machine Learning and Very Large Data Sets, 2013.
  • 2. [Screenshot of a search engine result page annotated with the user's query, the ads, and the organic results.]
  • 3. Advertisers select keywords describing their product or service. An ad is eligible to appear on the search engine result page if the ad's keyword is a subset of the user's query. Example: keyword = "digital camera"; possible queries: "buy digital camera", "cheap digital camera", "digital camera samsung", "digital camera magazine".
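A minimal Python sketch of this matching rule (an editorial illustration, not the production matcher; morphology, stop words, and normalization are ignored):

```python
def keyword_matches(keyword: str, query: str) -> bool:
    """Return True if every word of the keyword occurs in the query."""
    return set(keyword.lower().split()) <= set(query.lower().split())

assert keyword_matches("digital camera", "buy digital camera")
assert keyword_matches("digital camera", "digital camera magazine")
assert not keyword_matches("digital camera", "buy camera")
```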
  • 4. The advertiser is charged each time his ad is clicked by a user. Advertisers report their bids, and ads are selected via the Generalized Second-Price auction. Revenue of Yandex ≈ Σ_i P(click_i) · bid_i. The goal is to estimate P(click|x), where x is the vector of all available input features.
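A toy illustration of the expected-revenue formula above (made-up bids and click probabilities; the Generalized Second-Price pricing itself is not modeled here):

```python
# Each shown ad contributes roughly P(click_i) * bid_i to revenue.
ads = [  # (ad_id, predicted P(click), bid per click)
    ("A", 0.05, 2.0),
    ("B", 0.03, 5.0),
    ("C", 0.10, 1.0),
]

# Rank candidates by expected revenue per impression, P(click) * bid.
ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
expected_revenue = sum(p * bid for _, p, bid in ranked[:2])  # e.g. two ad slots
print(ranked, expected_revenue)
```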
  • 5. The most important input features are the historical click-through rates (CTR). Examples of input features: CTR(ad) = clicks(ad) / views(ad); CTR(web site) = clicks(web site) / views(web site); …; text relevance of the query and the ad's text; user behavior features. There are 54 real-valued features in total.
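A minimal sketch of how counter-based CTR features of this kind could be accumulated (hypothetical helper functions; the real features are aggregated from search-session logs):

```python
from collections import Counter

clicks, views = Counter(), Counter()

def observe(ad_id: str, clicked: bool) -> None:
    """Update the per-ad counters from one logged impression."""
    views[ad_id] += 1
    if clicked:
        clicks[ad_id] += 1

def ctr(ad_id: str) -> float:
    """CTR(ad) = clicks(ad) / views(ad); 0 when the ad has no views yet."""
    return clicks[ad_id] / views[ad_id] if views[ad_id] else 0.0
```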
  • 6. Query: "cheap digital camera". The query splits into the keyword ("digital camera") and the residual of the query ("cheap"). We selected 3.4·10^6 binary text-based features of the form: x_{k,1} = 1 if word_k ∈ keyword, 0 otherwise; x_{k,2} = 1 if word_k ∈ residual of the query, 0 otherwise; x_{k,m} = 1 if word_k ∈ query and word_m ∈ residual of the query, 0 otherwise.
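A small Python sketch of this feature construction, assuming the query is split into the keyword part and the residual (illustration only; feature indexing in the real pipeline is not shown):

```python
def text_features(query: str, keyword: str) -> dict:
    """Binary indicator features for keyword words, residual words, and pairs."""
    q_words = query.lower().split()
    kw_words = set(keyword.lower().split())
    residual = [w for w in q_words if w not in kw_words]

    feats = {}
    for w in kw_words:
        feats[("kw", w)] = 1            # word in the keyword
    for w in residual:
        feats[("res", w)] = 1           # word in the residual of the query
    for w1 in q_words:
        for w2 in residual:
            feats[("pair", w1, w2)] = 1  # query word x residual word
    return feats

print(text_features("cheap digital camera", "digital camera"))
```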
  • 7. The state-of-the-art solution for the click prediction problem is a composition of boosted decision trees: P(click|x) = 1 / (1 + exp(-Σ_{i=1}^{n} f(a_i, x))), where f(a_i, x) is a decision tree. It works well for fewer than 1000 real-valued features on big datasets (more than 10^6 examples). The problem: we want to use millions of binary features.
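MatrixNet itself is proprietary; as a rough stand-in, the sketch below uses scikit-learn's gradient boosting on synthetic data to produce the summed tree score that is passed through the logistic function:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 54))                       # 54 real-valued features (synthetic)
y = (rng.random(10_000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

gbm = GradientBoostingClassifier(n_estimators=100, subsample=0.5)  # stochastic boosting
gbm.fit(X, y)

tree_score = gbm.decision_function(X)                   # sum over the trees f(a_i, x)
p_click = 1 / (1 + np.exp(-tree_score))                 # logistic link, as on the slide
```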
  • 8. The mixed model is a composition of the decision trees and a logistic regression, fitted sequentially: P(click|x) = 1 / (1 + exp(-(Σ_{i=1}^{n} f(a_i, x) + Σ_{j=1}^{m} β_j z_j(x)))), with the L1 penalty λ Σ_{j=1}^{m} |β_j| on the weights of the binary features z_j. 1. Fit the trees f(a_i, x) by means of boosting; 2. Fit the weights β_j as a logistic regression with L1 regularization, keeping the tree part fixed.
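A sketch of step 2 under stated assumptions: the summed tree score from step 1 is held fixed and fed, together with the sparse binary features, into an L1-regularized logistic regression (scikit-learn on synthetic data is used here as a stand-in for BBR / Vowpal Wabbit):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix, random as sparse_random
from sklearn.linear_model import LogisticRegression

n = 10_000
tree_score = np.random.default_rng(1).normal(size=(n, 1))   # fixed tree sum from step 1
Z = sparse_random(n, 5_000, density=0.001, format="csr")     # binary text features (synthetic)
Z.data[:] = 1.0
y = (np.random.default_rng(2).random(n) < 1 / (1 + np.exp(-tree_score[:, 0]))).astype(int)

X = hstack([csr_matrix(tree_score), Z], format="csr")
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # L1 -> sparse weights
clf.fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```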
  • 9. For fitting the composition of decision trees we used MatrixNet. MatrixNet is a proprietary machine learning algorithm, a modification of the Gradient Boosting Machine (GBM) with stochastic boosting (Friedman, 2002; Gulin, 2010, in Russian). The training set was randomly sampled from a one-week log of user search sessions: 3·10^6 examples, 54 real-valued features.
  • 10. Solvers for L1-regularized logistic regression: 1. Cyclic coordinate descent, implemented in BBR (Genkin et al., 2007), http://www.bayesianregression.org/ ; 2. Online learning via truncated gradient, implemented in Vowpal Wabbit (Langford et al., 2009), https://github.com/JohnLangford/vowpal_wabbit ; 3. Reducing L1 regularization to L2 regularization (the η-trick) (Jenatton et al., 2009); Vowpal Wabbit can then be used for solving the resulting L2-regularized logistic regressions.
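A simplified sketch of the truncated-gradient idea from option 2 (the θ = ∞ case, where truncation reduces to soft-thresholding applied after every SGD step; this is an illustration, not the Vowpal Wabbit implementation):

```python
import numpy as np

def truncate(w: np.ndarray, gravity: float) -> np.ndarray:
    """Shrink each weight toward zero by `gravity`, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - gravity, 0.0)

def sgd_l1_step(w, x, y, lr, l1):
    """One logistic-loss SGD step followed by truncation (keeps weights sparse)."""
    p = 1 / (1 + np.exp(-x @ w))
    w = w - lr * (p - y) * x          # gradient of the logistic loss, y in {0, 1}
    return truncate(w, lr * l1)       # L1 shrinkage via truncation
```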
  • 11. The datasets were randomly sampled from a one-week log of user search sessions. Training set: 67·10^6 examples; test set: 5·10^6 examples; 3.4·10^6 unique binary features. Only features that occurred (had non-zero values) in more than 10 training examples were kept.
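The frequency filter in the last point can be sketched as follows (hypothetical helper operating on per-example lists of feature names):

```python
from collections import Counter

def frequent_features(feature_lists, min_count=10):
    """Keep only features that occur in more than `min_count` training examples."""
    counts = Counter(f for feats in feature_lists for f in feats)
    return {f for f, c in counts.items() if c > min_count}
```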
  • 12. [Plot: ΔauPRC, % versus the number of non-zero coefficients, comparing BBR (L1), VW batch L-BFGS (L2), VW L1 (1 epoch), VW L1 (8 epochs), and the eta-trick.]
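The quality metric in the plot is the area under the precision-recall curve; a minimal way to compute auPRC for one model with scikit-learn (average precision used as the auPRC estimate, toy data):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # predicted P(click)

print("auPRC (average precision):", average_precision_score(y_true, y_score))
```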
  • 13. We selected the model with 2966 non-zero features (BBR with λ1 = 100).
  • 14. Words from the residual of a query which increase the probability of a click (translated to English):
        Word        β
        gold        +0.52
        necessary   +0.32
        market      +0.23
        used        +0.20
        effective   +0.19
  • 15. Words from the residual of a query which decrease the probability of a click (translated to English):
        Word        β
        vacancy     -0.40
        review      -0.34
        site        -0.33
        size        -0.15
        which       -0.14
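Tables like the two above can be read off a fitted sparse model by sorting its non-zero weights; a small sketch with hypothetical variable names:

```python
import numpy as np

def top_words(coef: np.ndarray, feature_names: list, k: int = 5):
    """Return the k most positive and k most negative (word, weight) pairs."""
    order = np.argsort(coef)
    positive = [(feature_names[i], coef[i]) for i in order[::-1][:k] if coef[i] > 0]
    negative = [(feature_names[i], coef[i]) for i in order[:k] if coef[i] < 0]
    return positive, negative
```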
  • 16. References:
        J. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Dept. of Statistics, Stanford University, 1999.
        A. Gulin. MatrixNet. Technical report, http://www.ashmanov.com/arc/searchconf2010/08gulin-searchconf2010.ppt, 2010 (in Russian).
        A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics, 49(3):291–304, 2007.
        J. Langford, L. Li, and T. Zhang. Sparse Online Learning via Truncated Gradient. Journal of Machine Learning Research, 10:777–801, 2009.
        R. Jenatton, G. Obozinski, and F. Bach. Structured Sparse Principal Component Analysis, 2009.