Text-based online advertising: L1 Regularization for Feature Selection


- 1. Ilya Trofimov, Yandex (trofim@yandex-team.ru). Yandex School of Data Analysis conference "Machine Learning and Very Large Data Sets", 2013.
- 2. [Figure: a search engine results page, with the user's query, ads, and organic results marked.]
- 3. Advertisers select keywords describing their product or service. An ad is eligible to appear on the search engine results page if the ad's keyword is a subset of the user's query. Example: keyword = "digital camera"; possible queries: "buy digital camera", "cheap digital camera", "digital camera samsung", "digital camera magazine".
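The eligibility rule on this slide can be sketched in a few lines. This is an illustration of the word-subset test only (function names are mine, not Yandex's matching code), assuming whitespace tokenization:

```python
# Illustrative sketch of the keyword-matching rule: an ad is eligible
# if every word of its keyword appears among the query's words.
def is_eligible(keyword: str, query: str) -> bool:
    """True if the keyword's word set is a subset of the query's word set."""
    return set(keyword.lower().split()) <= set(query.lower().split())

# The example from the slide: keyword "digital camera".
assert is_eligible("digital camera", "buy digital camera")
assert is_eligible("digital camera", "digital camera samsung")
assert not is_eligible("digital camera", "buy camera")
```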
- 4. An advertiser is charged each time their ad is clicked by a user. Advertisers report their bids, and ads are selected via the Generalized Second-Price auction. Revenue of Yandex ≈ Σ_i P(click_i) · bid_i. The goal is to find P(click|x), where x is the vector of all available input features.
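The revenue approximation on this slide sums bid · P(click) over the shown ads, so ads are naturally ranked by that product. A minimal sketch (my own illustration, not the actual auction logic, which involves second-price payments):

```python
# Each ad is a (bid, p_click) pair; expected revenue per impression is
# bid * p_click, the per-ad term in the slide's revenue formula.
def expected_revenue(ads):
    """Total expected revenue of a set of shown ads."""
    return sum(bid * p for bid, p in ads)

def rank_ads(ads):
    """Rank ads by expected revenue per impression, highest first."""
    return sorted(ads, key=lambda a: a[0] * a[1], reverse=True)

ads = [(1.0, 0.05), (0.5, 0.20), (2.0, 0.01)]
assert rank_ads(ads)[0] == (0.5, 0.20)          # 0.10 beats 0.05 and 0.02
assert abs(expected_revenue(ads) - 0.17) < 1e-9
```

Note that a high bid with a poorly predicted P(click) loses to a cheaper, more clickable ad, which is why the quality of P(click|x) matters for revenue.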
- 5. The most important input features are the historical click-through rates (CTR). Examples of input features: CTR(ad) = clicks(ad) / views(ad); CTR(web site) = clicks(web site) / views(web site); ...; text relevance of the query and the ad's text; user behavior features. There are 54 real-valued features in total.
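The CTR features on this slide are plain click/view ratios. A minimal sketch; the pseudo-count smoothing is my own addition (a common safeguard for ads with few views, not stated on the slide):

```python
# Historical CTR as on the slide: clicks / views. The optional alpha/beta
# pseudo-counts (an assumption of mine) keep rare ads from getting CTR 0 or 1.
def ctr(clicks: int, views: int, alpha: float = 0.0, beta: float = 0.0) -> float:
    """Smoothed click-through rate; 0.0 when there is no data at all."""
    denom = views + beta
    return (clicks + alpha) / denom if denom > 0 else 0.0

assert ctr(5, 100) == 0.05
assert ctr(0, 0) == 0.0
assert abs(ctr(0, 0, alpha=1, beta=20) - 0.05) < 1e-12  # prior-only estimate
```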
- 6. Query: "cheap digital camera". We selected 3.4·10^6 binary text-based features:
  x_k^(1) = 1 if word_k ∈ keyword, 0 otherwise;
  x_k^(2) = 1 if word_k ∈ residual of the query, 0 otherwise;
  x_km = 1 if word_k ∈ query & word_m ∈ residual of the query, 0 otherwise;
  where the residual of the query is the part of the query not covered by the keyword.
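A sketch of the binary text-feature families described on this slide (function names and the feature-key encoding are mine): unigram indicators over keyword words, unigram indicators over residual-of-query words, and pair indicators over (query word, residual word):

```python
# Binary text features for a (query, keyword) pair, following the slide:
# "kw" marks keyword words, "res" marks residual-of-query words, and "pair"
# marks (query word, residual word) combinations. Keys are illustrative.
def binary_features(query: str, keyword: str) -> set:
    q, kw = set(query.split()), set(keyword.split())
    res = q - kw                      # residual of the query
    feats = {("kw", w) for w in kw}
    feats |= {("res", w) for w in res}
    feats |= {("pair", a, b) for a in q for b in res}
    return feats

f = binary_features("cheap digital camera", "digital camera")
assert ("res", "cheap") in f          # "cheap" is the residual here
assert ("pair", "digital", "cheap") in f
```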
- 7. The state-of-the-art solution for the click prediction problem is a composition of boosted decision trees, where f(a_i, x) is a decision tree:
  P(click|x) = 1 / (1 + exp(-Σ_{i=1..n} f(a_i, x))).
  This works well for < 1000 real-valued features on big datasets (> 1 million examples). The problem: we want to use millions of binary features.
- 8. The mixed model is a composition of the decision trees and the logistic regression, which are fitted sequentially: 1. Fit the trees f(a_i, x) by means of boosting; 2. Fit β as a logistic regression with L1 regularization (penalty λ Σ_{j=1..m} |β_j|):
  P(click|x, z) = 1 / (1 + exp(-Σ_{i=1..n} f(a_i, x) - Σ_{j=1..m} β_j z_j)).
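A toy sketch of the second fitting stage: the boosted-tree score enters the logit as a frozen offset, and sparse weights β for the binary features are fit with L1 regularization. The slide's solvers are BBR, Vowpal Wabbit, and the η-trick; here I substitute plain proximal gradient descent (soft-thresholding) purely for illustration, on synthetic data:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_l1_logistic(Z, y, offset, lam=0.02, lr=0.5, steps=500):
    """Fit beta for P(click) = sigmoid(offset + Z @ beta) with L1 penalty lam.
    Z: (n, m) binary features; offset: frozen tree score per example.
    Proximal gradient: a gradient step on the logistic loss, then
    soft-thresholding, which drives unhelpful coefficients exactly to zero."""
    beta = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = sigmoid(offset + Z @ beta)
        beta -= lr * (Z.T @ (p - y)) / len(y)
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

# Synthetic data: only feature 0 actually influences clicks.
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(200, 5)).astype(float)
y = (sigmoid(2.0 * Z[:, 0] - 1.0) > rng.random(200)).astype(float)
beta = fit_l1_logistic(Z, y, offset=np.zeros(200))
assert beta[0] > 0                    # the informative feature survives
assert beta[0] > abs(beta[1])         # noise features are shrunk harder
```

The fixed offset is the key design point: the trees capture the 54 dense real-valued features, and the L1 stage only has to learn sparse corrections from the millions of binary text features.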
- 9. For fitting the composition of decision trees we used MatrixNet, a proprietary machine learning algorithm which is a modification of the Gradient Boosting Machine (GBM) with stochastic boosting (Friedman, 2002), (Gulin, 2010, in Russian). The training set was randomly sampled from a one-week log of user search sessions. Training set: 3·10^6 examples, 54 real-valued features.
- 10. 1. Cyclic coordinate descent, implemented in BBR (Genkin et al., 2007), http://www.bayesianregression.org/; 2. Online learning via truncated gradient, implemented in Vowpal Wabbit (Langford et al., 2009), https://github.com/JohnLangford/vowpal_wabbit; 3. Reducing L1 regularization to L2 regularization (the η-trick) (Jenatton et al., 2009). Vowpal Wabbit can also be used for solving L2-regularized logistic regression.
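The second solver above, truncated gradient, can be sketched in a few lines: each SGD step on the logistic loss is followed by a truncation that shrinks small weights toward zero, which yields sparsity online. A minimal single-truncation (K = 1) sketch with illustrative hyperparameters, not Vowpal Wabbit's actual implementation:

```python
import math

def truncate(w, amount, theta):
    """Shrink w toward zero by `amount`, but only while |w| <= theta."""
    if 0.0 <= w <= theta:
        return max(0.0, w - amount)
    if -theta <= w < 0.0:
        return min(0.0, w + amount)
    return w

def online_step(w, x, y, eta=0.1, g=0.01, theta=1.0):
    """One SGD step on logistic loss for sparse example x ({index: value}),
    followed by truncation; eta*g plays the role of the L1 strength."""
    p = 1.0 / (1.0 + math.exp(-sum(w.get(i, 0.0) * v for i, v in x.items())))
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - eta * (p - y) * v
        w[i] = truncate(w[i], eta * g, theta)
    return w

w = {}
for _ in range(100):
    w = online_step(w, {0: 1.0, 1: 1.0}, 1.0)  # features 0,1 co-occur with clicks
    w = online_step(w, {1: 1.0, 2: 1.0}, 0.0)  # features 1,2 co-occur with non-clicks
assert w[0] > 0 and w[2] < 0   # informative weights keep their signs
```

Because the update only touches features present in the example, this scales to the millions of binary features the slides describe.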
- 11. The datasets were randomly sampled from a one-week log of user search sessions. Training set: 67·10^6 examples; test set: 5·10^6 examples; 3.4·10^6 unique binary features. Only features with non-zero values in > 10 training examples were kept.
- 12. [Figure: ΔauPRC, % vs. number of non-zero coefficients, comparing BBR L1, VW batch L-BFGS L2, VW L1 (1 epoch), VW L1 (8 epochs), and the η-trick.]
- 13. We selected the model with 2966 non-zero features: BBR with λ_1 = 100.
- 14. Words from the residual of a query which increase the probability of a click (translated to English):
  Word        β
  gold        +0.52
  necessary   +0.32
  market      +0.23
  used        +0.20
  effective   +0.19
- 15. Words from the residual of a query which decrease the probability of a click (translated to English):
  Word        β
  vacancy     -0.40
  review      -0.34
  site        -0.33
  size        -0.15
  which       -0.14
- 16. References:
  J. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Technical report, Dept. of Statistics, Stanford University, 1999.
  A. Gulin. MatrixNet. Technical report, http://www.ashmanov.com/arc/searchconf2010/08gulin-searchconf2010.ppt, 2010 (in Russian).
  A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics, 49(3):291–304, 2007.
  J. Langford, L. Li, and T. Zhang. Sparse Online Learning via Truncated Gradient. Journal of Machine Learning Research, 10:777–801, 2009.
  R. Jenatton, G. Obozinski, and F. Bach. Structured Sparse Principal Component Analysis, 2009.
