# Internet Mathematics 2011

Published on Dec 28, 2011

## Presentation Transcript

• Yandex Relevance Prediction Challenge: overview of the "CLL" team's solution. R. Gareev¹, D. Kalyanov², A. Shaykhutdinova¹, N. Zhiltsov¹ (¹ Kazan (Volga Region) Federal University, ² 10tracks.ru). 28 December 2011
• Outline:
  1. Problem Statement
  2. Features
  3. Feature Extraction
  4. Statistical analysis
  5. Contest Results
  6. Appendix A. References
  7. Appendix B. R functions
### 1. Problem Statement
• Problem statement: predict document relevance from user behavior, a.k.a. implicit relevance feedback. See http://imat-relpred.yandex.ru/en for more details.
• User session example: in some region, the user issues query Q1 at T = 0 and gets documents 1-5; they click document 3 at T = 10, document 5 at T = 35, and document 1 at T = 100. At T = 130 they issue query Q2, get documents 6-10, and click document 6 at T = 150 and document 9 at T = 170.
• Labeled data: relevance judgements are given for some pairs of documents and queries: either a document Dj is relevant for a query Qi from a region R, or it is not relevant.
• The problem: given a set Q of search queries, for each pair (q, R) ∈ Q provide a list of documents D1, ..., Dm sorted by relevance to q in the region R. Area Under the ROC Curve (AUC), averaged over all test query-region pairs, is the target evaluation metric.
• AUC score: consider a ranked list of documents D1, ..., Di, ..., Dm. Each prefix of length i gives a single point (FPR(i), TPR(i)) on the ROC curve, and AUC is the area under that curve. Equivalently, AUC is the probability that a randomly chosen relevant document comes before a randomly chosen non-relevant one.
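A minimal sketch of this pairwise reading of AUC in R, with made-up scores and labels (not contest data):

```r
# AUC as P(a random relevant document is scored above a random non-relevant one).
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)   # classifier certainty scores
labels <- c(1, 0, 1, 1, 0, 0)               # 1 = relevant, 0 = non-relevant
pos <- scores[labels == 1]
neg <- scores[labels == 0]
grid <- expand.grid(pos = pos, neg = neg)   # all (relevant, non-relevant) pairs
# Ties count as one half, as in the usual rank-based estimate.
auc <- mean(ifelse(grid$pos > grid$neg, 1,
                   ifelse(grid$pos == grid$neg, 0.5, 0)))
auc   # 7/9 here; agrees with ROCR's performance(pred, "auc") on the same data
```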
• Our restatement of the problem: we consider it a machine learning task. Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether document D is relevant to a query Q from a region R. Replace RegionID, QueryID and DocumentID with related features extracted from the click log, then use H(R, Q, D) to compute a list sorted by the classifier's certainty scores for each query Q from region R.
### 2. Features
• Features: a feature is a function of (Q, R, D); each feature may or may not be associated with its related region. Three types: document features, query features, and time-concerned features.
• Document features (a sketch of features 1-3 follows this list):
  1. (Q, D) → number of occurrences of the URL in the SERP list
  2. (Q, D) → number of clicks
  3. (Q, D) → click-through rate
  4. (Q, D) → average position in the click sequence
  5. (Q, D) → average rank in the SERP list
  6. (Q, D) → average rank in the SERP list when the URL is clicked
  7. (Q, D) → probability of being last clicked
  8. (Q, D) → probability of being first clicked
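A hedged sketch of how features 1-3 might be computed; the toy impression table and its column names are illustrative, not the contest log schema:

```r
# One row per (QueryID, URLID) occurrence in a SERP, with a 0/1 Clicked flag.
impressions <- data.frame(
  QueryID = c(174, 174, 174, 1974, 1974),
  URLID   = c(1625, 2510, 1625, 17562, 1627),
  Clicked = c(1, 1, 0, 1, 1)
)
# Feature 1: number of SERP occurrences; feature 2: number of clicks.
shows  <- aggregate(Clicked ~ QueryID + URLID, data = impressions, FUN = length)
clicks <- aggregate(Clicked ~ QueryID + URLID, data = impressions, FUN = sum)
docFeatures <- merge(shows, clicks, by = c("QueryID", "URLID"),
                     suffixes = c(".shows", ".clicks"))
# Feature 3: click-through rate = clicks / occurrences.
docFeatures$CTR <- docFeatures$Clicked.clicks / docFeatures$Clicked.shows
```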
• Query features (a sketch of feature 2 follows this list):
  1. (Q) → average number of clicks in a subsession
  2. (Q) → probability of being rewritten (i.e. of not being the last query in the session)
  3. (Q) → probability of being resolved (probability of its results being clicked last)
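A possible computation of feature 2, assuming a table of query occurrences ordered by time within each session (toy data, illustrative schema):

```r
queries <- data.frame(
  SessionID = c(1, 1, 2, 3, 3),
  QueryID   = c(5, 558, 5, 5, 8768)
)
# A query occurrence is 'rewritten' unless it is the last query of its session.
lastInSession <- !duplicated(queries$SessionID, fromLast = TRUE)
queries$Rewritten <- as.integer(!lastInSession)
# P(rewritten) per query: share of occurrences followed by another query.
aggregate(Rewritten ~ QueryID, data = queries, FUN = mean)  # e.g. 2/3 for QueryID 5
```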
• Time-concerned features (a sketch of feature 2 follows this list):
  1. (Q) → average time to first click
  2. (Q, D) → average time spent reading a document D
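A sketch of feature 2 on the user session example above, approximating reading time as the gap between a click and the next logged action (the final click's dwell time is unknown):

```r
# Events of the Q1 subsession from the user-session example (T = 0..100).
sessionLog <- data.frame(
  Time   = c(0, 10, 35, 100),
  Action = c("Q", "C", "C", "C"),
  URLID  = c(NA, 3, 5, 1)
)
clickRows <- which(sessionLog$Action == "C")
# Dwell time = time until the next action; NA for the session's final click.
dwell <- c(diff(sessionLog$Time), NA)[clickRows]
data.frame(URLID = sessionLog$URLID[clickRows], DwellTime = dwell)
# URL 3 -> 25, URL 5 -> 65, URL 1 -> NA
```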
### 3. Feature Extraction
• Two-phase extraction:
  1. Normalization: lookup filtering by the 'Important triples' set; normalization is specific to each feature.
  2. Grouping and aggregating.
• Important triples
• Normalization: convert click-log entries to a relational table with the following attributes:
  • feature domain attributes, e.g. (Q, R, U) or (Q, U) for document features, (Q, R) or (Q) for query features
  • the feature attribute value
Processing is sequential, session by session: spam sessions are rejected, and values (possibly repeated) are emitted.
• Normalization example (I). Click log (SessionID and TimePassed omitted; for Q rows the ID is a QueryID, for C rows it is the clicked URLID):

| Action | ID | RegionID | URLs |
|---|---|---|---|
| Q | 174 | 0 | 1625 1627 1623 2510 2524 |
| Q | 1974 | 0 | 2091 17562 1626 1623 1627 |
| C | 17562 | | |
| C | 1627 | | |
| C | 1625 | | |
| C | 2510 | | |

Intermediate table for the 'Average click position' feature:

| QueryID | URLID | RegionID | ClickPosition |
|---|---|---|---|
| 1974 | 17562 | 0 | 1 |
| 1974 | 1627 | 0 | 2 |
| 174 | 1625 | 0 | 1 |
| 174 | 2510 | 0 | 2 |
• Normalization example (II). Click log (SessionID omitted; for Q rows the ID is a QueryID, for C rows it is the clicked URLID):

| Time | Action | ID | RegionID | URLs |
|---|---|---|---|---|
| 0 | Q | 5 | 0 | 99 16 87 39 |
| 6 | C | 84 | | |
| 120 | Q | 558 | 0 | 84 5043 5041 5039 |
| 125 | Q | 8768 | 0 | 74672 74661 74674 74671 |
| 145 | C | 74661 | | |

Intermediate table for the 'Time to first click' feature:

| QueryID | RegionID | FirstClickTime |
|---|---|---|
| 5 | 0 | 6 |
| 8768 | 0 | 20 |
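The normalization pass for 'Time to first click' can be replayed on the click log above. This is an illustrative sketch (single session, simplified schema), not the team's actual extractor:

```r
clickLog <- data.frame(
  Time     = c(0, 6, 120, 125, 145),
  Action   = c("Q", "C", "Q", "Q", "C"),
  QueryID  = c(5, NA, 558, 8768, NA),
  RegionID = c(0, NA, 0, 0, NA)
)
rows <- list()
for (i in which(clickLog$Action == "C")) {
  # Attribute each click to the most recent preceding query.
  q <- max(which(clickLog$Action == "Q" & seq_len(nrow(clickLog)) < i))
  rows[[length(rows) + 1]] <- data.frame(
    QueryID        = clickLog$QueryID[q],
    RegionID       = clickLog$RegionID[q],
    FirstClickTime = clickLog$Time[i] - clickLog$Time[q]
  )
}
clicks <- do.call(rbind, rows)
# Keep the earliest click per (QueryID, RegionID): reproduces the table above.
aggregate(FirstClickTime ~ QueryID + RegionID, data = clicks, FUN = min)
```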
• Aggregation example (by triple)
• Aggregation example (by QU-pair)
### 4. Statistical analysis
• Our final ML-based solution in a nutshell:
  • a binary classification task for predicting assessors' labels
  • 26 features extracted from the click log
  • a Gradient Boosted Trees learning model (the gbm R package)
  • model parameters tuned w.r.t. AUC averaged over the given query-region pairs
  • URLs ranked according to the best model's probability scores
• Training data (slides highlight the target values, feature values, and missing values)
• Data analysis scheme:
  1. Start from the given initial training and test sets.
  2. Partition the initial training set into two sets: a training set (3/4) and a test set (1/4).
  3. Consider the following models: Gradient Boosted Trees (Bernoulli distribution; 0-1 loss function), Gradient Boosted Trees (AdaBoost distribution; exponential loss function), and logistic regression.
  4. Learn and tune parameters w.r.t. the target metric (Area under the ROC curve) on the training set using 3-fold cross-validation.
  5. Obtain estimates of the target metric on the test set.
  6. Choose the optimal model, refit it on the whole initial training set, and apply it to the initial test set.
• Boosting [Schapire, 1990]. Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$ with $y_i \in \{-1, +1\}$, for $t = 1, \ldots, T$:
  • construct a distribution $D_t$ on $\{1, \ldots, N\}$
  • sample examples from it, concentrating on the "hardest" ones
  • learn a "weak classifier" (at least better than random) $h_t : X \to \{-1, +1\}$ with error $\epsilon_t$ on $D_t$: $\epsilon_t = P_{i \sim D_t}(h_t(x_i) \neq y_i)$
Output the final classifier $H$ as a weighted majority vote of the $h_t$.
• AdaBoost [Freund & Schapire, 1997]. Constructing $D_t$: start with $D_1(i) = \frac{1}{N}$; given $D_t$ and $h_t$,
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}$$
where $Z_t$ is a normalization factor and
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} > 0.$$
Final classifier: $H(x) = \operatorname{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
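One worked weight update may help; assuming a weak learner with error $\epsilon_t = 0.3$ (a made-up value):
$$\epsilon_t = 0.3 \quad\Rightarrow\quad \alpha_t = \frac{1}{2} \ln \frac{1 - 0.3}{0.3} = \frac{1}{2} \ln \frac{7}{3} \approx 0.424,$$
so correctly classified examples are downweighted by $e^{-0.424} \approx 0.65$ and misclassified ones are upweighted by $e^{0.424} \approx 1.53$, before renormalizing by $Z_t$.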
• Gradient boosted trees [Friedman, 2001]: stochastic gradient descent optimization of the loss function, with a decision-tree model as the weak classifier. The method does not require feature normalization, and there is no need to handle missing values specially. Good performance has been reported in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010], [Gulin et al., 2011].
• Gradient boosted trees, gbm R package implementation: two distributions are available for classification tasks, Bernoulli and AdaBoost. Three basic parameters: interaction depth (depth of each tree), number of trees (or iterations), and shrinkage (learning rate).
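A minimal fitting sketch with these three parameters, using the Bernoulli settings that win the comparison later in the deck; the toy data frame and its columns (`CTR`, `AvgClickPos`, `RelevanceLabel`) are illustrative, not the contest data:

```r
library(gbm)
set.seed(1)
# Toy training frame: two fake features and a 0/1 relevance label.
trainData <- data.frame(CTR = runif(200), AvgClickPos = runif(200),
                        RelevanceLabel = rbinom(200, 1, 0.5))
fit <- gbm(RelevanceLabel ~ ., data = trainData,
           distribution = "bernoulli",   # or "adaboost"
           interaction.depth = 2,        # depth of each tree
           n.trees = 500,                # number of boosting iterations
           shrinkage = 0.01)             # learning rate
# Probability scores, later used to rank URLs per query-region pair.
scores <- predict(fit, newdata = trainData, n.trees = 500, type = "response")
```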
• Logistic regression (glm, stats R package): preprocess the initial training data by imputing missing values with the help of bagged trees, then fit the generalized linear model
$$f(x) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$
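A hedged sketch of this baseline, with caret's `preProcess(method = "bagImpute")` standing in for the bagged-tree imputation (toy data; the ipred package must be installed for bagImpute):

```r
library(caret)
set.seed(1)
x <- data.frame(x1 = runif(100), x2 = runif(100))
x$x1[sample(100, 10)] <- NA                 # inject some missing values
y <- rbinom(100, 1, 0.5)
pp <- preProcess(x, method = "bagImpute")   # one bagged-tree model per column
xImputed <- predict(pp, x)
# Generalized linear model with the logistic link from the formula above.
fit <- glm(y ~ ., data = cbind(xImputed, y = y), family = binomial)
probs <- predict(fit, type = "response")    # f(x) = 1 / (1 + exp(-z))
```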
• Tuning the gbm Bernoulli model: the 3-fold CV estimate of AUC for the optimal parameters is 0.6457435.
• Tuning the gbm AdaBoost model: the 3-fold CV estimate of AUC for the optimal parameters is 0.6455384.
• Comparative performance of the three optimal models (test error estimates):

| Model | Optimal parameter values | Test estimate of AUC |
|---|---|---|
| gbm (Bernoulli) | interaction.depth=2, n.trees=500, shrinkage=0.01 | 0.6324717 |
| gbm (AdaBoost) | interaction.depth=4, n.trees=700, shrinkage=0.01 | 0.6313393 |
| logistic regression | - | 0.618648 |
• Variable importance according to the best model
### 5. Contest Results
• Contest statistics: 101 participants, 84 of them eligible for a prize. A two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest). The validation set has ≈ 11,000 instances, the test set ≈ 20,000 instances.
• Preliminary results (validation set): 19th place (AUC = 0.650004).
• Final results (test set): 34th place (AUC = 0.643346).

| # | Team | AUC |
|---|---|---|
| 1 | cointegral* | 0.667362 |
| 2 | Evlampiy* | 0.66506 |
| 3 | alsafr* | 0.664527 |
| 4 | alexeigor* | 0.663169 |
| 5 | keinorhasen | 0.660982 |
| 6 | mmp | 0.659914 |
| 7 | Cutter* | 0.659452 |
| 8 | S-n-D | 0.658103 |
| ... | ... | ... |
| 34 | CLL | 0.643346 |
| ... | ... | ... |
• Acknowledgements. We would like to thank the organizers from Yandex for an exciting challenge, and E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support.
### Appendix A. References
• References I
  • [Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, Vol. 55, No. 1, 1997, pp. 119-139.
  • [Friedman, 2001] Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics, Vol. 29, No. 5, 2001, pp. 1189-1232.
  • [Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning the Transfer Learning Track of Yahoo!'s Learning to Rank Challenge with YetiRank. JMLR: Workshop and Conference Proceedings, 2011, pp. 63-76.
  • [Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: user behavior as a predictor of a successful search. Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2010, pp. 221-230.
• References II
  • [Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining user web search activity with layered Bayesian networks, or how to capture a click in its context. Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, 2009, pp. 162-171.
  • [Schapire, 1990] Schapire, R. The strength of weak learnability. Machine Learning, Vol. 5, No. 2, 1990, pp. 197-227.
### Appendix B. R functions
• Compute AUC for a gbm model:

```r
ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)
  # Feature columns only: drop the identifiers and the label.
  pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)
  # Per-(query, region) AUC values, combined into one vector.
  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    group <- testSet$QueryID == queryId & testSet$RegionID == regionId
    true.labels <- testSet[group, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 | m == 1) {
      # AUC is undefined when all labels in the group coincide.
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit, pureTestSet[group, ],
                                     n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  # Average over query-region pairs, skipping the undefined groups.
  return(mean(aucValues, na.rm = TRUE))
}
```
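A hypothetical call, assuming a fitted model `gbmFit` and a held-out `testSet` with QueryID, RegionID, URLID and RelevanceLabel columns plus the feature columns:

```r
# Average per-(query, region) AUC of a 500-tree ensemble on the held-out set.
ComputeAUC(gbmFit, ntrees = 500, testSet = testSet)
```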
• Tuning AUC for a gbm model:

```r
TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)
  FUN <- match.fun(aucfunction)
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)
  # Folds stratified by QueryID; returnTrain = TRUE yields training indices.
  folds <- createFolds(trainSet$QueryID, foldsNum, list = TRUE, returnTrain = TRUE)
  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData <- trainSet[-inTrain, ]
    pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))
    # Fit once with the maximum number of trees; shorter ensembles are
    # evaluated as prefixes via the n.trees argument of predict.gbm.
    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution, interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)
    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }
  aucvalues <- as.data.frame(aucvalues)
  # Average the CV AUC over folds for each ensemble size.
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
  return(avgAuc)
}
```
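A hypothetical call, assuming a prepared `trainSet` in the same layout:

```r
# 3-fold CV curve of AUC over ensemble sizes 100, 200, ..., 1500.
cvCurve <- TuningGbmFit(trainSet, foldsNum = 3, distribution = "bernoulli")
cvCurve[which.max(cvCurve$AvgAUC), ]   # best number of trees and its average AUC
```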