Internet Mathematics 2011: Presentation Transcript

  • Yandex Relevance Prediction Challenge: overview of the “CLL” team’s solution. R. Gareev¹, D. Kalyanov², A. Shaykhutdinova¹, N. Zhiltsov¹. ¹ Kazan (Volga Region) Federal University, ² 10tracks.ru. 28 December 2011
  • Outline: 1. Problem Statement; 2. Features; 3. Feature Extraction; 4. Statistical Analysis; 5. Contest Results; 6. Appendix A. References; 7. Appendix B. R Functions
  • Problem statement: predict document relevance from user behavior, a.k.a. implicit relevance feedback. See http://imat-relpred.yandex.ru/en for more details.
  • User session example (figure): within one region, query Q1 is issued at T = 0 and returns URLs 1 2 3 4 5; the user clicks URL 3 at T = 10, URL 5 at T = 35, and URL 1 at T = 100. Query Q2 is issued at T = 130 and returns URLs 6 7 8 9 10; the user clicks URL 6 at T = 150 and URL 9 at T = 170.
  • Labeled data: judgements are given for some query-document pairs: either a document Dj is relevant for a query Qi from a region R, or a document Dj is not relevant for a query Qi from a region R.
  • The problem: given a set Q of search queries, for each pair (q, R) ∈ Q provide a list of documents D1, …, Dm sorted by relevance to q in the region R. The target evaluation metric is the Area Under the ROC Curve (AUC) averaged over all test query-region pairs.
  • AUC score: consider a ranked list of documents D1, …, Di, …, Dm. Each prefix of length i yields a single point (FPR(i), TPR(i)) on the ROC curve, and AUC is the area under that curve. Equivalently, AUC is the probability that a randomly chosen relevant document is ranked before a randomly chosen non-relevant document.
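To make the metric concrete, here is a minimal R sketch of an AUC computation with the ROCR package (the same package the appendix code uses); the scores and labels are made up for illustration.

    library(ROCR)
    scores <- c(0.9, 0.7, 0.4, 0.2)   # hypothetical classifier certainty scores
    labels <- c(1, 1, 0, 0)           # hypothetical relevance judgements
    pred <- prediction(scores, labels)
    performance(pred, "auc")@y.values[[1]]  # 1: every relevant doc outranks every non-relevant one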
  • Our restatement of the problem: we treat it as a machine learning task. Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether document D is relevant to a query Q from a region R. Replace RegionID, QueryID, and DocumentID with related features extracted from the click log. Then use H(R, Q, D) to produce a list sorted by the classifier's certainty scores for a query Q from a region R.
  • Features: a feature is a function of (Q, R, D); each feature may or may not be tied to the query's region. Three types: document features, query features, and time-concerned features.
  • Document features:
    1. (Q, D) → number of occurrences of a URL in the SERP list
    2. (Q, D) → number of clicks
    3. (Q, D) → click-through rate
    4. (Q, D) → average position in the click sequence
    5. (Q, D) → average rank in the SERP list
    6. (Q, D) → average rank in the SERP list when the URL is clicked
    7. (Q, D) → probability of being last clicked
    8. (Q, D) → probability of being first clicked
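As a hedged illustration of how a feature like click-through rate (feature 3) could be computed, here is a sketch over a hypothetical per-impression data frame; the column names are assumptions, not the team's actual schema.

    serp <- data.frame(
      QueryID = c(1, 1, 1, 2, 2),
      URLID   = c(10, 10, 11, 10, 12),
      clicked = c(1, 0, 1, 0, 0)     # 1 if this impression of the URL was clicked
    )
    ctr <- aggregate(clicked ~ QueryID + URLID, data = serp, FUN = mean)  # clicks / shows
    names(ctr)[3] <- "CTR"
    ctr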
  • User session example (the same session figure as above, revisited to illustrate the document features)
  • Query features:
    1. (Q) → average number of clicks in a subsession
    2. (Q) → probability of being rewritten (i.e., of not being the last query in the session)
    3. (Q) → probability of being resolved (i.e., of its results being last clicked)
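For instance, feature 2 (probability of being rewritten) could be estimated as the fraction of sessions in which the query is not the session's last query. A sketch on toy data, with assumed column names:

    qlog <- data.frame(                 # one row per issued query, in session order
      SessionID = c(1, 1, 2, 3, 3, 3),
      QueryID   = c(5, 8, 5, 5, 9, 8)
    )
    # a row is "rewritten" if its session continues with another query after it
    qlog$rewritten <- as.integer(duplicated(qlog$SessionID, fromLast = TRUE))
    aggregate(rewritten ~ QueryID, data = qlog, FUN = mean)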
  • User session example (the same session figure, revisited to illustrate the query features)
  • Time-concerned features:
    1. (Q) → average time to first click
    2. (Q, D) → average time spent reading a document D
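A sketch of how feature 2 (average reading time) could be approximated from the example session above: the dwell time of a click is the gap until the next logged action. That this is exactly how the team measured it is an assumption.

    times   <- c(0, 10, 35, 100, 130, 150, 170)      # timestamps from the example session
    actions <- c("Q", "C", "C", "C", "Q", "C", "C")
    gaps  <- diff(times)                             # time from each action to the next
    dwell <- gaps[actions[-length(actions)] == "C"]  # keep gaps that start at a click
    mean(dwell)                                      # 25 65 30 20 -> average 35; the last click's dwell is unknown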
  • User session example (the same session figure, revisited to illustrate the time-concerned features)
  • Two-phase extraction:
    1. Normalization: lookup filtering by the 'important triples' set; normalization is specific to each feature
    2. Grouping and aggregating
  • Important triples (table not captured in the transcript)
  • Normalization: convert click-log entries into a relational table with the following attributes:
    • feature-domain attributes, e.g. (Q, R, U) or (Q, U) for document features, (Q, R) or (Q) for query features
    • the feature attribute value
    Sequential processing, session by session:
    • reject spam sessions
    • emit values (possibly repeated)
  • Normalization example (I). Click log (SessionID and TimePassed omitted; for C rows the second field is the clicked URLID):

    Action  QueryID  RegionID  URLs
    Q       174      0         1625 1627 1623 2510 2524
    Q       1974     0         2091 17562 1626 1623 1627
    C       17562
    C       1627
    C       1625
    C       2510

    Intermediate table for the 'average click position' feature:

    QueryID  URLID  RegionID  ClickPosition
    1974     17562  0         1
    1974     1627   0         2
    174      1625   0         1
    174      2510   0         2
  • Normalization example (II). Click log (SessionID omitted; for C rows the second field is the clicked URLID):

    Time  Action  QueryID  RegionID  URLs
    0     Q       5        0         99 16 87 39
    6     C       84
    120   Q       558      0         84 5043 5041 5039
    125   Q       8768     0         74672 74661 74674 74671
    145   C       74661

    Intermediate table for the 'time to first click' feature:

    QueryID  RegionID  FirstClickTime
    5        0         6
    8768     0         20
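A hedged sketch of the session-by-session normalization pass for the 'time to first click' feature, reproducing the intermediate table above; the flat data-frame layout is an assumption about the log format.

    clicklog <- data.frame(
      Time     = c(0, 6, 120, 125, 145),
      Action   = c("Q", "C", "Q", "Q", "C"),
      QueryID  = c(5, NA, 558, 8768, NA),
      RegionID = c(0, NA, 0, 0, NA)
    )
    rows <- list(); lastQ <- NULL
    for (i in seq_len(nrow(clicklog))) {
      if (clicklog$Action[i] == "Q") {
        lastQ <- clicklog[i, ]                  # remember the most recent query
      } else if (!is.null(lastQ)) {             # first click after that query
        rows[[length(rows) + 1]] <- data.frame(
          QueryID = lastQ$QueryID, RegionID = lastQ$RegionID,
          FirstClickTime = clicklog$Time[i] - lastQ$Time)
        lastQ <- NULL                           # later clicks on the same query are ignored
      }
    }
    do.call(rbind, rows)   # QueryID 5 -> 6, QueryID 8768 -> 20; query 558 emits nothing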
  • Aggregation example, by triple (figure not captured in the transcript)
  • Aggregation example, by QU-pair (figure not captured in the transcript)
  • Our final ML-based solution in a nutshell:
    • a binary classification task for predicting assessors' labels
    • 26 features extracted from the click log
    • a Gradient Boosted Trees learning model (the gbm R package)
    • model parameters tuned w.r.t. AUC averaged over the given query-region pairs
    • URLs ranked according to the best model's probability scores
  • Training data (screenshot slides showing the target values, feature values, and missing values; not captured in the transcript)
  • Data analysis scheme:
    1. Start from the given initial training and test sets
    2. Partition the initial training set into two sets: a training set (3/4) and a test set (1/4)
    3. Consider the following models: Gradient Boosted Trees (Bernoulli distribution; 0-1 loss function), Gradient Boosted Trees (AdaBoost distribution; exponential loss function), and logistic regression
    4. Learn and tune parameters w.r.t. the target metric (Area Under the ROC Curve) on the training set using 3-fold cross-validation
    5. Obtain estimates of the target metric on the test set
    6. Choose the best model, refit it on the whole initial training set, and apply it to the initial test set
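Steps 2 and 4 could be set up with the caret package (which the appendix code already loads); a minimal sketch, assuming a data frame train with a RelevanceLabel column:

    library(caret)
    set.seed(1)
    idx      <- createDataPartition(train$RelevanceLabel, p = 3/4, list = FALSE)
    trainSet <- train[idx, ]    # 3/4 for learning and tuning
    testSet  <- train[-idx, ]   # 1/4 held out for the test estimate of AUC
    folds    <- createFolds(trainSet$QueryID, k = 3)   # 3-fold CV for parameter tuning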
  • Boosting [Schapire, 1990]: given a training set (x_1, y_1), …, (x_N, y_N) with y_i ∈ {−1, +1}, for t = 1, …, T:
    • construct a distribution D_t on {1, …, N}
    • sample examples from it, concentrating on the “hardest” ones
    • learn a “weak classifier” (at least better than random) h_t : X → {−1, +1} with error ε_t on D_t, where ε_t = P_{i∼D_t}(h_t(x_i) ≠ y_i)
    Output the final classifier H as a weighted majority vote of the h_t
  • AdaBoost [Freund & Schapire, 1997]. Constructing D_t:
    • D_1(i) = 1/N
    • given D_t and h_t:
        D_{t+1}(i) = (D_t(i) / Z_t) × e^{−α_t}  if y_i = h_t(x_i)
        D_{t+1}(i) = (D_t(i) / Z_t) × e^{α_t}   if y_i ≠ h_t(x_i)
      where Z_t is a normalization factor and α_t = (1/2) ln((1 − ε_t) / ε_t) > 0
    Final classifier: H(x) = sign(Σ_t α_t h_t(x))
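To make the update concrete, here is a toy R sketch of a single AdaBoost round on made-up labels and weak-classifier outputs (not part of the contest code):

    y <- c(1, 1, -1, -1)                 # true labels
    h <- c(1, -1, -1, -1)                # weak classifier's predictions (one mistake)
    D <- rep(1 / length(y), length(y))   # D_1(i) = 1/N
    eps <- sum(D[h != y])                # weighted error on D: 0.25
    alpha <- 0.5 * log((1 - eps) / eps)  # ~0.549 > 0
    D <- D * exp(-alpha * y * h)         # shrink weights of correct, grow weights of wrong
    D / sum(D)                           # divide by Z_t; the one mistake now carries weight 0.5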
  • Gradient boosted trees [Friedman, 2001]:
    • stochastic gradient descent optimization of the loss function
    • decision trees as the weak learner
    • no feature normalization required
    • no special handling of missing values needed
    • good performance reported in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010], [Gulin et al., 2011]
  • Gradient boosted trees, gbm R package implementation: two distributions are available for classification tasks, Bernoulli and AdaBoost, and there are three basic parameters: interaction depth (depth of each tree), number of trees (iterations), and shrinkage (learning rate).
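A minimal fit-and-rank sketch with gbm, naming the three parameters from the slide; the data frames train and test and the column names are assumptions standing in for the real feature table:

    library(gbm)
    fit <- gbm(RelevanceLabel ~ ., data = train,
               distribution = "bernoulli",   # or "adaboost" for the exponential loss
               interaction.depth = 2,        # depth of each tree
               n.trees = 500,                # number of boosting iterations
               shrinkage = 0.01)             # learning rate
    # rank a query-region group's URLs by the model's probability scores
    scores <- predict(fit, newdata = test, n.trees = 500, type = "response")
    ranked <- test$URLID[order(scores, decreasing = TRUE)]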
  • Logistic regression (glm, stats R package): preprocess the initial training data by imputing missing values with bagged trees, then fit the generalized linear model f(x) = 1 / (1 + e^{−z}), where z = β_0 + β_1 x_1 + · · · + β_k x_k
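A hedged sketch of this baseline: bagged-tree imputation via caret's preProcess, then a binomial GLM; featureCols is a hypothetical vector of feature column names.

    library(caret)
    pre <- preProcess(train[, featureCols], method = "bagImpute")  # bagged-tree imputation
    X   <- predict(pre, train[, featureCols])
    X$RelevanceLabel <- train$RelevanceLabel
    fit  <- glm(RelevanceLabel ~ ., data = X, family = binomial)   # f(x) = 1 / (1 + exp(-z))
    prob <- predict(fit, X, type = "response")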
  • Tuning the gbm Bernoulli model: 3-fold CV estimate of AUC for the optimal parameters: 0.6457435
  • Tuning the gbm AdaBoost model: 3-fold CV estimate of AUC for the optimal parameters: 0.6455384
  • Comparative performance of the three optimal models (test error estimates):

    Model                Optimal parameter values                          Test estimate of AUC
    gbm (Bernoulli)      interaction.depth=2, n.trees=500, shrinkage=0.01  0.6324717
    gbm (AdaBoost)       interaction.depth=4, n.trees=700, shrinkage=0.01  0.6313393
    logistic regression  -                                                 0.618648
  • Variable importance according to the best model (plot not captured in the transcript)
  • Contest statistics: 101 participants, 84 of them eligible for a prize. Two-stage evaluation: a validation set and a test set, whose sizes were unknown during the contest. The validation set holds ≈ 11,000 instances; the test set ≈ 20,000 instances.
  • Preliminary results (validation set): 19th place (AUC = 0.650004)
  • Final results (test set): 34th place (AUC = 0.643346)

    #   Team         AUC
    1   cointegral*  0.667362
    2   Evlampiy*    0.66506
    3   alsafr*      0.664527
    4   alexeigor*   0.663169
    5   keinorhasen  0.660982
    6   mmp          0.659914
    7   Cutter*      0.659452
    8   S-n-D        0.658103
    …   …            …
    34  CLL          0.643346
    …   …            …
  • Acknowledgements: we would like to thank the organizers from Yandex for an exciting challenge, and E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev, and other colleagues from Kazan Federal University for fruitful discussions and support.
  • References I
    [Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 1997, pp. 119–139.
    [Friedman, 2001] Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 2001, pp. 1189–1232.
    [Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning the Transfer Learning Track of Yahoo!'s Learning to Rank Challenge with YetiRank. JMLR: Workshop and Conference Proceedings, 2011, pp. 63–76.
    [Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: user behavior as a predictor of a successful search. Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2010, pp. 221–230.
  • References II
    [Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining user web search activity with layered Bayesian networks, or how to capture a click in its context. Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, 2009, pp. 162–171.
    [Schapire, 1990] Schapire, R. The strength of weak learnability. Machine Learning, 5(2), 1990, pp. 197–227.
  • Compute AUC for a gbm model:

    ComputeAUC <- function(fit, ntrees, testSet) {
      require(ROCR)
      require(foreach)
      require(gbm)
      # feature columns only: drop IDs and the label
      pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
      queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
      count <- nrow(queryRegions)
      # compute one AUC per query-region group, then average
      aucValues <- foreach(i = 1:count, .combine = "c") %do% {
        queryId <- queryRegions[i, "QueryID"]
        regionId <- queryRegions[i, "RegionID"]
        true.labels <- testSet[testSet$QueryID == queryId &
                               testSet$RegionID == regionId, ]$RelevanceLabel
        m <- mean(true.labels)
        if (m == 0 | m == 1) {
          # AUC is undefined when all labels in the group are identical
          curAUC <- NA
        } else {
          gbm.predictions <- predict.gbm(fit,
            pureTestSet[testSet$QueryID == queryId & testSet$RegionID == regionId, ],
            n.trees = ntrees, type = "response")
          pred <- prediction(gbm.predictions, true.labels)
          perf <- performance(pred, "auc")
          curAUC <- perf@y.values[[1]][1]
        }
        curAUC
      }
      return(mean(aucValues, na.rm = TRUE))
    }
  • Tuning AUC for a gbm model:

    TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                             minNumTrees = 100, maxNumTrees = 1500, step = 100,
                             shrinkage = .01, distribution = "bernoulli",
                             aucfunction = ComputeAUC) {
      require(gbm)
      require(foreach)
      require(caret)
      require(sqldf)
      FUN <- match.fun(aucfunction)
      ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)
      folds <- createFolds(trainSet$QueryID, k = foldsNum, list = TRUE, returnTrain = TRUE)
      aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
        inTrain <- folds[[i]]
        cvTrainData <- trainSet[inTrain, ]
        cvTestData <- trainSet[-inTrain, ]
        pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))
        # formula(data.frame) regresses the first column (the label) on the rest
        gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                      distribution = distribution, interaction.depth = interactionDepth,
                      n.trees = maxNumTrees, shrinkage = shrinkage)
        # evaluate each candidate number of trees on the held-out fold
        foreach(n = ntreesSeq, .combine = "rbind") %do% {
          auc <- FUN(gbmFit, n, cvTestData)
          c(n, auc)
        }
      }
      aucvalues <- as.data.frame(aucvalues)
      avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
      return(avgAuc)
    }
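A usage sketch for the two functions above, assuming trainSet is prepared as described earlier:

    grid <- TuningGbmFit(trainSet, foldsNum = 3, distribution = "bernoulli")
    grid[which.max(grid$AvgAUC), ]   # the n.trees value with the best average CV AUC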