1. Yandex Relevance Prediction Challenge
Overview of “CLL” team’s solution
R. Gareev (1), D. Kalyanov (2), A. Shaykhutdinova (1), N. Zhiltsov (1)
(1) Kazan (Volga Region) Federal University
(2) 10tracks.ru
28 December 2011
1 / 52
2. Outline
1 Problem Statement
2 Features
3 Feature Extraction
4 Statistical analysis
5 Contest Results
6 Appendix A. References
7 Appendix B. R functions
4. Problem statement
Predict document relevance from user behavior, a.k.a. Implicit Relevance Feedback
See http://imat-relpred.yandex.ru/en for more details
5. User session example
Region R
Q1 at T = 0 ⇒ SERP: 1 2 3 4 5; clicks: 3 (T = 10), 5 (T = 35), 1 (T = 100)
Q2 at T = 130 ⇒ SERP: 6 7 8 9 10; clicks: 6 (T = 150), 9 (T = 170)
6. Labeled data
Given judgements for some pairs of documents and queries:
a document Dj is relevant for a query Qi from a region R, or
a document Dj is not relevant for a query Qi from a region R
7. The problem
Given a set Q of search queries, for each pair (q, R) ∈ Q provide a sorted list of documents D1, . . . , Dm that are relevant to q in the region R
Area Under the ROC Curve (AUC), averaged over all the test query–region pairs, is the target evaluation metric
8. AUC score
Consider a ranked list of documents D1, . . . , Di, . . . , Dm
Each prefix of length i yields one point (FPR(i), TPR(i)) on the ROC curve
AUC is the area under the ROC curve
Equivalently: AUC is the probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document
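This pairwise reading of AUC is easy to check directly. A minimal sketch (in Python rather than the R used elsewhere in the deck; all names are illustrative) scores a list of documents and counts correctly ordered relevant/non-relevant pairs:

```python
def auc(scores, labels):
    """AUC = P(random relevant doc scores above a random non-relevant one).

    scores: classifier certainty per document; labels: 1 = relevant, 0 = not.
    Ties count as half of a correctly ordered pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both relevant and non-relevant documents")
    ordered = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return ordered / (len(pos) * len(neg))
```

With four documents scored 0.9, 0.8, 0.3, 0.2 and labels 1, 0, 1, 0, three of the four (relevant, non-relevant) pairs are ordered correctly, so AUC = 0.75.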
9. Our problem restatement
We treat it as a machine learning task
Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether document D is relevant to a query Q from a region R
Replace RegionID, QueryID and DocumentID with related features extracted from the click log
Use the classifier H(R, Q, D) to compute a list, sorted by the classifier's certainty scores, for a query Q from a region R
11. Features
A feature is a function of (Q, R, D).
Each feature is computed in two variants: with and without conditioning on the region
Types:
Document features
Query features
Time-based features
12. Document features
1 (Q, D) → Number of occurrences of a URL in the SERP list
2 (Q, D) → Number of clicks
3 (Q, D) → Click-through rate
4 (Q, D) → Average position in the click sequence
5 (Q, D) → Average rank in the SERP list
6 (Q, D) → Average rank in the SERP list when the URL is clicked
7 (Q, D) → Probability of being clicked last
8 (Q, D) → Probability of being clicked first
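Features 1–3 above reduce to counting shows and clicks per (query, URL) key. A sketch over a simplified log (the triples and field names are illustrative; the real contest log also carries region, session and timestamp fields):

```python
from collections import defaultdict

def document_features(log):
    """Compute click-through rate per (query, url) from a simplified log.

    `log` is a list of (query, url, clicked) triples: one row per SERP
    appearance of a URL, with `clicked` marking whether it was clicked.
    """
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for query, url, clicked in log:
        shows[(query, url)] += 1          # feature 1: occurrences in SERPs
        if clicked:
            clicks[(query, url)] += 1     # feature 2: number of clicks
    # feature 3: click-through rate = clicks / shows
    return {key: clicks[key] / shows[key] for key in shows}
```

The averages and first/last-click probabilities follow the same group-by-key pattern with different emitted values.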
14. Query features
1 (Q) → Average number of clicks in a subsession
2 (Q) → Probability of being rewritten (i.e. of not being the last query in a session)
3 (Q) → Probability of being resolved (i.e. of its results being clicked last)
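Feature 2, for instance, is a per-query ratio over sessions. A sketch under a simplified representation (sessions as lists of query strings; the function name is illustrative):

```python
def rewrite_probability(sessions, query):
    """Share of sessions containing `query` in which it was not the
    last query issued (the slide's 'probability of being rewritten').

    `sessions` is a list of query lists, a simplification of the real
    session log.
    """
    containing = [s for s in sessions if query in s]
    if not containing:
        return 0.0
    rewritten = sum(1 for s in containing if s[-1] != query)
    return rewritten / len(containing)
```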
16. Time-based features
1 (Q) → Average time to first click
2 (Q, D) → Average time spent reading a document D
19. Two-phase extraction
1 Normalization
• lookup filtering by the 'important triples' set
• normalization is specific to each feature
2 Grouping and aggregation
21. Normalization
Converting click-log entries to a relational table with the following attributes:
• feature domain attributes, e.g.:
• (Q, R, U), (Q, U) for document features
• (Q, R), (Q) for query features
• feature value
Sequential processing session-by-session:
• reject spam sessions
• emit values (possibly repeated)
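The two phases above fit an emit-then-aggregate pattern. A sketch with averaging as the aggregate (the function names and the per-session `normalize` callback are illustrative, not the team's actual code):

```python
from collections import defaultdict

def extract(sessions, normalize, is_spam=lambda s: False):
    """Two-phase extraction: per-session normalization emits (key, value)
    rows, possibly with repeated keys; grouping then averages per key.

    `normalize` is any per-session emitter, e.g. time-to-first-click
    keyed by (Q, R); `is_spam` rejects spam sessions before emission.
    """
    groups = defaultdict(list)
    for session in sessions:
        if is_spam(session):                      # phase 1: reject spam
            continue
        for key, value in normalize(session):     # phase 1: emit rows
            groups[key].append(value)
    # phase 2: group and aggregate (here: mean per key)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```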
27. Our final ML based solution in a nutshell
Binary classification task for predicting assessors' labels
26 features extracted from the click log
Gradient Boosted Trees learning model (the gbm R package)
Tuning the model's parameters w.r.t. AUC averaged over the given query–region pairs
Ranking URLs by the best model's probability scores
37. Data Analysis Scheme
1 Given the initial training and test sets
2 Partition the initial training set into two sets:
• a training set (3/4)
• a test set (1/4)
3 Consider the following models:
• Gradient Boosted Trees (Bernoulli distribution; 0–1 loss function)
• Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
• Logistic Regression
4 Learn and tune parameters w.r.t. the target metric (Area Under the ROC Curve) on the training set using 3-fold cross-validation
5 Obtain estimates of the target metric on the test set
6 Choose the best model, refit it on the whole initial training set, and apply it to the initial test set
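The scheme can be sketched end to end with scikit-learn stand-ins for the R models (gbm → GradientBoostingClassifier, glm → LogisticRegression) on a synthetic dataset; the real pipeline ran on the 26 click-log features, so everything below is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the labeled (query, region, document) instances
X, y = make_classification(n_samples=400, n_features=26, random_state=0)

# Step 2: 3/4 training set, 1/4 held-out test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 3-4: candidate models, compared by AUC under 3-fold CV on the train part
models = {
    "gbt": GradientBoostingClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
cv_auc = {name: cross_val_score(m, X_tr, y_tr, cv=3, scoring="roc_auc").mean()
          for name, m in models.items()}

# Steps 5-6: estimate on the held-out quarter, then refit the best model
best = max(cv_auc, key=cv_auc.get)
models[best].fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, models[best].predict_proba(X_te)[:, 1])
```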
38. Boosting
[Schapire, 1990]
Given a training set (x1, y1), . . . , (xN, yN), yi ∈ {−1, +1}
For t = 1, . . . , T:
• construct a distribution Dt on {1, . . . , N}
• sample examples from it, concentrating on the "hardest" ones
• learn a "weak classifier" (at least better than random) ht : X → {−1, +1} with error εt on Dt:
εt = P_{i ∼ Dt} [ht(xi) ≠ yi]
Output the final classifier H as a weighted majority vote of the ht
39. AdaBoost
[Freund & Schapire, 1997]
Constructing Dt:
• D1(i) = 1/N
• given Dt and ht:
Dt+1(i) = (Dt(i) / Zt) × e^(−αt) if yi = ht(xi), and × e^(αt) if yi ≠ ht(xi),
where Zt is a normalization factor and
αt = (1/2) ln((1 − εt) / εt) > 0
Final classifier:
H(x) = sign(Σt αt ht(x))
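These update rules translate almost line by line into code. A compact sketch (illustrative names, not the contest code) that picks, at each round, the weak classifier with the lowest weighted error from a fixed pool:

```python
import math

def adaboost(examples, weak_learners, T):
    """AdaBoost following the update rules above.

    `examples` is a list of (x, y) pairs with y in {-1, +1};
    `weak_learners` is a pool of candidate classifiers h(x) -> {-1, +1}.
    """
    n = len(examples)
    d = [1.0 / n] * n                        # D_1(i) = 1/N
    committee = []                           # chosen (alpha_t, h_t) pairs
    for _ in range(T):
        # pick the weak classifier with the lowest weighted error eps_t on D_t
        h, eps = min(
            ((h, sum(w for w, (x, y) in zip(d, examples) if h(x) != y))
             for h in weak_learners),
            key=lambda pair: pair[1])
        if eps == 0 or eps >= 0.5:
            break                            # perfect, or no better than random
        alpha = 0.5 * math.log((1 - eps) / eps)
        committee.append((alpha, h))
        # shrink weights of correctly classified examples, grow the mistakes
        d = [w * math.exp(-alpha if h(x) == y else alpha)
             for w, (x, y) in zip(d, examples)]
        z = sum(d)                           # normalization factor Z_t
        d = [w / z for w in d]
    def H(x):                                # weighted majority vote
        return 1 if sum(a * h(x) for a, h in committee) >= 0 else -1
    return H
```

On a 1-D toy set with threshold stumps as the weak-learner pool, three rounds suffice to separate data that no single stump can classify.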
40. Gradient boosted trees
[Friedman, 2001]
Stochastic gradient descent optimization of the loss function
Decision trees as the weak learner
Does not require feature normalization
No need to handle missing values specially
Good performance reported in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010], [Gulin et al., 2011]
41. Gradient boosted trees
The gbm R package implementation
Two distributions are available for classification tasks: Bernoulli and AdaBoost
Three basic parameters: interaction depth (depth of each tree), number of trees (boosting iterations) and shrinkage (learning rate)
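As an illustrative stand-in for the R call (the deck itself used gbm), scikit-learn's GradientBoostingClassifier exposes the same three knobs: interaction.depth → max_depth, n.trees → n_estimators, shrinkage → learning_rate; its loss option likewise covers the Bernoulli-deviance and exponential (AdaBoost) cases. Data and values below are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = GradientBoostingClassifier(
    max_depth=3,          # gbm's interaction.depth: depth of each tree
    n_estimators=100,     # gbm's n.trees: number of boosting iterations
    learning_rate=0.1,    # gbm's shrinkage
    random_state=0,
).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # certainty scores used for ranking
```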
42. Logistic regression
glm from the stats R package
Preprocess the initial training data by imputing missing values with bagged trees
Fit the generalized linear model:
f(x) = 1 / (1 + e^(−z)), where z = β0 + β1 x1 + · · · + βk xk
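For reference, the fitted model's prediction is just the sigmoid of a linear score; a minimal sketch (illustrative names, fitting itself left to glm/scikit-learn):

```python
import math

def logistic(x, beta):
    """f(x) = 1 / (1 + e^(-z)) with z = beta[0] + beta[1]*x[0] + ... + beta[k]*x[k-1]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))
```

At z = 0 the model is maximally uncertain: f(x) = 0.5.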
48. Contest statistics
101 participants, 84 of them eligible for prizes
Two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest)
Validation set size: ≈ 11 000 instances
Test set size: ≈ 20 000 instances
50. Final Results
Test set: 34th place (AUC = 0.643346)
#    Team          AUC
1    cointegral*   0.667362
2    Evlampiy*     0.66506
3    alsafr*       0.664527
4    alexeigor*    0.663169
5    keinorhasen   0.660982
6    mmp           0.659914
7    Cutter*       0.659452
8    S-n-D         0.658103
...  ...           ...
34   CLL           0.643346
...  ...           ...
51. Acknowledgements
We would like to thank:
the organizers from Yandex for an exciting challenge
E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support
53. References I
[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[Friedman, 2001] Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning the Transfer Learning Track of Yahoo!'s Learning to Rank Challenge with YetiRank. JMLR: Workshop and Conference Proceedings, pp. 63–76, 2011.
[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: user behavior as a predictor of a successful search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 221–230, 2010.
54. References II
[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining user web search activity with layered Bayesian networks, or how to capture a click in its context. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pp. 162–171, 2009.
[Schapire, 1990] Schapire, R. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.