1. Yandex Relevance Prediction Challenge
Overview of “CLL” team’s solution
R. Gareev (1), D. Kalyanov (2), A. Shaykhutdinova (1), N. Zhiltsov (1)
(1) Kazan (Volga Region) Federal University
(2) 10tracks.ru
28 December 2011
1 / 52
2. Outline
1 Problem Statement
2 Features
3 Feature Extraction
4 Statistical analysis
5 Contest Results
6 Appendix A. References
7 Appendix B. R functions
4. Problem statement
Predict document relevance from user behavior, a.k.a. Implicit Relevance Feedback
See http://imat-relpred.yandex.ru/en for more details
5. User session example
Region R
Q1 at T = 0 ⇒ SERP: 1 2 3 4 5; clicks: 3 (T = 10), 5 (T = 35), 1 (T = 100)
Q2 at T = 130 ⇒ SERP: 6 7 8 9 10; clicks: 6 (T = 150), 9 (T = 170)
6. Labeled data
Given judgements for some pairs of documents and queries:
a document Dj is relevant for a query Qi from a region R, or
a document Dj is not relevant for a query Qi from a region R
7. The problem
Given a set Q of search queries, for each pair (q, R) ∈ Q provide a sorted list of documents D1, . . . , Dm that are relevant to q in the region R
Area Under the ROC Curve (AUC), averaged over all the test query–region pairs, is the target evaluation metric
8. AUC score
Consider a ranked list of documents D1, . . . , Di, . . . , Dm
Each prefix of length i yields one point (FPR(i), TPR(i)) on the ROC curve
AUC is the area under the ROC curve
Equivalently: AUC is the probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document
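This pairwise reading of AUC is easy to check directly. A minimal sketch (in Python rather than the R used elsewhere in the deck; all names are illustrative) scores a list of documents and counts correctly ordered relevant/non-relevant pairs:

```python
def auc(scores, labels):
    """AUC = P(random relevant doc scores above a random non-relevant one).

    scores: classifier certainty per document; labels: 1 = relevant, 0 = not.
    Ties count as half of a correctly ordered pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both relevant and non-relevant documents")
    ordered = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return ordered / (len(pos) * len(neg))
```

With four documents scored 0.9, 0.8, 0.3, 0.2 and labels 1, 0, 1, 0, three of the four (relevant, non-relevant) pairs are ordered correctly, so AUC = 0.75.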
9. Our problem restatement
We treat it as a machine learning task
Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether document D is relevant to a query Q from a region R
Replace RegionID, QueryID and DocumentID with related features extracted from the click log
Use the classifier H(R, Q, D) to compute a list, sorted by the classifier's certainty scores, for a query Q from a region R
11. Features
A feature is a function of (Q, R, D).
Each feature is computed in two variants: with and without conditioning on the region
Types:
Document features
Query features
Time-based features
12. Document features
1 (Q, D) → Number of occurrences of a URL in the SERP list
2 (Q, D) → Number of clicks
3 (Q, D) → Click-through rate
4 (Q, D) → Average position in the click sequence
5 (Q, D) → Average rank in the SERP list
6 (Q, D) → Average rank in the SERP list when the URL is clicked
7 (Q, D) → Probability of being clicked last
8 (Q, D) → Probability of being clicked first
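Features 1–3 above reduce to counting shows and clicks per (query, URL) key. A sketch over a simplified log (the triples and field names are illustrative; the real contest log also carries region, session and timestamp fields):

```python
from collections import defaultdict

def document_features(log):
    """Compute click-through rate per (query, url) from a simplified log.

    `log` is a list of (query, url, clicked) triples: one row per SERP
    appearance of a URL, with `clicked` marking whether it was clicked.
    """
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for query, url, clicked in log:
        shows[(query, url)] += 1          # feature 1: occurrences in SERPs
        if clicked:
            clicks[(query, url)] += 1     # feature 2: number of clicks
    # feature 3: click-through rate = clicks / shows
    return {key: clicks[key] / shows[key] for key in shows}
```

The averages and first/last-click probabilities follow the same group-by-key pattern with different emitted values.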
14. Query features
1 (Q) → Average number of clicks in a subsession
2 (Q) → Probability of being rewritten (i.e. of not being the last query in a session)
3 (Q) → Probability of being resolved (i.e. of its results being clicked last)
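Feature 2, for instance, is a per-query ratio over sessions. A sketch under a simplified representation (sessions as lists of query strings; the function name is illustrative):

```python
def rewrite_probability(sessions, query):
    """Share of sessions containing `query` in which it was not the
    last query issued (the slide's 'probability of being rewritten').

    `sessions` is a list of query lists, a simplification of the real
    session log.
    """
    containing = [s for s in sessions if query in s]
    if not containing:
        return 0.0
    rewritten = sum(1 for s in containing if s[-1] != query)
    return rewritten / len(containing)
```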
16. Time-based features
1 (Q) → Average time to first click
2 (Q, D) → Average time spent reading a document D
19. Two-phase extraction
1 Normalization
• lookup filtering by the 'important triples' set
• normalization is specific to each feature
2 Grouping and aggregation
21. Normalization
Converting click-log entries to a relational table with the following attributes:
• feature domain attributes, e.g.:
• (Q, R, U), (Q, U) for document features
• (Q, R), (Q) for query features
• feature value
Sequential processing session-by-session:
• reject spam sessions
• emit values (possibly repeated)
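The two phases above fit an emit-then-aggregate pattern. A sketch with averaging as the aggregate (the function names and the per-session `normalize` callback are illustrative, not the team's actual code):

```python
from collections import defaultdict

def extract(sessions, normalize, is_spam=lambda s: False):
    """Two-phase extraction: per-session normalization emits (key, value)
    rows, possibly with repeated keys; grouping then averages per key.

    `normalize` is any per-session emitter, e.g. time-to-first-click
    keyed by (Q, R); `is_spam` rejects spam sessions before emission.
    """
    groups = defaultdict(list)
    for session in sessions:
        if is_spam(session):                      # phase 1: reject spam
            continue
        for key, value in normalize(session):     # phase 1: emit rows
            groups[key].append(value)
    # phase 2: group and aggregate (here: mean per key)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```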
27. Our final ML based solution in a nutshell
Binary classification task for predicting assessors' labels
26 features extracted from the click log
Gradient Boosted Trees learning model (the gbm R package)
Tuning the model's parameters w.r.t. AUC averaged over the given query–region pairs
Ranking URLs by the best model's probability scores
37. Data Analysis Scheme
1 Given the initial training and test sets
2 Partition the initial training set into two sets:
• a training set (3/4)
• a test set (1/4)
3 Consider the following models:
• Gradient Boosted Trees (Bernoulli distribution; 0–1 loss function)
• Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
• Logistic Regression
4 Learn and tune parameters w.r.t. the target metric (Area Under the ROC Curve) on the training set using 3-fold cross-validation
5 Obtain estimates of the target metric on the test set
6 Choose the best model, refit it on the whole initial training set, and apply it to the initial test set
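The scheme can be sketched end to end with scikit-learn stand-ins for the R models (gbm → GradientBoostingClassifier, glm → LogisticRegression) on a synthetic dataset; the real pipeline ran on the 26 click-log features, so everything below is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the labeled (query, region, document) instances
X, y = make_classification(n_samples=400, n_features=26, random_state=0)

# Step 2: 3/4 training set, 1/4 held-out test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 3-4: candidate models, compared by AUC under 3-fold CV on the train part
models = {
    "gbt": GradientBoostingClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
cv_auc = {name: cross_val_score(m, X_tr, y_tr, cv=3, scoring="roc_auc").mean()
          for name, m in models.items()}

# Steps 5-6: estimate on the held-out quarter, then refit the best model
best = max(cv_auc, key=cv_auc.get)
models[best].fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, models[best].predict_proba(X_te)[:, 1])
```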
38. Boosting
[Schapire, 1990]
Given a training set (x1, y1), . . . , (xN, yN), yi ∈ {−1, +1}
For t = 1, . . . , T:
• construct a distribution Dt on {1, . . . , N}
• sample examples from it, concentrating on the "hardest" ones
• learn a "weak classifier" (at least better than random) ht : X → {−1, +1} with error εt on Dt:
εt = P_{i ∼ Dt} [ht(xi) ≠ yi]
Output the final classifier H as a weighted majority vote of the ht
39. AdaBoost
[Freund & Schapire, 1997]
Constructing Dt:
• D1(i) = 1/N
• given Dt and ht:
Dt+1(i) = (Dt(i) / Zt) × e^(−αt) if yi = ht(xi), and × e^(αt) if yi ≠ ht(xi),
where Zt is a normalization factor and
αt = (1/2) ln((1 − εt) / εt) > 0
Final classifier:
H(x) = sign(Σt αt ht(x))
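These update rules translate almost line by line into code. A compact sketch (illustrative names, not the contest code) that picks, at each round, the weak classifier with the lowest weighted error from a fixed pool:

```python
import math

def adaboost(examples, weak_learners, T):
    """AdaBoost following the update rules above.

    `examples` is a list of (x, y) pairs with y in {-1, +1};
    `weak_learners` is a pool of candidate classifiers h(x) -> {-1, +1}.
    """
    n = len(examples)
    d = [1.0 / n] * n                        # D_1(i) = 1/N
    committee = []                           # chosen (alpha_t, h_t) pairs
    for _ in range(T):
        # pick the weak classifier with the lowest weighted error eps_t on D_t
        h, eps = min(
            ((h, sum(w for w, (x, y) in zip(d, examples) if h(x) != y))
             for h in weak_learners),
            key=lambda pair: pair[1])
        if eps == 0 or eps >= 0.5:
            break                            # perfect, or no better than random
        alpha = 0.5 * math.log((1 - eps) / eps)
        committee.append((alpha, h))
        # shrink weights of correctly classified examples, grow the mistakes
        d = [w * math.exp(-alpha if h(x) == y else alpha)
             for w, (x, y) in zip(d, examples)]
        z = sum(d)                           # normalization factor Z_t
        d = [w / z for w in d]
    def H(x):                                # weighted majority vote
        return 1 if sum(a * h(x) for a, h in committee) >= 0 else -1
    return H
```

On a 1-D toy set with threshold stumps as the weak-learner pool, three rounds suffice to separate data that no single stump can classify.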
40. Gradient boosted trees
[Friedman, 2001]
Stochastic gradient descent optimization of the loss function
Decision trees as the weak learner
Does not require feature normalization
No need to handle missing values specially
Good performance reported in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010], [Gulin et al., 2011]
41. Gradient boosted trees
The gbm R package implementation
Two distributions are available for classification tasks: Bernoulli and AdaBoost
Three basic parameters: interaction depth (depth of each tree), number of trees (boosting iterations) and shrinkage (learning rate)
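As an illustrative stand-in for the R call (the deck itself used gbm), scikit-learn's GradientBoostingClassifier exposes the same three knobs: interaction.depth → max_depth, n.trees → n_estimators, shrinkage → learning_rate; its loss option likewise covers the Bernoulli-deviance and exponential (AdaBoost) cases. Data and values below are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = GradientBoostingClassifier(
    max_depth=3,          # gbm's interaction.depth: depth of each tree
    n_estimators=100,     # gbm's n.trees: number of boosting iterations
    learning_rate=0.1,    # gbm's shrinkage
    random_state=0,
).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # certainty scores used for ranking
```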
42. Logistic regression
glm from the stats R package
Preprocess the initial training data by imputing missing values with bagged trees
Fit the generalized linear model:
f(x) = 1 / (1 + e^(−z)), where z = β0 + β1 x1 + · · · + βk xk
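For reference, the fitted model's prediction is just the sigmoid of a linear score; a minimal sketch (illustrative names, fitting itself left to glm/scikit-learn):

```python
import math

def logistic(x, beta):
    """f(x) = 1 / (1 + e^(-z)) with z = beta[0] + beta[1]*x[0] + ... + beta[k]*x[k-1]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))
```

At z = 0 the model is maximally uncertain: f(x) = 0.5.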
48. Contest statistics
101 participants, 84 of them eligible for prizes
Two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest)
Validation set size: ≈ 11 000 instances
Test set size: ≈ 20 000 instances
50. Final Results
Test set: 34th place (AUC = 0.643346)
#    Team          AUC
1    cointegral*   0.667362
2    Evlampiy*     0.66506
3    alsafr*       0.664527
4    alexeigor*    0.663169
5    keinorhasen   0.660982
6    mmp           0.659914
7    Cutter*       0.659452
8    S-n-D         0.658103
...  ...           ...
34   CLL           0.643346
...  ...           ...
51. Acknowledgements
We would like to thank:
the organizers from Yandex for an exciting challenge
E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support
53. References I
[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[Friedman, 2001] Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.
[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning the Transfer Learning Track of Yahoo!'s Learning to Rank Challenge with YetiRank. JMLR: Workshop and Conference Proceedings, pp. 63–76, 2011.
[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: user behavior as a predictor of a successful search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 221–230, 2010.
54. References II
[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining user web search activity with layered Bayesian networks, or how to capture a click in its context. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pp. 162–171, 2009.
[Schapire, 1990] Schapire, R. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.