Yandex Relevance Prediction Challenge
                  Overview of “CLL” team’s solution


R. Gareev1 , D. Kalyanov2 , A. Shaykhutdinova1 , N. Zhiltsov1

            1   Kazan (Volga Region) Federal University
                             2
                                 10tracks.ru

                        28 December 2011




                                                                1 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            2 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            3 / 52
Problem statement



   Predict document relevance from user behavior, a.k.a.
    Implicit Relevance Feedback
   See also http://imat-relpred.yandex.ru/en for
   more details




                                                         4 / 52
User session example


           Region
              Q1 ⇒ 1 2 3 4 5 T = 0
                   3          T = 10
                   5          T = 35
                   1          T = 100
              Q2 ⇒ 6 7 8 9 10 T = 130
                   6          T = 150
                   9          T = 170




                                        5 / 52
Labeled data


Given judgements for some pairs of documents and queries:

   a document Dj is relevant for a query Qi from a
   region R
   or
   a document Dj is not relevant for a query Qi from a
   region R




                                                       6 / 52
The problem



   Given a set Q of search queries, for each (q, R) ∈ Q
   provide a sorted list of documents D1 , . . . , Dm that are
   relevant to q in the region R
   Area Under the ROC Curve (AUC) averaged over all
   the test query-region pairs is the target evaluation metric




                                                           7 / 52
AUC score




   Consider a ranked list of documents D1 , . . . , Di , . . . , Dm ;
   the prefix of length i gives a single point (FPR(i), TPR(i))
   on the ROC curve
   AUC is the area under the ROC curve
   AUC = probability that a randomly chosen relevant document
   comes before a randomly chosen non-relevant document
                                                              8 / 52
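
A quick way to see the probabilistic identity in code: the sketch below (ours,
not part of the contest materials) computes AUC via the equivalent rank-sum
(Mann-Whitney) formula; 'scores' and 'labels' are illustrative names.

# AUC as P(random relevant ranked above random non-relevant),
# via the rank-sum identity; labels: 1 = relevant, 0 = non-relevant
ComputeAUCByRanks <- function(scores, labels) {
  n.pos <- sum(labels == 1)
  n.neg <- sum(labels == 0)
  r <- rank(scores)                       # average ranks handle ties
  (sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

ComputeAUCByRanks(c(0.9, 0.8, 0.3, 0.1), c(1, 1, 0, 0))   # perfect ranking: 1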
Our problem restatement

   We consider it as a machine learning task
   Using relevance judgements, learn a classifier
   H(R, Q, D) that predicts that document D is relevant
   to a query Q from a region R
   Replace RegionID, QueryID and DocumentID with
   related features extracted from click log
   Use the classifier H(R, Q, D) to compute a list, sorted
   w.r.t. classifier’s certainty scores, for a query Q from a
   region R


                                                          9 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            10 / 52
Features


   A feature is a function of (Q, R, D).
   Each feature comes in two variants: one tied to its
   related region and one aggregated over all regions

Types
   Document features
   Query features
   Time-concerned features



                                                        11 / 52
Document features


  1   (Q, D) →   Number of occurrences of a URL in the SERP list
  2   (Q, D) →   Number of clicks
  3   (Q, D) →   Click-through rate
  4   (Q, D) →   Average position in the click sequence
  5   (Q, D) →   Average rank in the SERP list
  6   (Q, D) →   Average rank in the SERP list when URL is clicked
  7   (Q, D) →   Probability of being last clicked
  8   (Q, D) →   Probability of being first clicked




                                                                 12 / 52
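
To make features 2-3 concrete, here is a hedged R sketch computing per-(Q, D)
click counts and click-through rate from toy impression and click tables; the
table layout is an assumption for illustration, not the contest log format.

# Toy tables; the real pipeline parses the click log session by session
impressions <- data.frame(QueryID = c(1, 1, 1, 2), URLID = c(10, 10, 11, 10))
clicks      <- data.frame(QueryID = c(1, 2),       URLID = c(10, 10))

shows  <- aggregate(list(Shows  = impressions$QueryID),
                    impressions[c("QueryID", "URLID")], length)
clickN <- aggregate(list(Clicks = clicks$QueryID),
                    clicks[c("QueryID", "URLID")], length)
ctr <- merge(shows, clickN, all.x = TRUE)        # keep URLs with no clicks
ctr$Clicks[is.na(ctr$Clicks)] <- 0               # feature 2: number of clicks
ctr$CTR <- ctr$Clicks / ctr$Shows                # feature 3: click-through rate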
User session example


           Region
              Q1 ⇒ 1 2 3 4 5 T = 0
                   3          T = 10
                   5          T = 35
                   1          T = 100
              Q2 ⇒ 6 7 8 9 10 T = 130
                   6          T = 150
                   9          T = 170




                                        13 / 52
Query features



  1   (Q) → Average number of clicks in subsession
  2   (Q) → Probability of being rewritten (i.e., not being the last
      query in the session)
  3   (Q) → Probability of being resolved (i.e., one of its results
      being the last clicked)




                                                                        14 / 52
User session example


           Region
              Q1 ⇒ 1 2 3 4 5 T = 0
                   3          T = 10
                   5          T = 35
                   1          T = 100
              Q2 ⇒ 6 7 8 9 10 T = 130
                   6          T = 150
                   9          T = 170




                                        15 / 52
Time-concerned features




  1   (Q) → Average time to first click
  2   (Q, D) → Average time spent reading a document D




                                                         16 / 52
User session example


           Region
              Q1 ⇒ 1 2 3 4 5 T = 0
                   3          T = 10
                   5          T = 35
                   1          T = 100
              Q2 ⇒ 6 7 8 9 10 T = 130
                   6          T = 150
                   9          T = 170




                                        17 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            18 / 52
Two phase extraction



  1   Normalization
        • lookup filtering against the ’Important triples’ set
        • normalization is specific to each feature

  2   Grouping and aggregating




                                                       19 / 52
Important triples




                    20 / 52
Normalization


   Converting click log entries to a relational table with
   the following attributes:
     • feature domain attributes, e.g:
          • (Q,R,U), (Q,U) for document features
          • (Q,R), (Q) for query features
     • feature attribute value
   Sequential processing, session by session
     • reject spam sessions
     • emit values (possibly repeated)




                                                            21 / 52
Normalization example (I)
Click log (SessionID, TimePassed omitted; for C rows the ID is the clicked URLID):
 Action   QueryID/URLID   RegionID   URLs shown
 Q        174             0          1625 1627 1623 2510 2524
 Q        1974            0          2091 17562 1626 1623 1627
 C        17562
 C        1627
 C        1625
 C        2510



Intermediate table for ’Average click position’ feature:
  QueryID URLID RegionID ClickPosition
  1974         17562             0                  1
  1974          1627             0                  2
  174           1625             0                  1
  174           2510             0                  2


                                                                                22 / 52
Normalization example (II)
Click log (SessionID omitted; for C rows the ID is the clicked URLID):
 Time   Action   QueryID/URLID   RegionID   URLs shown
    0   Q        5               0          99 16 87 39
    6   C        84
  120   Q        558             0          84 5043 5041 5039
  125   Q        8768            0          74672 74661 74674 74671
  145   C        74661


Intermediate table for ’Time to first click’ feature:
 QueryID RegionID FirstClickTime
        5           0                 6
     8768           0                20


                                                               23 / 52
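
To make the two phases concrete, the sketch below runs the grouping and
aggregating phase for the ’Average time to first click’ feature on a
hypothetical normalized table like the one above (an extra observation for
query 5 is invented so that the averaging actually does something).

# Phase 1 output (normalization): one row per query occurrence with a click
norm <- data.frame(QueryID = c(5, 8768, 5), RegionID = 0,
                   FirstClickTime = c(6, 20, 10))

# Phase 2 (grouping and aggregating): average within (QueryID, RegionID)
feature <- aggregate(FirstClickTime ~ QueryID + RegionID, data = norm, FUN = mean)
feature   # query 5 -> 8, query 8768 -> 20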
Aggregation example (by triple)




                                  24 / 52
Aggregation example (by QU-pair)




                                   25 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            26 / 52
Our final ML based solution in a nutshell

   Binary classification task for predicting assessors’ labels
   26 features extracted from the click log
   Gradient Boosted Trees learning model (gbm R
   package)
   Tuning the models’ parameters w.r.t. AUC averaged over
   the given query-region pairs
   Ranking URLs according to the best model’s
   probability scores


                                                          27 / 52
Training data




                28 / 52
Training data
Target values




                29 / 52
Training data
Feature values




                 30 / 52
Training data
Missing values




                 31 / 52
Data Analysis Scheme
  1   Given initial training and test sets
  2   Partitioning the initial training set into two sets:
        • training set (3/4)
        • test set (1/4)
  3   Consider the following models:
        • Gradient Boosted Trees (Bernoulli distribution; binomial deviance loss)
        • Gradient Boosted Trees (AdaBoost distribution; exponential loss
           function)
        • Logistic Regression
  4   Learning and tuning parameters w.r.t. the target metric (Area
      under the ROC curve) on the training set using 3-fold
      cross-validation
  5   Obtain the estimates for the target metric on the test set
  6   Choose the optimal model, refit it on the whole initial training
      set and apply it on the initial test set
                                                                        32 / 52
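
A minimal sketch of step 2 with caret, assuming the labeled data sits in a
data frame ’trainAll’ with a RelevanceLabel column (a stratified row-level
split; the team’s actual split may well have kept query-region groups intact).

library(caret)
set.seed(42)                                   # reproducible partition
inTrain  <- createDataPartition(trainAll$RelevanceLabel, p = 3/4, list = FALSE)
trainSet <- trainAll[inTrain, ]                # 3/4 for training and tuning
testSet  <- trainAll[-inTrain, ]               # 1/4 held out for test estimates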
Boosting
[Schapire, 1990]

      Given training set (x1 , y1 ), . . . , (xN , yN ),
      yi ∈ {−1, +1}
      For t = 1, . . . , T
         • construct distribution Dt on {1, . . . , N }
         • sample examples from it concentrating on the “hardest” ones
         • learn a “weak classifier” (at least better than random)

                                          ht : X → {−1, +1}

           with error εt on Dt :

                      εt = Pi∼Dt (ht (xi ) ≠ yi )

      Output the final classifier H as a weighted majority
      vote of ht
                                                                         33 / 52
AdaBoost
[Freund & Schapire, 1997]
      Constructing Dt :
        • D1 (i) = 1/N
        • given Dt and ht :

             Dt+1 (i) = (Dt (i)/Zt ) × e−αt    if yi = ht (xi ),
             Dt+1 (i) = (Dt (i)/Zt ) × e+αt    if yi ≠ ht (xi ),

          where Zt is a normalization factor and

             αt = (1/2) ln((1 − εt )/εt ) > 0

      Final classifier:

             H(x) = sign( Σt αt ht (x) )
                                                                            34 / 52
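
For illustration, a didactic R implementation of these updates with decision
stumps (rpart) as weak learners; this is our sketch of the textbook algorithm,
not the team’s code, and it assumes 0 < εt < 1/2 in every round.

library(rpart)

adaboost <- function(X, y, T = 10) {          # X: data frame, y in {-1, +1}
  N <- nrow(X)
  D <- rep(1 / N, N)                          # D1(i) = 1/N
  stumps <- vector("list", T); alpha <- numeric(T)
  for (t in 1:T) {
    df <- cbind(X, y = factor(y))
    stumps[[t]] <- rpart(y ~ ., data = df, weights = D,
                         control = rpart.control(maxdepth = 1))
    h <- ifelse(predict(stumps[[t]], X, type = "class") == "1", 1, -1)
    eps <- sum(D[h != y])                     # weighted error on Dt
    alpha[t] <- 0.5 * log((1 - eps) / eps)    # αt
    D <- D * exp(-alpha[t] * y * h)           # up-weight mistakes, down-weight hits
    D <- D / sum(D)                           # divide by Zt
  }
  list(stumps = stumps, alpha = alpha)        # H(x) = sign(Σt αt ht(x))
}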
Gradient boosted trees
[Friedman, 2001]



       Stochastic gradient descent optimization of the loss
       function
       Decision trees as the weak classifier
      Do not require feature normalization
      There is no need to handle missing values specifically
      Reported good performance in relevance prediction
      problems [Piwowarski et al., 2009], [Hassan et al.,
      2010] and [Gulin et al., 2011]


                                                          35 / 52
Gradient boosted trees
gbm R package implementation




     There are two available distributions for classification
     tasks: Bernoulli and AdaBoost
     Three basic parameters: interaction depth (depth of
     each tree), number of trees (or iterations) and
     shrinkage (learning rate)




                                                           36 / 52
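
A hedged example of putting these parameters together, mirroring the appendix
code (the data frame and column names are assumptions): fit the model, predict
probabilities, and rank URLs by score as in our problem restatement.

library(gbm)
fit <- gbm(RelevanceLabel ~ ., data = pureTrainSet,
           distribution = "bernoulli",        # or "adaboost"
           n.trees = 500, interaction.depth = 2, shrinkage = 0.01)
p <- predict(fit, newdata = pureTestSet, n.trees = 500, type = "response")
ranking <- testSet[order(p, decreasing = TRUE),
                   c("QueryID", "RegionID", "URLID")]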
Logistic regression
glm, stats R package




       Preprocess the initial training data: impute missing
       values with the help of bagged trees
       Fit the generalized linear model:

             f (x) = 1 / (1 + e−z ),

       where z = β0 + β1 x1 + · · · + βk xk



                                                           37 / 52
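
A sketch of this pipeline under the assumption that the features live in
’trainFeatures’/’testFeatures’ and the labels in ’labels’: caret’s bagged-tree
imputation followed by a binomial glm.

library(caret)
pp <- preProcess(trainFeatures, method = "bagImpute")    # bagged-tree imputation
trainImp <- predict(pp, trainFeatures)
lr <- glm(RelevanceLabel ~ .,
          data = cbind(trainImp, RelevanceLabel = labels),
          family = binomial)                             # logistic regression
phat <- predict(lr, newdata = predict(pp, testFeatures), type = "response")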
Tuning the gbm (Bernoulli) model




3-fold CV estimate of AUC for the optimal parameters: 0.6457435

                                                             38 / 52
Tuning the gbm (AdaBoost) model




3-fold CV estimate of AUC for the optimal parameters: 0.6455384

                                                             39 / 52
Comparative performance of three optimal models
Test error estimates




   Model                 Optimal parameter values   Test estimate of AUC
   gbm (Bernoulli)       interaction.depth=2,            0.6324717
                         n.trees=500,
                         shrinkage=0.01
   gbm (AdaBoost)        interaction.depth=4,            0.6313393
                         n.trees=700,
                         shrinkage=0.01
   logistic regression   –                               0.618648




                                                                       40 / 52
Variable importance according to the best model




                                              41 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            42 / 52
Contest statistics



    101 participants, 84 of them eligible for the prize
    Two-stage evaluation procedure: validation set and test
    set (their sizes were unknown during the contest)
    Validation set size is ≈ 11 000 instances
    Test set size is ≈ 20 000 instances




                                                        43 / 52
Preliminary Results
Validation set

19th place (AUC=0.650004)




                            44 / 52
Final Results
Test set

34th place (AUC=0.643346)

                  #     Team          AUC
                   1    cointegral*   0.667362
                   2    Evlampiy*     0.66506
                   3    alsafr*       0.664527
                   4    alexeigor*    0.663169
                   5    keinorhasen   0.660982
                   6    mmp           0.659914
                   7    Cutter*       0.659452
                   8    S-n-D         0.658103
                  ...   ...           ...
                  34    CLL           0.643346
                  ...   ...           ...


                                                 45 / 52
Acknowledgements



                     We would like to thank:
   the organizers from Yandex for an exciting challenge
   E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues
   from Kazan Federal University for fruitful discussions and support




                                                                 46 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            47 / 52
References I

   [Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic
   generalization of on-line learning and an application to boosting // Journal
   of Computer and System Sciences. – V. 55. – No. 1. – 1997. – P. 119–139.
   [Friedman, 2001] Friedman, J. Greedy Function Approximation: A Gradient
   Boosting Machine // Annals of Statistics. – V. 29. – No. 5. – 2001. – P.
   1189–1232.
   [Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning The
   Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with
   YetiRank // JMLR: Workshop and Conference Proceedings. – 2011. – P.
   63–76.
   [Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG:
   User behavior as a predictor of a successful search // Proceedings of the
   Third ACM International Conference on Web Search and Data Mining. –
   ACM. – 2010. – P. 221–230.


                                                                           48 / 52
References II



    [Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining User
    Web Search Activity with Layered Bayesian Networks or How to Capture a
    Click in its Context // Proceedings of the Second ACM International
    Conference on Web Search and Data Mining. – ACM. – 2009. – P. 162–171.

    [Schapire, 1990] Schapire, R. The strength of weak learnability // Machine
    Learning. – V. 5. – No. 2. – 1990. – P. 197–227.




                                                                           49 / 52
Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions


                            50 / 52
Compute AUC for gbm model
ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)

  # Feature columns only: drop the identifiers and the target label
  pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)

  # AUC is computed per (QueryID, RegionID) group and then averaged
  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    inGroup <- testSet$QueryID == queryId & testSet$RegionID == regionId
    true.labels <- testSet[inGroup, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 || m == 1) {
      # AUC is undefined when a group contains only one class; skip it
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit, pureTestSet[inGroup, ],
                                     n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  # Undefined groups are excluded from the average
  return(mean(aucValues, na.rm = TRUE))
}

                                                                                                       51 / 52
Tuning AUC for gbm model
TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)
  FUN <- match.fun(aucfunction)
  # Grid of ensemble sizes to evaluate
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)

  # Folds over queries; returnTrain = TRUE yields training-set indices
  folds <- createFolds(trainSet$QueryID, foldsNum, list = TRUE, returnTrain = TRUE)
  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData <- trainSet[-inTrain, ]
    pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))

    # Fit once with the maximum number of trees; shorter ensembles are
    # evaluated as prefixes via the n.trees argument at prediction time
    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution,
                  interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)
    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }
  # Average the CV estimates over folds for each ensemble size
  aucvalues <- as.data.frame(aucvalues)
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
  return(avgAuc)
}




                                                                                                        52 / 52
