Learning to Rank (Part 1) NESCAI 2008 Tutorial Yisong Yue Cornell University
Booming Search Industry
Goals for this Tutorial Basics of information retrieval What machine learning contributes New challenges to address New insights on developing ML algorithms
(Soft) Prerequisites Basic knowledge of ML algorithms Support Vector Machines  Neural Nets Decision Trees Boosting Etc… Will introduce IR concepts as needed
Outline (Part 1) Conventional IR Methods (no learning) 1970s to 1990s Ordinal Regression 1994 onwards Optimizing Rank-Based Measures 2005 to present
Outline (Part 2) Effectively collecting training data E.g., interpreting clickthrough data Beyond independent relevance E.g., diversity Summary & Discussion
Disclaimer This talk is very ML-centric Use IR methods to generate features Learn good ranking functions on feature space Focus on optimizing cleanly formulated objectives Outperform traditional IR methods Information Retrieval Broader than the scope of this talk Deals with more sophisticated modeling questions Will see more interplay between IR and ML in Part 2
Brief Overview of IR Predated the internet As We May Think by Vannevar Bush (1945) Active research topic by the 1960s Vector Space Model (1970s) Probabilistic Models (1980s) Introduction to Information Retrieval (2008) C. Manning, P. Raghavan & H. Schütze
Basic Approach to IR Given query q and set of docs d_1, …, d_n Find documents relevant to q Typically expressed as a ranking on d_1, …, d_n Similarity measure sim(a,b) → R Sort by sim(q, d_i) Optimal if the relevance of documents is independent. [Robertson, 1977]
Vector Space Model Represent documents as vectors One dimension for each word Queries as short documents Similarity Measures Cosine similarity = normalized dot product
Cosine Similarity Example
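The cosine-similarity example on the slide is an image; a minimal sketch of the same idea follows, using hypothetical term-frequency vectors (the query is treated as a short document, as the previous slide describes):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-frequency vectors (dicts):
    normalized dot product, as in the vector space model."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical query and documents
query = {"learning": 1, "rank": 1}
doc1 = {"learning": 2, "rank": 1, "svm": 1}
doc2 = {"vector": 1, "space": 1, "model": 1}

print(cosine_similarity(query, doc1))  # high: shares both query terms
print(cosine_similarity(query, doc2))  # 0.0: no terms in common
```

Sorting documents by this score against the query gives the ranking described on the previous slide.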
Other Methods TF-IDF  [Salton & Buckley, 1988] Okapi BM25 [Robertson et al., 1995] Language Models [Ponte & Croft, 1998] [Zhai & Lafferty, 2001]
Machine Learning IR uses fixed models to define similarity scores Many opportunities to learn models Appropriate training data Appropriate learning formulation Will mostly use SVM formulations as examples General insights are applicable to other techniques.
Training Data Supervised learning problem Document/query pairs Embedded in high dimensional feature space Labeled by relevance of doc to query Traditionally 0/1 Recently ordinal classes of relevance (0,1,2,3,…)
Feature Space Use to learn a similarity/compatibility function Based on existing IR methods Can use raw values Or transformations of raw values Based on raw words Capture co-occurrence of words
Training Instances
Learning Problem Given training instances: (x_q,d, y_q,d) for q = {1..N}, d = {1..N_q} Learn a ranking function f(x_q,1, …, x_q,Nq) → Ranking Typically decomposed into per-doc scores f(x) → R (doc/query compatibility) Sort by scores for all instances of a given q
How to Train? Classification & Regression Learn f(x) → R in conventional ways Sort by f(x) for all docs for a query Typically does not work well 2 Major Problems Labels have ordering: additional structure compared to multiclass problems Severe class imbalance: most documents are not relevant
Conventional multiclass learning does not incorporate ordinal structure of class labels Not Relevant Somewhat Relevant Very Relevant
Ordinal Regression Assume class labels are ordered True since class labels indicate level of relevance Learn hypothesis function f(x) → R Such that the ordering of f(x) agrees with label ordering Ex: given instances (x, 1), (y, 1), (z, 2) f(x) < f(z) f(y) < f(z) Don’t care about f(x) vs f(y)
Ordinal Regression Compare with classification Similar to multiclass prediction But classes have ordinal structure Compare with regression Doesn’t necessarily care about value of f(x) Only care that ordering is preserved
Ordinal Regression Approaches Learn multiple thresholds  Learn multiple classifiers Optimize pairwise preferences
Option 1: Multiple Thresholds Maintain T thresholds (b 1 , … b T ) b 1  < b 2  < … < b T Learn model parameters + (b 1 , …, b T ) Goal Model predicts a score on input example Minimize threshold violation of predictions
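Prediction under the multiple-thresholds scheme can be sketched as follows: the model's score is mapped to an ordinal label by counting how many thresholds it exceeds (the threshold values below are made up for illustration, not learned):

```python
import bisect

def predict_ordinal(score, thresholds):
    """Map a real-valued score to an ordinal label 0..T using
    sorted thresholds b_1 < b_2 < ... < b_T."""
    # bisect_right counts how many thresholds the score passes
    return bisect.bisect_right(thresholds, score)

thresholds = [-1.0, 0.5, 2.0]             # T = 3 thresholds -> labels 0..3
print(predict_ordinal(-2.0, thresholds))  # 0: below every threshold
print(predict_ordinal(0.0, thresholds))   # 1: passed b_1 only
print(predict_ordinal(3.0, thresholds))   # 3: passed all thresholds
```

Training then adjusts the model parameters and (b_1, …, b_T) jointly to minimize threshold violations, as in the SVM formulation on the next slides.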
Ordinal SVM Example [Chu & Keerthi, 2005]
Ordinal SVM Formulation Such that for j = 0..T : [Chu & Keerthi, 2005] And also:
Learning Multiple Thresholds Gaussian Processes [Chu & Ghahramani, 2005] Decision Trees [Kramer et al., 2001] Neural Nets RankProp [Caruana et al., 1996] SVMs & Perceptrons PRank [Crammer & Singer, 2001] [Chu & Keerthi, 2005]
Option 2: Voting Classifiers Use T different training sets Classifier 1 predicts 0 vs 1,2,…T Classifier 2 predicts 0,1 vs 2,3,…T … Classifier T predicts 0,1,…,T-1 vs T Final prediction is combination E.g., sum of predictions Recent work McRank [Li et al., 2007] [Qin et al., 2007]
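The "sum of predictions" combination mentioned above can be sketched in a few lines (a toy illustration; real systems such as McRank combine class probabilities rather than hard votes):

```python
def combined_prediction(binary_preds):
    """binary_preds[t] is 1 if classifier t predicts 'label > t', else 0.
    Summing the T votes yields an ordinal label in 0..T."""
    return sum(binary_preds)

# Three classifiers: 'rel > 0?', 'rel > 1?', 'rel > 2?'
print(combined_prediction([1, 1, 0]))  # 2
print(combined_prediction([0, 0, 0]))  # 0
print(combined_prediction([1, 1, 1]))  # 3
```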
Severe class imbalance  Near perfect performance by always predicting 0
Option 3: Pairwise Preferences Most popular approach for IR applications Learn model to minimize pairwise disagreements %(Pairwise Agreements) = ROC-Area
2 pairwise disagreements
Optimizing Pairwise Preferences Consider instances (x 1 ,y 1 ) and (x 2 ,y 2 ) Label order has y 1  > y 2 Create new training instance (x’, +1) where x’ = (x 1  – x 2 ) Repeat for all instance pairs with label order preference
Optimizing Pairwise Preferences Result: new training set! Often represented implicitly Has only positive examples Mispredicting means that a lower-ordered instance received a higher score than a higher-ordered instance.
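The construction of the new training set can be sketched as follows (a toy illustration; the feature vectors and labels are made up):

```python
def pairwise_examples(instances):
    """From (feature_vector, label) pairs, build difference vectors
    x' = x_i - x_j with positive label whenever y_i > y_j."""
    out = []
    for xi, yi in instances:
        for xj, yj in instances:
            if yi > yj:
                diff = [a - b for a, b in zip(xi, xj)]
                out.append((diff, +1))
    return out

# One label-2 doc and two label-1 docs
data = [([1.0, 0.0], 2), ([0.5, 0.5], 1), ([0.0, 1.0], 1)]
pairs = pairwise_examples(data)
print(len(pairs))  # 2: the label-2 doc beats each label-1 doc; ties are skipped
```

A linear model scoring a difference vector positively is exactly a model that ranks the preferred instance higher, which is why the new set needs only positive examples.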
Pairwise SVM Formulation Such that: [Herbrich et al., 1999] Can be reduced to O(n log n) time [Joachims, 2005].
Optimizing Pairwise Preferences Neural Nets RankNet [Burges et al., 2005] Boosting & Hedge-Style Methods [Cohen et al., 1998] RankBoost [Freund et al., 2003] [Long & Servedio, 2007] SVMs [Herbrich et al., 1999] SVM-perf [Joachims, 2005] [Cao et al., 2006]
Rank-Based Measures Pairwise preferences are not quite right They assign equal penalty for errors no matter where in the ranking People (mostly) care about the top of the ranking The IR community uses rank-based measures which capture this property.
Rank-Based Measures Binary relevance Precision@K (P@K) Mean Average Precision (MAP) Mean Reciprocal Rank (MRR) Multiple levels of relevance Normalized Discounted Cumulative Gain (NDCG)
Precision@K Set a rank threshold K Compute % relevant in top K Ignores documents ranked lower than K Ex: Prec@3 of 2/3 Prec@4 of 2/4 Prec@5 of 3/5
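A minimal computation of Precision@K; the relevance sequence below is a hypothetical ranking chosen to reproduce the slide's numbers (2/3, 2/4, 3/5), not the original figure:

```python
def precision_at_k(relevance, k):
    """Fraction of relevant documents in the top k of a ranking.
    `relevance` is a 0/1 list in rank order."""
    return sum(relevance[:k]) / k

ranking = [1, 0, 1, 0, 1]          # hypothetical: relevant at ranks 1, 3, 5
print(precision_at_k(ranking, 3))  # 2/3
print(precision_at_k(ranking, 4))  # 0.5
print(precision_at_k(ranking, 5))  # 0.6
```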
Mean Average Precision Consider the rank position of each relevant doc K_1, K_2, … K_R Compute Precision@K for each K_1, K_2, … K_R Average Precision = average of those P@K values Ex: has AvgPrec of MAP is Average Precision averaged across multiple queries/rankings
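The Average Precision computation above can be sketched directly from its definition (the example ranking is hypothetical, reusing the one from the Precision@K sketch):

```python
def average_precision(relevance):
    """Mean of Precision@K over the positions K of the relevant docs."""
    precisions = []
    hits = 0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # Precision@K at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant docs at ranks 1, 3, 5 -> P@1, P@3, P@5 averaged
print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 ~ 0.756
```

MAP is then simply the mean of this quantity over the queries in a test collection.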
Mean Reciprocal Rank Consider the rank position, K, of the first relevant doc Reciprocal Rank score = 1/K MRR is the mean RR across multiple queries
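A sketch of RR and MRR (the per-query rankings are made up for illustration):

```python
def reciprocal_rank(relevance):
    """1/K where K is the rank of the first relevant doc (0 if none)."""
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

# MRR: mean reciprocal rank over several queries' rankings
rankings = [[0, 1, 0], [1, 0, 0], [0, 0, 0, 1]]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(mrr)  # (1/2 + 1 + 1/4) / 3
```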
NDCG Normalized Discounted Cumulative Gain Multiple levels of relevance DCG: contribution of ith rank position: (2^rel_i – 1) / log_2(1 + i) Ex: has DCG score of NDCG is DCG normalized by the DCG of the best possible ranking, so the best ranking has NDCG = 1
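A sketch of DCG and NDCG using the common (2^rel – 1) / log2(1 + i) gain-and-discount convention (the convention matches the NDCG = 0.63 value in the discontinuity example two slides down, though other discounts exist in the literature):

```python
import math

def dcg(relevance):
    """Discounted cumulative gain of a ranking given graded relevance."""
    return sum((2 ** rel - 1) / math.log2(1 + i)
               for i, rel in enumerate(relevance, start=1))

def ndcg(relevance):
    """DCG normalized by the DCG of the best possible ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # 1.0: ideal ordering
print(ndcg([3, 2, 0, 1]))  # near 1: mistake at the bottom costs little
print(ndcg([0, 1, 2, 3]))  # much lower: relevant docs ranked last
```

Note how the log discount makes errors near the top of the ranking far more costly than errors near the bottom.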
Optimizing Rank-Based Measures Let’s directly optimize these measures As opposed to some proxy (pairwise prefs) But… Objective function no longer decomposes Pairwise prefs decomposed into each pair Objective function flat or discontinuous
Discontinuity Example NDCG = 0.63
Doc:             D1   D2   D3
Retrieval Score: 0.9  0.6  0.3
Rank:            1    2    3
Relevance:       0    1    0
Discontinuity Example NDCG computed using rank positions Ranking via retrieval scores Slight changes to model parameters → Slight changes to retrieval scores → No change to ranking → No change to NDCG NDCG discontinuous w.r.t. model parameters!
Doc:             D1   D2   D3
Retrieval Score: 0.9  0.6  0.3
Rank:            1    2    3
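The discontinuity is easy to verify numerically: perturb the retrieval scores slightly and NDCG does not move at all, then flip the ranking and it jumps. A sketch, assuming the common (2^rel – 1) / log2(1 + position) gain convention:

```python
import math

def ndcg_from_scores(scores, relevance):
    """Rank documents by score, then compute NDCG of the induced ordering."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranked_rel = [relevance[i] for i in order]
    dcg = sum((2 ** r - 1) / math.log2(1 + pos)
              for pos, r in enumerate(ranked_rel, start=1))
    ideal = sum((2 ** r - 1) / math.log2(1 + pos)
                for pos, r in enumerate(sorted(relevance, reverse=True), start=1))
    return dcg / ideal

relevance = [0, 1, 0]  # the slide's example: only D2 is relevant
print(ndcg_from_scores([0.9, 0.6, 0.3], relevance))    # ~0.63, as on the slide
print(ndcg_from_scores([0.89, 0.61, 0.31], relevance)) # identical: same ranking
print(ndcg_from_scores([0.5, 0.9, 0.3], relevance))    # 1.0: ranking flipped
```

Between the two regimes NDCG is flat, so gradient-based learning gets no signal until the ranking suddenly changes.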
[Yue & Burges, 2007]
Optimizing Rank-Based Measures Relaxed Upper Bound Structural SVMs for hinge loss relaxation SVM-map [Yue et al., 2007]  [Chapelle et al., 2007] Boosting for exponential loss relaxation [Zheng et al., 2007] AdaRank [Xu et al., 2007] Smooth Approximations for Gradient Descent LambdaRank [Burges et al., 2006] SoftRank GP [Snelson & Guiver, 2007]
Structural SVMs Let x denote the set of documents/query examples for a query Let y denote a (weak) ranking Same objective function: Constraints are defined for each incorrect labeling y’ over the set of documents x. After learning w, a prediction is made by sorting on w^T x_i [Tsochantaridis et al., 2005]
Structural SVMs for MAP Maximize subject to where (y_ij = {-1, +1}) and Sum of slacks upper bounds MAP loss. [Yue et al., 2007]
Too Many Constraints! For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g., An incorrect labeling would be any other ranking, e.g., This ranking has Average Precision of about 0.8, with Δ(y, y’) ≈ 0.2 Intractable number of rankings, thus an intractable number of constraints!
Structural SVM Training
STEP 1: Solve the SVM objective function using only the current working set of constraints.
STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints.
STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant, add it to the working set.
Repeat STEPs 1-3 until no additional constraints are added. Return the most recent model trained in STEP 1. STEPs 1-3 are guaranteed to loop for at most a polynomial number of iterations. [Tsochantaridis et al., 2005]
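The three steps above can be sketched as a generic working-set loop. Everything problem-specific is abstracted into three hypothetical callables (`solve_qp`, `most_violated`, `violation`); the toy instance below just illustrates the control flow, not a real structural SVM:

```python
def cutting_plane_train(solve_qp, most_violated, violation, eps, max_iter=1000):
    """Generic working-set loop over an intractably large constraint set.
    The three callables are hypothetical problem-specific helpers."""
    working_set = []
    model = solve_qp(working_set)          # STEP 1 on the (empty) working set
    for _ in range(max_iter):
        c = most_violated(model)           # STEP 2: search the full constraint set
        worst_in_set = max((violation(model, w) for w in working_set), default=0.0)
        if violation(model, c) <= worst_in_set + eps:
            break                          # STEP 3 test fails: converged
        working_set.append(c)
        model = solve_qp(working_set)      # STEP 1: re-solve on the enlarged set
    return model

# Toy instance: constraints are numbers, "violation" is how far the model
# value falls below a constraint, and the solver just meets the largest
# constraint added so far.
pool = [1.0, 3.0, 5.0]
model = cutting_plane_train(
    solve_qp=lambda ws: max(ws, default=0.0),
    most_violated=lambda m: max(pool, key=lambda c: c - m),
    violation=lambda m, c: max(0.0, c - m),
    eps=0.01)
print(model)  # 5.0: only the single binding constraint was ever added
```

The key point, as the next slides illustrate, is that most constraints are dominated by a few important ones, so only a small working set is ever materialized.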
Illustrative Example Original SVM Problem Exponential constraints Most are dominated by a small set of “important” constraints Structural SVM Approach Repeatedly finds the next most violated constraint… … until set of constraints is a good approximation.
Finding Most Violated Constraint Required for structural SVM training Depends on structure of loss function Depends on structure of joint discriminant Efficient algorithms exist despite intractable number of constraints. More than one approach [Yue et al., 2007] [Chapelle et al., 2007]
Gradient Descent Objective function is discontinuous Difficult to define a smooth global approximation Upper-bound relaxations (e.g., SVMs, Boosting) sometimes too loose. We only need the gradient! But objective is discontinuous… …  so gradient is undefined Solution: smooth approximation of the gradient Local approximation
LambdaRank Assume an implicit objective function C Goal: compute dC/ds_i, where s_i = f(x_i) denotes the score of document x_i Given the gradient on document scores, use the chain rule to compute the gradient on the model parameters (of f) [Burges et al., 2006]
Intuition: Rank-based measures emphasize the top of the ranking Higher-ranked docs should have larger derivatives (red arrows) Optimizing pairwise preferences emphasizes the bottom of the ranking (black arrows) [Burges, 2007]
LambdaRank for NDCG The pairwise derivative of pair i,j is Total derivative of output s i  is
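The formulas on this slide are images; the sketch below uses the commonly cited form of the NDCG lambdas, a RankNet-style pairwise term scaled by the NDCG change from swapping the pair, with the per-document lambda summed over pairs. This is an assumption about the exact expression, not a transcription of the slide:

```python
import math

def ndcg(relevance, scores):
    """NDCG of the ranking induced by the scores (common gain convention)."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    dcg = sum((2 ** relevance[k] - 1) / math.log2(1 + p)
              for p, k in enumerate(order, start=1))
    ideal = sum((2 ** r - 1) / math.log2(1 + p)
                for p, r in enumerate(sorted(relevance, reverse=True), start=1))
    return dcg / ideal

def lambdas(relevance, scores):
    """Per-document gradients: each pair with rel_i > rel_j contributes a
    RankNet-style term scaled by |delta NDCG| from swapping the two docs."""
    lam = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] > relevance[j]:
                swapped = list(scores)
                swapped[i], swapped[j] = swapped[j], swapped[i]
                delta = abs(ndcg(relevance, scores) - ndcg(relevance, swapped))
                l_ij = delta / (1.0 + math.exp(scores[i] - scores[j]))
                lam[i] += l_ij   # push the more relevant doc up
                lam[j] -= l_ij   # and the less relevant doc down
    return lam

lam = lambdas([0, 1, 0], [0.9, 0.6, 0.3])
print(lam)  # the relevant doc (ranked 2nd) gets the only positive lambda
```

Note the lambdas sum to zero over a query: what pushes one document up pushes another down, matching the arrow picture on the previous slide.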
Properties of LambdaRank¹ There exists a cost function C if the lambdas are consistent, i.e., dλ_i/ds_j = dλ_j/ds_i This amounts to the Hessian of C being symmetric If the Hessian is also positive semi-definite, then C is convex ¹ Subject to additional assumptions – see [Burges et al., 2006]
Summary (Part 1) Machine learning is a powerful tool for designing information retrieval models Requires clean formulation of objective Advances Ordinal regression Dealing with severe class imbalances Optimizing rank-based measures via relaxations Gradient descent on non-smooth objective functions

Editor's Notes

  • #10 References: V. Bush. “As We May Think.” The Atlantic Monthly, July 1945. C. Manning, P. Raghavan, H. Schütze. “Introduction to Information Retrieval.” Cambridge University Press, 2008.
  • #11 References: S. Robertson. “The Probability Ranking Principle in IR.” Journal of Documentation, 33(4), 294-304, December 1977.
  • #12 References: S. Robertson. “The Probability Ranking Principle in IR.” Journal of Documentation, 33(4), 294-304, December 1977.
  • #15 References: G. Salton, C. Buckley. “Term-Weighting Approaches in Automatic Text Retrieval.” Information Processing and Management, 24(5), 513-523, 1988. S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford. “Okapi at TREC-3.” In Proceedings of TREC-3, 1995. J. Ponte, W. B. Croft. “A Language Modeling Approach to Information Retrieval.” In Proceedings of SIGIR 1998. C. Zhai, J. Lafferty. “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval.” In Proceedings of SIGIR 2001.
  • #28 References: W. Chu, S. Keerthi. “New Approaches to Support Vector Ordinal Regression.” In Proceedings of ICML 2005. (http://www.gatsby.ucl.ac.uk/%7Echuwei/svor.htm)
  • #29 References: W. Chu, S. Keerthi. “New Approaches to Support Vector Ordinal Regression.” In Proceedings of ICML 2005. (http://www.gatsby.ucl.ac.uk/%7Echuwei/svor.htm)
  • #30 References: W. Chu, Z. Ghahramani. “Gaussian Processes for Ordinal Regression.” Journal of Machine Learning Research, 6, 1019-1041, 2005 S. Kramer, G. Widmer, B. Pfahringer, M. De Groeve. “Prediction of Ordinal Classes Using Regression Trees.” Fundamenta Informaticae, XXI, 1001-1013, 2001 R. Caruana, S. Baluja, T. Mitchell. “Using the Future to `Sort Out’ the Present: Rankprop and Multitask Learning for Medical Risk Evaluation.” In Proceedings of NIPS 1996. K. Crammer, Y. Singer. “PRanking with Ranking.” In Proceedings of NIPS 2001. W. Chu, S. Keerthi. “New Approaches to Support Vector Ordinal Regression.” In Proceedings of ICML 2005. (http://www.gatsby.ucl.ac.uk/%7Echuwei/svor.htm)
  • #31 References: P. Li, C. Burges, Q. Wu. “Learning to Rank Using Classification and Gradient Boosting.” In Proceedings of NIPS 2007 T. Qin, X. Zhang, D. Wang, T. Liu, W. Lai, H. Li. “Ranking with Multiple Hyperplanes.” In Proceedings of SIGIR 2007.
  • #38 References: R. Herbrich, T. Graepel, K. Obermayer. “Support Vector Learning for Ordinal Regression.” In Proceedings of ICANN 1999. T. Joachims, “A Support Vector Method for Multivariate Performance Measures.” In Proceedings of ICML 2005. (http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html)
  • #39 References: C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender. “Learning to Rank Using Gradient Descent.” In Proceedings of ICML 2005. W. Cohen, R. Schapire, Y. Singer. “Learning to Order Things.” In Proceedings of NIPS 1998. Y. Freund, R. Iyer, R. Schapire, Y. Singer. “An Efficient Boosting Algorithm for Combining Preferences.” Journal of Machine Learning Research, 4, 933-969, 2003. P. Long, R. Servedio. “Boosting the Area Under the ROC Curve.” In Proceedings of NIPS 2007. R. Herbrich, T. Graepel, K. Obermayer. “Support Vector Learning for Ordinal Regression.” In Proceedings of ICANN 1999. T. Joachims, “A Support Vector Method for Multivariate Performance Measures.” In Proceedings of ICML 2005. (http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html) Y. Cao, J. Xu, T. Liu, H. Li, Y. Huang, H. Hon. “Adapting Ranking SVM to Document Retrieval.” In Proceedings of SIGIR 2006.
  • #51 References: Y. Yue, C. Burges. “On Using Simultaneous Perturbation Stochastic Approximation on Learning to Rank; and, the Empirical Optimality of LambdaRank.” Microsoft Research Technical Report, MSR-TR-2007-115, 2007.
  • #52 References: Y. Yue, T. Finley, F. Radlinski, T. Joachims. “A Support Vector Method for Optimizing Average Precision.” In Proceedings of SIGIR 2007. (http://projects.yisongyue.com/svmmap/) O. Chapelle, Q. Le, A. Smola. “Large Margin Optimization of Ranking Measures.” NIPS Workshop on Machine Learning for Web Search, 2007. Z. Zheng, H. Zha, T. Zhang, O. Chapelle, K. Chen, G. Sun. “A General Boosting Method and its Application to Learning Ranking Functions for Web Search.” In Proceedings of NIPS 2007. J. Xu and H. Li, “AdaRank: A Boosting Algorithm for Information Retrieval.” In Proceedings of SIGIR 2007. C. Burges, R. Ragno, Q. Le. “Learning to Rank with Non-Smooth Cost Functions.” In Proceedings of NIPS 2006. E. Snelson, J. Guiver. “SoftRank with Gaussian Processes.” NIPS Workshop on Machine Learning for Web Search, 2007.
  • #53 References: I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. “Large Margin Methods for Structured and Interdependent Output Variables.” Journal of Machine Learning Research, 6, 1453-1484, 2005.
  • #54 References: Y. Yue, T. Finley, F. Radlinski, T. Joachims. “A Support Vector Method for Optimizing Average Precision.” In Proceedings of SIGIR 2007. (http://projects.yisongyue.com/svmmap/)
  • #56 References: I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. “Large Margin Methods for Structured and Interdependent Output Variables.” Journal of Machine Learning Research, 6, 1453-1484, 2005.
  • #61 References: Y. Yue, T. Finley, F. Radlinski, T. Joachims. “A Support Vector Method for Optimizing Average Precision.” In Proceedings of SIGIR 2007. (http://projects.yisongyue.com/svmmap/) O. Chapelle, Q. Le, A. Smola. “Large Margin Optimization of Ranking Measures.” NIPS Workshop on Machine Learning for Web Search, 2007.
  • #63 References: C. Burges, R. Ragno, Q. Le. “Learning to Rank with Non-Smooth Cost Functions.” In Proceedings of NIPS 2006
  • #64 References: C. Burges. “Learning to Rank for Web Search: Some New Directions.” Keynote Speech, SIGIR 2007 Workshop on Learning to Rank for Information Retrieval.
  • #66 References: C. Burges, R. Ragno, Q. Le. “Learning to Rank with Non-Smooth Cost Functions.” In Proceedings of NIPS 2006