
Learning to Rank: From Pairwise Approach to Listwise Approach

  1. Learning to Rank: From Pairwise Approach to Listwise Approach
     Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li
     Presented by Hasan Hüseyin Topcu
  2. Outline
     • Related Work
     • Learning System
     • Learning to Rank
     • Pairwise vs. Listwise Approach
     • Experiments
     • Conclusion
  3. Related Work
     • Pairwise approach: the learning task is formalized as classification of object pairs into two categories (correctly ranked and incorrectly ranked).
     • Classification methods:
       • Ranking SVM (Herbrich et al., 1999); Joachims (2002) applied Ranking SVM to information retrieval
       • RankBoost (Freund et al., 1998)
       • RankNet (Burges et al., 2005)
  4. Learning System
  5. Learning System
     • Training data, data preprocessing, ...
     • How are objects identified? How are instances modeled?
     • SVM, ANN, Boosting
     • Evaluate with test data
     Adapted from Pattern Classification (Duda, Hart, Stork)
  6. Ranking
  7. Learning to Rank
  8. Learning to Rank
     • A number of queries are provided.
     • Each query is associated with a perfect ranking list of documents (the ground truth).
     • A ranking function is created from the training data such that the model can precisely predict the ranking list.
     • Learning tries to optimize a loss function. Note that a loss function for ranking is slightly different in the sense that it makes use of sorting. (A sketch of the training data layout follows.)
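A minimal sketch, assuming a simple in-memory layout (the class and field names are illustrative, not from the paper), of the training data the slide describes: each query carries one feature vector and one ground-truth relevance score per document.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryInstance:
    query_id: str
    features: List[List[float]]   # one feature vector per document
    truth_scores: List[float]     # ground-truth relevance per document

# Two toy queries; a learned ranking function should score documents
# so that sorting by score reproduces the truth_scores ordering.
train_set = [
    QueryInstance("q1", [[0.9, 0.1], [0.2, 1.0]], [2.0, 0.0]),
    QueryInstance("q2", [[0.5, 0.5], [0.8, 0.3], [0.1, 0.9]], [1.0, 2.0, 0.0]),
]
```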
  9. Training Process
  10. Data Labeling
      • Explicit human judgment (Perfect, Excellent, Good, Fair, Bad)
      • Implicit relevance judgment: derived from click data (search log data)
      • Ordered pairs between documents (A > B)
      • Lists of judgments (scores)
  11. Features
  12. Pairwise Approach
      • Training data instances are document pairs.
  13. Pairwise Approach
      • Collects document pairs from the ranking list and assigns a label to each pair: +1 if the score of A > the score of B, and -1 if A < B.
      • Formalizes the problem of learning to rank as binary classification.
      • Methods: Ranking SVM, RankBoost, and RankNet. (A sketch of the pair construction follows.)
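A minimal sketch of the pair construction, assuming the feature-difference formulation used by Ranking SVM style methods (function and variable names are illustrative):

```python
import itertools

def to_pairwise_instances(features, scores):
    """Yield (x_j - x_k, label) instances from one query's documents:
    +1 if document j should rank above document k, -1 otherwise."""
    for j, k in itertools.combinations(range(len(features)), 2):
        if scores[j] == scores[k]:
            continue  # ties express no preference, so skip them
        label = 1 if scores[j] > scores[k] else -1
        diff = [a - b for a, b in zip(features[j], features[k])]
        yield diff, label

# Three documents with graded scores yield three labeled pairs.
pairs = list(to_pairwise_instances([[9.0], [2.0], [5.0]], [2.0, 0.0, 1.0]))
print(pairs)  # [([7.0], 1), ([4.0], 1), ([-3.0], -1)]
```

Note how the number of instances grows quadratically with list length, which is the computational drawback the next slide raises.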
  14. Pairwise Approach Drawbacks
      • The objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents.
      • The training process is computationally costly, as the number of document pairs is very large.
  15. Pairwise Approach Drawbacks
      • Treats document pairs equally across different grades (labels). (Ex. 1)
      • The number of generated document pairs varies largely from query to query, which results in a trained model biased toward queries with more document pairs. (Ex. 2)
  16. Listwise Approach
      • Training data instances are document lists.
      • The objective of learning is formalized as minimization of the total losses with respect to the training data.
      • The listwise loss function uses probability models: permutation probability and top-one probability.
      (The slide reproduces a passage from the paper: for a new query, feature vectors are built from its documents, the trained ranking function assigns scores, and the documents are ranked in descending order of score; this is the listwise approach. By contrast, the pairwise approach derives a new training set in which each feature-vector pair forms an instance labeled +1 or -1, turning the problem into binary classification for a model such as an SVM.)
  17. Permutation Probability
      • Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
      • Suppose a ranking function assigns scores sA, sB, and sC to the objects.
      • Permutation probability: the likelihood of a permutation given the scores.
      • P(ABC) > P(CBA) if sA > sB > sC (see the sketch below).
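A runnable sketch of the permutation probability, using phi = exp (the increasing, strictly positive transform ListNet adopts); the scores are assumed example values:

```python
import math

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi) = prod_j [ phi(s_pi(j)) / sum_{k=j..n} phi(s_pi(k)) ]:
    draw the first object from all candidates, the second from the
    remainder, and so on down the permutation."""
    prob = 1.0
    for j in range(len(perm)):
        denom = sum(phi(scores[i]) for i in perm[j:])
        prob *= phi(scores[perm[j]]) / denom
    return prob

scores = [3.0, 2.0, 1.0]  # sA > sB > sC
print(permutation_probability(scores, [0, 1, 2]))  # P(ABC), the largest
print(permutation_probability(scores, [2, 1, 0]))  # P(CBA), the smallest
```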
  18. Top One Probability
      • Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA
      • Suppose a ranking function assigns scores sA, sB, and sC to the objects.
      • The top-one probability of an object is the probability of its being ranked on top, given the scores of all the objects.
      • P(A) = P(ABC) + P(ACB); P(B) = P(BAC) + P(BCA); P(C) = P(CBA) + P(CAB)
      • Notice that calculating the n top-one probabilities this way still requires calculating n! permutation probabilities.
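The paper shows, however, that the top-one probability has a closed form, P(j) = phi(s_j) / sum_k phi(s_k), so the n! enumeration is never needed; with phi = exp this is exactly the softmax of the scores. A sketch:

```python
import math

def top_one_probabilities(scores, phi=math.exp):
    """Closed form: P(j) = phi(s_j) / sum_k phi(s_k)."""
    vals = [phi(s) for s in scores]
    total = sum(vals)
    return [v / total for v in vals]

scores = [3.0, 2.0, 1.0]              # sA > sB > sC (assumed values)
print(top_one_probabilities(scores))  # P(A) > P(B) > P(C)
# Sanity check against the definition: the first value equals
# P(ABC) + P(ACB) computed with the permutation sketch above.
```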
  19. Listwise Loss Function
      • With top-one probabilities, given two lists of scores we can use any metric to represent the distance between the two score lists.
      • For example, when cross entropy is used as the metric, the listwise loss function becomes the cross entropy between the two top-one distributions.
      • Ground truth ABCD vs. ranking output ACBD or ABDC (worked through below).
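A minimal sketch of the cross-entropy listwise loss; the document scores are assumed for illustration, chosen so the ground-truth order is ABCD:

```python
import math

def top_one(scores):
    # top-one probabilities with phi = exp, i.e. the softmax
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_loss(truth_scores, model_scores):
    """L(y, z) = -sum_j P_y(j) * log P_z(j): cross entropy between
    the ground-truth and model top-one distributions."""
    p, q = top_one(truth_scores), top_one(model_scores)
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

truth = [4.0, 3.0, 2.0, 1.0]                       # ABCD
print(listwise_loss(truth, [4.0, 2.0, 3.0, 1.0]))  # ACBD: ~1.10
print(listwise_loss(truth, [4.0, 3.0, 1.0, 2.0]))  # ABDC: ~1.00
```

Swapping B and C, near the top of the list, costs more than swapping C and D further down, which is the behavior a ranking loss should have.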
  20. ListNet
      • Learning method: ListNet.
      • Optimizes the listwise loss function based on top-one probability, using a neural network as the model and gradient descent as the optimization algorithm.
      • A linear network model is used for simplicity: y = wTx + b (see the training sketch below).
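A sketch of one training pass under the slide's linear model (the bias is dropped for brevity, and the data layout is an assumption). For cross entropy over softmax outputs the score gradient is dL/dz_j = q_j - p_j, so the weight gradient per query is sum_j (q_j - p_j) x_j:

```python
import math

def top_one(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_epoch(queries, w, lr=0.01):
    """One gradient-descent pass; `queries` holds
    (feature_vectors, truth_scores) tuples, one per query."""
    for xs, ys in queries:
        z = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]  # w . x
        p, q = top_one(ys), top_one(z)
        for j, x in enumerate(xs):
            g = q[j] - p[j]                 # dL/dz_j
            for i in range(len(w)):
                w[i] -= lr * g * x[i]
    return w

# Toy run: the document with the higher truth score should end up
# with the higher model score.
queries = [([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])]
w = [0.0, 0.0]
for _ in range(100):
    w = listnet_epoch(queries, w, lr=0.5)
print(w)  # first weight driven positive, second negative
```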
  21. ListNet
  22. Ranking Accuracy
      • ListNet vs. RankNet, Ranking SVM, and RankBoost
      • Three datasets: TREC 2003, OHSUMED, and CSearch
        • TREC 2003: relevance judgments (relevant and irrelevant), 20 extracted features
        • OHSUMED: relevance judgments (definitely relevant, positively relevant, and irrelevant), 30 features
        • CSearch: relevance judgments from 4 ("perfect match") to 0 ("bad match"), 600 features
      • Evaluation measures: Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP); NDCG is sketched below.
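A sketch of NDCG@n with one common gain/discount choice, gain 2^rel - 1 and discount 1/log2(position + 1) for 1-based positions; the relevance grades are assumed examples:

```python
import math

def dcg_at_n(rels, n):
    # rels: graded relevance labels in the order the model ranked them
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(rels[:n]))

def ndcg_at_n(rels, n):
    """DCG of the produced ranking, normalized by the DCG of the
    ideal (descending-relevance) ordering, so a perfect list scores 1."""
    ideal = dcg_at_n(sorted(rels, reverse=True), n)
    return dcg_at_n(rels, n) / ideal if ideal else 0.0

# Graded labels 0..4, as in CSearch:
print(ndcg_at_n([4, 3, 2, 1, 0], n=3))  # 1.0, the ideal ordering
print(ndcg_at_n([2, 4, 3, 1, 0], n=3))  # ~0.76, top document misplaced
```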
  23. Experiments
      • NDCG@n on TREC
  24. Experiments
      • NDCG@n on OHSUMED
  25. Experiments
      • NDCG@n on CSearch
  26. Conclusion
      • Discussed:
        • Learning to rank
        • The pairwise approach and its drawbacks
        • The listwise approach, which outperforms the existing pairwise approaches
      • Evaluation of the paper:
        • A linear neural network model is used; what about a non-linear model?
        • The listwise loss function (the choice of probability model) is the key issue.
  27. References
      • Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), Zoubin Ghahramani (Ed.). ACM, New York, NY, USA, 129-136. DOI=10.1145/1273496.1273513 http://doi.acm.org/10.1145/1273496.1273513
      • Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions 94-D(10): 1854-1862 (2011).
      • Hang Li. Learning to Rank. Microsoft Research Asia. ACL-IJCNLP 2009 Tutorial. Aug. 2, 2009. Singapore.
