Diversity


  1. Relevance in Diversity (Sumit Bhatia)
  2. Outline
     • What is Diversity?
     • Some Approaches for Diversification
       • MMR (SIGIR '98)
       • Multi-Armed Bandits (ICML '08)
       • Portfolio Theory (SIGIR '09)
       • Modeling User Intent (WSDM '09)
       • Query Reformulation (WWW '10)
     • Evaluation Metrics and Datasets
     • Conclusions
     (Slides dated 2/23/2010)
  3. Motivation
     • Basic premise of all IR models that we have discussed
     • Is it always true?
       • Similar documents
       • Ambiguous queries
       • Polysemy: JAVA
       • Insufficient information
       • Synonymy: Bombay or Mumbai
     • Key idea: diversify search results
  4. What is Diversity?
     • Extrinsic diversity
       • Uncertainty about the information need given a query
       • JAVA, JAGUAR
       • Query = "page rank": who is my audience?
       • A layman or an IR researcher
     • Intrinsic diversity
       • Avoiding redundancy in search results
       • Encouraging novelty
       • Examples: literature survey, opinion mining, product reviews
  5. Some Approaches for Diversification
  6. MMR (Carbonell and Goldstein, SIGIR '98)
     • Combines query relevance and information novelty
     • Used in summarization and IR ranking
     • Instead of ranking by relevance, rank by "relevant novelty"
     • The slide's formula has two labeled terms: Relevance and Novelty
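
The "relevant novelty" idea can be sketched as a greedy reranker. This is a minimal illustration, not the authors' code: it assumes the commonly stated MMR criterion, lam * Sim(d, q) - (1 - lam) * max over already selected d' of Sim(d, d'), and the names `mmr_rerank`, `query_sim`, `doc_sim` and `lam` are hypothetical. Similarity scores are assumed to be precomputed.

```python
from typing import List, Optional, Sequence


def mmr_rerank(
    query_sim: Sequence[float],
    doc_sim: Sequence[Sequence[float]],
    lam: float = 0.5,
    k: Optional[int] = None,
) -> List[int]:
    """Greedy MMR reranking over document indices.

    query_sim[i] is the similarity of document i to the query;
    doc_sim[i][j] is the similarity between documents i and j.
    lam = 1.0 reduces to plain relevance ranking.
    """
    n = len(query_sim)
    k = n if k is None else min(k, n)
    selected: List[int] = []
    remaining = list(range(n))

    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the similarity to the closest already selected document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

With two near-duplicate documents and one distinct document, a balanced `lam` pushes the duplicate down the ranking, while `lam = 1.0` keeps the pure relevance order.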
  7. Portfolio Theory (Wang and Zhu, SIGIR '09)
     • The Probability Ranking Principle (PRP) optimizes the mean of the relevance scores of the top-k search results
     • Ideas from portfolio theory in finance:
       • To maximize profit, include some "risky" stocks in your portfolio
       • To maximize expected relevance, include some "risky" (less relevant) documents
     • Maximize the mean of the relevance scores for a given risk (variance)
     • Risk taking encourages diversity
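
The mean-variance trade-off can be illustrated with a simplified greedy selection. This is a sketch under stated assumptions, not the method from the paper: rank positions are weighted equally, the relevance means and covariances are assumed to be given, and `portfolio_rank`, `mean`, `cov` and the risk parameter `b` are hypothetical names. Each step picks the document with the best mean minus `b` times the variance it adds to the list selected so far.

```python
from typing import List, Optional, Sequence


def portfolio_rank(
    mean: Sequence[float],
    cov: Sequence[Sequence[float]],
    b: float = 1.0,
    k: Optional[int] = None,
) -> List[int]:
    """Greedy mean-variance ranking over document indices.

    mean[i] is the expected relevance of document i and cov[i][j] the
    covariance of the relevance scores of documents i and j.
    b > 0 penalizes added variance; b = 0 ranks by expected relevance only.
    """
    n = len(mean)
    k = n if k is None else min(k, n)
    selected: List[int] = []
    remaining = list(range(n))

    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Variance added by i: its own variance plus twice its
            # covariance with every document already in the list.
            added_risk = cov[i][i] + 2.0 * sum(cov[i][j] for j in selected)
            return mean[i] - b * added_risk

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

In this simplified form, a document that is negatively correlated with the ones already chosen adds little or no variance, so it can outrank a more relevant but highly correlated document.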
  8. A Learning To Rank Approach
     • Learn a ranking function
     • Manually labeled training data
     • Optimize IR performance metrics
     • Deploy the function in a live search engine
  9. A Learning To Rank Approach
     • Learn a ranking function
     • Manually labeled training data; alternative: online learning using usage data
     • Optimize IR performance metrics
     • Deploy the function in a live search engine
     • Use clickthrough data to maximize the probability that a new user will find at least one relevant document high up in the ranking
  10. Problem Setup
     • Goal: produce an optimally diverse ranking of documents for one fixed query
     • Given a population of users
     • Each user u_i has an associated set of relevant documents A_i
     • Different users have different relevant sets, based on their interpretation of the query
     • The user is presented with an ordered list of documents
     • User u_t clicks on document i with probability p_ti
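
The setup above can be simulated with one bandit per rank position, in the spirit of the ranked-bandits idea from the ICML '08 paper. This is only a sketch under simplifying assumptions: it uses an epsilon-greedy bandit rather than the bandit algorithms analyzed in the paper, clicks are deterministic (a user clicks the first displayed document in their relevant set, so p_ti is 0 or 1), and all names (`EpsilonGreedyBandit`, `ranked_bandits`, `users`) are hypothetical.

```python
import random
from typing import List, Set


class EpsilonGreedyBandit:
    """A basic epsilon-greedy bandit over document indices."""

    def __init__(self, n_arms: int, epsilon: float, rng: random.Random) -> None:
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.epsilon = epsilon
        self.rng = rng

    def select(self) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        best = max(self.values)
        return self.rng.choice([a for a, v in enumerate(self.values) if v == best])

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def ranked_bandits(n_docs: int, k: int, users: List[Set[int]],
                   rounds: int, epsilon: float = 0.1, seed: int = 0) -> List[int]:
    """Learn a k-document ranking from simulated clicks, one bandit per rank."""
    if k > n_docs:
        raise ValueError("k must not exceed the number of documents")
    rng = random.Random(seed)
    bandits = [EpsilonGreedyBandit(n_docs, epsilon, rng) for _ in range(k)]

    for _ in range(rounds):
        ranking: List[int] = []
        chosen: List[int] = []
        for bandit in bandits:
            arm = bandit.select()
            chosen.append(arm)
            if arm in ranking:
                # Duplicate pick: show an arbitrary document not yet displayed.
                arm = rng.choice([d for d in range(n_docs) if d not in ranking])
            ranking.append(arm)

        relevant = rng.choice(users)  # sample one user from the population
        clicked = next((r for r, d in enumerate(ranking) if d in relevant), None)

        for r, bandit in enumerate(bandits):
            # Reward a bandit only if its own pick was displayed and clicked.
            hit = r == clicked and chosen[r] == ranking[r]
            bandit.update(chosen[r], 1.0 if hit else 0.0)

    final: List[int] = []
    for bandit in bandits:
        order = sorted(range(n_docs), key=lambda d: -bandit.values[d])
        final.append(next(d for d in order if d not in final))
    return final
```

Because a lower rank is rewarded only when the ranks above it fail to draw a click, later bandits tend to learn documents that serve users the earlier positions miss.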
  11. [Slide with no captured text]
  12. [Slide with no captured text]
  13. Modeling User Intent (Agrawal et al., WSDM 2009)
     • Assume a taxonomy of information
     • User intents are modeled as topics
     • Documents and queries may belong to more than one topic
     • The distribution of user intents over topics/categories is available (search engine logs)
     • The objective function trades off relevance and diversity
       • Relevance: standard IR methods
       • Diversity: categories in the taxonomy
  14. Objective Function
     • C(d) = set of categories for document d
     • C(q) = set of categories for query q
     • The distribution P(c|q) is known
     • V(d|q,c) = quality value of document d for query q, when the intended category is c
     • Objective labeled on the slide: average satisfaction probability
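
The quantities defined on this slide can be combined into the set-level objective from the WSDM 2009 paper, P(S|q) = sum over c of P(c|q) * (1 - product over d in S of (1 - V(d|q,c))). The sketch below evaluates it for a small example. The function name `satisfaction_probability` and the dictionary layout are assumptions made for illustration, and the numbers are invented.

```python
from typing import Dict, Iterable


def satisfaction_probability(
    selected: Iterable[str],
    intent_probs: Dict[str, float],
    quality: Dict[str, Dict[str, float]],
) -> float:
    """Probability that at least one selected document satisfies the average user.

    intent_probs[c] plays the role of P(c|q);
    quality[d][c] plays the role of V(d|q,c), taken as 0 when c is not in C(d).
    """
    docs = list(selected)
    total = 0.0
    for category, p_c in intent_probs.items():
        # Probability that every selected document fails for this category.
        all_fail = 1.0
        for d in docs:
            all_fail *= 1.0 - quality.get(d, {}).get(category, 0.0)
        total += p_c * (1.0 - all_fail)
    return total
```

In the example used for testing, two strong documents for the dominant category score lower than one document per category, which is the diminishing-returns behavior the objective is designed to capture.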
  15. Observations
     • Diversify(k) is NP-hard
     • P(S|q) is a submodular function
     • Idea: a greedy approximate solution
     • U(c|q,S) = probability that q belongs to c, given that all documents in S fail to satisfy the user
     • R(q) = set of original relevant documents
  16. Greedy Algorithm
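
The algorithm itself did not survive in the transcript. A minimal sketch of a greedy selection consistent with the definitions on the previous slides is given here: start with U(c|q,S) = P(c|q), repeatedly pick the document with the largest marginal gain (the sum over its categories of U(c|q,S) * V(d|q,c)), then discount U for the categories that document covers. The names `greedy_diversify`, `candidates` and `quality` are hypothetical.

```python
from typing import Dict, List


def greedy_diversify(
    candidates: List[str],
    intent_probs: Dict[str, float],
    quality: Dict[str, Dict[str, float]],
    k: int,
) -> List[str]:
    """Greedily select up to k documents from candidates (the set R(q)).

    intent_probs[c] plays the role of P(c|q);
    quality[d][c] plays the role of V(d|q,c).
    """
    unsatisfied = dict(intent_probs)  # U(c|q,S) for the empty set S
    remaining = list(candidates)
    selected: List[str] = []

    while remaining and len(selected) < k:

        def marginal_gain(d: str) -> float:
            return sum(unsatisfied.get(c, 0.0) * v
                       for c, v in quality.get(d, {}).items())

        best = max(remaining, key=marginal_gain)
        selected.append(best)
        remaining.remove(best)

        # Categories covered by the chosen document become less urgent.
        for c, v in quality.get(best, {}).items():
            if c in unsatisfied:
                unsatisfied[c] *= 1.0 - v

    return selected
```

Submodularity of P(S|q) is what makes this kind of greedy selection a reasonable approximation to the NP-hard exact problem.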
  17. Query Reformulation (Santos et al., WWW 2010)
     • Probabilistic framework for diversification
     • Models the information need of an ambiguous query (or user intent) as a set of sub-queries
     • Ranks documents according to a mixture model
     • The slide's formula has two labeled terms: Relevance and Novelty
  18. Estimating Probabilities
     • [Symbols not captured]: use standard probabilistic models, e.g., language models (LM)
     • Queries suggested by search engines
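
A mixture of a relevance term and a novelty term over sub-queries can be sketched as follows. The scoring used here, (1 - lam) * P(d|q) + lam * sum over sub-queries q_i of P(q_i|q) * P(d|q_i) * product over selected d_j of (1 - P(d_j|q_i)), follows the form published for this framework, but the code is only an illustration: `subquery_rerank`, `relevance`, `importance` and `coverage` are hypothetical names, and all probabilities are assumed to be estimated beforehand.

```python
from typing import Dict, List


def subquery_rerank(
    relevance: Dict[str, float],
    importance: Dict[str, float],
    coverage: Dict[str, Dict[str, float]],
    lam: float = 0.5,
) -> List[str]:
    """Greedy reranking by a relevance/novelty mixture.

    relevance[d] stands for P(d|q), importance[s] for P(q_i|q),
    and coverage[d][s] for P(d|q_i).
    """
    # For each sub-query, the probability that no selected document covers it yet.
    uncovered = {s: 1.0 for s in importance}
    remaining = list(relevance)
    selected: List[str] = []

    while remaining:

        def score(d: str) -> float:
            novelty = sum(
                importance[s] * coverage.get(d, {}).get(s, 0.0) * uncovered[s]
                for s in importance
            )
            return (1.0 - lam) * relevance[d] + lam * novelty

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for s in importance:
            uncovered[s] *= 1.0 - coverage.get(best, {}).get(s, 0.0)

    return selected
```

Setting `lam = 0` recovers the original relevance ranking, which makes it easy to compare diversified and undiversified output on the same input.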
  19. Evaluation Metrics
  20. Classical Measures
     • We know about precision and recall
  21. Nugget-Based Measures
     • Nugget: any binary property, such as "a piece of information" or a topic/category
     • Model the user's information need and the document's information content as sets of nuggets
     • A document is relevant if it contains at least one nugget from u
     • J(d,i) = 1 if d contains information nugget n_i; 0 otherwise
     • [Formula not captured] if J(d,i) = 1; 0 otherwise
  22. Nugget-Based Measures
     • Gain at rank k
     • Discounted Cumulative Gain (DCG)
     • Normalized Discounted Cumulative Gain (nDCG)
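
The slide formulas are missing from the transcript. As an illustration, the sketch below computes a novelty-aware discounted cumulative gain in the style of the nugget-based measure of Clarke et al. (SIGIR 2008): each nugget in the document at rank k contributes (1 - alpha) raised to the number of earlier documents that already contained it, and gains are discounted by log2(1 + k). The function name `alpha_dcg` and the input layout are assumptions. Normalization by an ideal ranking is left out.

```python
import math
from typing import Dict, List, Optional, Set


def alpha_dcg(
    ranking: List[str],
    nuggets: Dict[str, Set[str]],
    alpha: float = 0.5,
    k: Optional[int] = None,
) -> float:
    """Novelty-aware DCG of a ranking.

    nuggets[d] is the set of nugget ids judged present in document d,
    i.e. the nuggets i with J(d, i) = 1.
    """
    depth = len(ranking) if k is None else min(k, len(ranking))
    seen: Dict[str, int] = {}  # how often each nugget appeared at earlier ranks
    total = 0.0

    for rank, doc in enumerate(ranking[:depth], start=1):
        doc_nuggets = nuggets.get(doc, set())
        # Repeated nuggets are worth less each time they reappear.
        gain = sum((1.0 - alpha) ** seen.get(n, 0) for n in doc_nuggets)
        for n in doc_nuggets:
            seen[n] = seen.get(n, 0) + 1
        total += gain / math.log2(1 + rank)

    return total
```

With `alpha = 0` the measure ignores redundancy and behaves like an ordinary DCG over nugget counts, so the parameter controls how strongly repeated information is penalized.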
  23. Intent-Aware Measures
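
Only the slide title survives here. Intent-aware measures, as introduced by Agrawal et al. (WSDM 2009), take a classical metric computed separately for each intent and average it under the intent distribution P(c|q). The sketch below does this for precision at k. The names `precision_at_k` and `intent_aware` and the example data layout are hypothetical.

```python
from typing import Callable, Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def intent_aware(
    metric: Callable[[List[str], Set[str], int], float],
    ranking: List[str],
    intent_probs: Dict[str, float],
    relevant_by_intent: Dict[str, Set[str]],
    k: int,
) -> float:
    """Expected value of a per-intent metric under P(c|q)."""
    return sum(
        p_c * metric(ranking, relevant_by_intent.get(c, set()), k)
        for c, p_c in intent_probs.items()
    )
```

Any metric with the same call shape (for example an nDCG implementation) can be passed in place of `precision_at_k`.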
  24. Dataset
     • ClueWeb09 dataset
     • http://boston.lti.cs.cmu.edu/Data/clueweb09/
     • Used in TREC 2009
     • 1,040,809,705 web pages, in 10 languages
     • 5 TB compressed (25 TB uncompressed)
     • Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
     • Total outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
  25. Concluding Remarks
  26. Concluding Remarks
     • Diversification has the potential to solve many IR problems
       • Query ambiguity
       • Polysemy
       • Synonymy
       • Variations in user intents
       • Variations in user requirements
  27. References
  28.
     • Jaime G. Carbonell and Jade Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998, pp. 335-336.
     • J. Wang and J. Zhu. Portfolio Theory of Information Retrieval. SIGIR 2009.
     • Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning Diverse Rankings with Multi-Armed Bandits. Proceedings of the International Conference on Machine Learning (ICML), 2008.
     • Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying Search Results. WSDM 2009, pp. 5-14.
     • Rodrygo Santos, Craig Macdonald, and Iadh Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010 (to appear).
     • C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and Diversity in Information Retrieval Evaluation. Proceedings of the 31st ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 659-666, July 2008.
     • F. Radlinski, Paul N. Bennett, Ben Carterette, and Thorsten Joachims. Redundancy, Diversity and Interdependent Document Relevance. SIGIR Forum, Vol. 43, No. 2, pp. 46-52, December 2009.
  29. Questions?
  30. Appendix
     • Diminishing marginal returns!
