Diversity
Presentation Transcript

  • Relevance in Diversity (Sumit Bhatia)
  • Outline
      • What is Diversity?
      • Some Approaches for Diversification
          • MMR (SIGIR '98)
          • Multi-Armed Bandits (ICML '08)
          • Portfolio Theory (SIGIR '09)
          • Modeling user intent (WSDM '09)
          • Query Reformulation (WWW '10)
      • Evaluation Metrics and Datasets
      • Conclusions
      (Slide footer date: 2/23/2010)
  • Motivation
      • Basic premise of all IR models that we have discussed
      • Is it always true?
          • Similar documents
          • Ambiguous queries
          • Polysemy: JAVA
          • Insufficient information
          • Synonymy: Bombay or Mumbai
      • Key idea! Diversify search results
  • What is Diversity?
      • Extrinsic diversity
          • Uncertainty about the information need given a query
          • JAVA, JAGUAR
          • Query = page rank; who is my audience? A layman or an IR researcher?
      • Intrinsic diversity
          • Avoiding redundancy in search results
          • Encouraging novelty
          • Examples: literature survey, opinion mining, product reviews
  • Some Approaches for Diversification
  • MMR (Carbonell et al., SIGIR '98)
      • Combines query relevance and information novelty
      • Used in summarization and IR ranking
      • Instead of ranking by relevance, rank by "relevant novelty"
      • [Equation not preserved in the transcript; its two terms were labeled Relevance and Novelty]
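The "relevant novelty" criterion on the MMR slide can be sketched as a greedy reranker. This is a minimal illustration, not the authors' code: it assumes the commonly cited form of the criterion, λ·Sim(d, q) − (1 − λ)·max over already selected d' of Sim(d, d'), and the function name `mmr_rerank`, the parameter `lam`, and the toy similarity values are all hypothetical.

```python
from typing import Callable, List, Sequence


def mmr_rerank(
    query_sim: Sequence[float],
    doc_sim: Callable[[int, int], float],
    k: int,
    lam: float = 0.7,
) -> List[int]:
    """Return indices of up to k documents chosen greedily by an MMR-style score.

    query_sim[i] is the relevance of document i to the query;
    doc_sim(i, j) is the similarity between documents i and j.
    """
    candidates = set(range(len(query_sim)))
    selected: List[int] = []
    while candidates and len(selected) < k:

        def mmr_score(i: int) -> float:
            # Novelty penalty: similarity to the closest already selected document.
            redundancy = max((doc_sim(i, j) for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy

        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (made up): documents 0 and 1 are near-duplicates, document 2 differs.
relevance = [0.9, 0.85, 0.6]
pairwise = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
diverse = mmr_rerank(relevance, lambda i, j: pairwise[i][j], k=3, lam=0.5)
by_relevance = mmr_rerank(relevance, lambda i, j: pairwise[i][j], k=3, lam=1.0)
```

With `lam=1.0` the novelty term vanishes and the ranking falls back to pure relevance order; lowering `lam` pushes the near-duplicate document down the list.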
  • Portfolio Theory (Wang and Zhu, SIGIR '09)
      • The Probability Ranking Principle (PRP) optimizes the mean of the relevance scores of the top-k search results
      • Ideas from portfolio theory in finance:
          • To maximize profit, include some "risky" stocks in your portfolio
          • To maximize expected relevance, include some "risky" (less relevant) documents
      • Maximize the mean of the relevance scores for a given risk (variance)
      • Risk taking encourages diversity
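The mean-variance idea on the portfolio slide can be sketched as a greedy ranker. This is a simplified illustration under stated assumptions, not the procedure from the paper: the objective is taken to be E[R] − b·Var[R] for a rank-weighted sum R of relevance scores, the 1/rank weights are an assumption, and `portfolio_rank`, `b`, and the toy means and covariances are hypothetical.

```python
from typing import List, Sequence


def portfolio_rank(
    mean: Sequence[float],
    cov: Sequence[Sequence[float]],
    k: int,
    b: float = 1.0,
) -> List[int]:
    """Greedily build a ranking that increases E[R] - b * Var[R] at each rank."""
    # Assumed rank discount: weight 1/rank, so top ranks matter more.
    weights = [1.0 / (pos + 1) for pos in range(k)]
    selected: List[int] = []
    remaining = list(range(len(mean)))
    for pos in range(min(k, len(mean))):
        w = weights[pos]

        def gain(i: int) -> float:
            # Change in the objective if document i is placed at this rank.
            cross = sum(weights[p] * cov[i][d] for p, d in enumerate(selected))
            variance_increase = w * w * cov[i][i] + 2.0 * w * cross
            return w * mean[i] - b * variance_increase

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): documents 0 and 1 are strongly correlated,
# document 2 is weakly anti-correlated with both.
means = [0.9, 0.85, 0.6]
covariance = [
    [0.10, 0.09, -0.02],
    [0.09, 0.10, -0.02],
    [-0.02, -0.02, 0.10],
]
mean_only = portfolio_rank(means, covariance, k=3, b=0.0)
mean_variance = portfolio_rank(means, covariance, k=3, b=2.0)
```

With `b=0.0` only the mean matters and the ranking is plain relevance order. With a nonzero `b`, the covariance between a candidate and the documents already ranked changes the order, which is where diversity enters.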
  • A Learning To Rank Approach
      • Learn a ranking function
      • Manually labeled training data
      • Optimize IR performance metrics
      • Deploy the function in a live search engine
  • A Learning To Rank Approach
      • Learn a ranking function
      • Manually labeled training data. Alternative: online learning using usage data
      • Optimize IR performance metrics
      • Deploy the function in a live search engine
      • Use clickthrough data to maximize the probability that a new user will find at least one relevant document high up in the ranking.
  • Problem Setup
      • Produce an optimally diverse ranking of documents for one fixed query.
      • Given a population of users
      • Each user u_i has an associated set of relevant documents A_i
      • Different users have different relevant sets based on their interpretation of the query.
      • The user is presented with an ordered list of documents
      • User u_t clicks on document i with probability p_ti
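The setup above can be simulated with one bandit per rank, in the spirit of the multi-armed bandit approach listed in the outline. The sketch below makes several assumptions that are not on the slides: each rank runs an epsilon-greedy bandit (chosen here only for brevity), each simulated user clicks the first document that is in their relevant set (so p_ti is 0 or 1), and a rank's bandit is rewarded only when the clicked document is the one it proposed. All class and function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Set


class EpsilonGreedyBandit:
    """One bandit instance per rank; its arms are document ids."""

    def __init__(self, n_arms: int, epsilon: float, rng: random.Random) -> None:
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.epsilon = epsilon
        self.rng = rng

    def best_arm(self) -> int:
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def select(self) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return self.best_arm()

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def build_ranking(choices: Sequence[int], n_docs: int) -> List[int]:
    """Replace duplicate proposals with an arbitrary document not yet shown."""
    ranking: List[int] = []
    for doc in choices:
        if doc in ranking:
            doc = next(d for d in range(n_docs) if d not in ranking)
        ranking.append(doc)
    return ranking


def run_ranked_bandits(
    relevant_sets: Sequence[Set[int]],
    user_probs: Sequence[float],
    n_docs: int,
    k: int,
    rounds: int,
    seed: int = 0,
) -> List[int]:
    rng = random.Random(seed)
    bandits = [EpsilonGreedyBandit(n_docs, 0.1, rng) for _ in range(k)]
    for _ in range(rounds):
        # Sample a user; each user type has its own relevant set A_i.
        relevant = rng.choices(list(relevant_sets), weights=list(user_probs))[0]
        choices = [b.select() for b in bandits]
        ranking = build_ranking(choices, n_docs)
        # Assumed click model: the user clicks the first relevant document.
        clicked: Optional[int] = next(
            (r for r, d in enumerate(ranking) if d in relevant), None
        )
        for r, bandit in enumerate(bandits):
            rewarded = clicked == r and ranking[r] == choices[r]
            bandit.update(choices[r], 1.0 if rewarded else 0.0)
    # Final ranking: each rank's current best arm, with duplicates replaced.
    return build_ranking([b.best_arm() for b in bandits], n_docs)


# Toy population (made up): 70% of users want documents 0 or 1, 30% want document 2.
learned = run_ranked_bandits([{0, 1}, {2}], [0.7, 0.3], n_docs=4, k=2, rounds=5000)
```

Because the second rank is only rewarded when the first rank fails to satisfy the user, it tends to learn a document for the minority interpretation rather than a second document for the majority one.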
  • Modeling User Intent (Agrawal et al., WSDM 2009)
      • Assume a taxonomy of information
      • User intents are modeled as topics
      • Documents and queries may belong to more than one topic
      • The distribution of user intents over topics/categories is available (search engine logs)
      • The objective function trades off relevance and diversity
          • Relevance: standard IR methods
          • Diversity: categories in the taxonomy
  • Objective Function
      • C(d) = set of categories for document d
      • C(q) = set of categories for query q
      • Distribution P(c|q) is known
      • V(d|q,c) = quality value of document d for query q, when the intended category is c
      • [Equation not preserved in the transcript; it was labeled Average Satisfaction Probability]
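The equation labeled "Average Satisfaction Probability" did not survive extraction. For orientation, the objective is usually stated as below for this model; treat it as a reconstruction from the definitions on the slide rather than the slide's exact notation. S is the set of k returned documents, and the product is the probability that every document in S fails a user whose intended category is c.

```latex
P(S \mid q) \;=\; \sum_{c \in C(q)} P(c \mid q)
  \left( 1 - \prod_{d \in S} \bigl( 1 - V(d \mid q, c) \bigr) \right),
\qquad
\text{Diversify}(k):\; \max_{S \subseteq R(q),\; |S| = k} P(S \mid q)
```

Adding a second document for a category that is already well covered changes the product only slightly, which is the diminishing-returns behavior that motivates a greedy solution.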
  • Observations
      • Diversify(k) is NP-hard
      • P(S|q) is a submodular function
      • IDEA! A greedy approximate solution
      • U(c|q,S) = probability that q belongs to c given that all documents in S fail to satisfy the user
      • R(q) = set of original relevant documents
  • Greedy Algorithm
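The algorithm itself is missing from the transcript. A minimal sketch of a greedy selection built from the quantities defined on the preceding slides (P(c|q), V(d|q,c), U(c|q,S), R(q)) might look as follows; the function name `greedy_diversify`, the dictionary layout, and the toy numbers are assumptions made for illustration.

```python
from typing import Dict, List


def greedy_diversify(
    p_cat: Dict[str, float],
    quality: Dict[str, Dict[str, float]],
    k: int,
) -> List[str]:
    """Greedily pick up to k documents.

    p_cat[c] plays the role of P(c|q); quality[d][c] plays the role of V(d|q,c).
    The keys of quality stand in for R(q).
    """
    # U(c|q,S): probability that c is the intent and S has not satisfied it yet.
    unsatisfied = dict(p_cat)
    candidates = sorted(quality)  # sorted only for deterministic tie-breaking
    selected: List[str] = []
    while candidates and len(selected) < k:

        def marginal(d: str) -> float:
            # Expected gain in satisfaction from adding document d.
            return sum(u * quality[d].get(c, 0.0) for c, u in unsatisfied.items())

        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
        # Discount each category by how well the chosen document covers it.
        for c in unsatisfied:
            unsatisfied[c] *= 1.0 - quality[best].get(c, 0.0)
    return selected


# Toy query "java" (made-up numbers): 60% programming language, 40% island.
intent = {"language": 0.6, "island": 0.4}
values = {
    "d1": {"language": 0.9},
    "d2": {"language": 0.8},
    "d3": {"island": 0.7},
}
top2 = greedy_diversify(intent, values, k=2)
```

After `d1` is chosen, the residual weight of the "language" intent drops to 0.06, so the island document `d3` outranks the second language document `d2` even though `d2` has the higher quality value.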
  • Query Reformulation (Santos et al., WWW 2010)
      • Probabilistic framework for diversification
      • Models the information need of an ambiguous query (or user intent) as a set of sub-queries
      • Rank documents according to the following mixture model
      • [Equation not preserved in the transcript; its two terms were labeled Relevance and Novelty]
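The mixture model is missing from the transcript. As a hedged reconstruction in the form this framework is commonly written (the symbols λ, S, and Q are assumed here, not taken from the slide), a document d is scored against the set S of documents already selected and the sub-queries q_i in Q:

```latex
\text{score}(d) \;=\;
  \underbrace{(1 - \lambda)\, P(d \mid q)}_{\text{Relevance}}
  \;+\;
  \underbrace{\lambda \sum_{q_i \in Q} P(q_i \mid q)\, P(d \mid q_i)
     \prod_{d_j \in S} \bigl( 1 - P(d_j \mid q_i) \bigr)}_{\text{Novelty}}
```

The product shrinks the contribution of any sub-query that the documents in S already cover, so the novelty term favors documents that address sub-queries not yet served.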
  • Estimating Probabilities
      • Use standard probabilistic models, e.g., language models (LM)
      • Queries suggested by search engines
  • Evaluation Metrics
  • Classical Measures
      • We know about Precision, Recall
  • Nugget-based measures
      • Nugget: any binary property, such as "a piece of information" or a topic/category
      • Model the user's information need and the document's information content as sets of nuggets
      • A document is relevant if it contains at least one nugget from u
      • J(d,i) = 1 if d contains information nugget n_i; 0 otherwise
      • [Left-hand side not preserved in the transcript] if J(d,i) = 1; 0 otherwise
  • Nugget-based measures
      • Gain at rank k
      • Discounted Cumulative Gain, DCG
      • Normalized Discounted Cumulative Gain
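The formulas for these measures are not in the transcript. The sketch below assumes the novelty-discounted form commonly used with nugget judgments: a nugget seen r times earlier in the ranking contributes (1 − α)^r to the gain, and gains are discounted by log2(1 + rank). The names `alpha_gains` and `alpha_dcg`, the parameter `alpha`, and the toy ranking are hypothetical; each document is represented by the set of nuggets from the user's information need that it contains.

```python
import math
from typing import Dict, List, Sequence, Set


def alpha_gains(ranking: Sequence[Set[str]], alpha: float = 0.5) -> List[float]:
    """Gain at each rank, discounting nuggets already seen higher in the list."""
    seen: Dict[str, int] = {}
    gains: List[float] = []
    for nuggets in ranking:
        gains.append(sum((1.0 - alpha) ** seen.get(n, 0) for n in nuggets))
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
    return gains


def alpha_dcg(ranking: Sequence[Set[str]], alpha: float = 0.5) -> List[float]:
    """Discounted cumulative gain at every rank k = 1..len(ranking)."""
    total = 0.0
    cumulative: List[float] = []
    for rank, gain in enumerate(alpha_gains(ranking, alpha), start=1):
        total += gain / math.log2(1 + rank)
        cumulative.append(total)
    return cumulative


# Toy ranking (made up): the second document only repeats nugget "a".
toy_ranking = [{"a", "b"}, {"a"}, {"c"}]
gains = alpha_gains(toy_ranking, alpha=0.5)
dcg = alpha_dcg(toy_ranking, alpha=0.5)
```

The normalized variant divides each value by the DCG of an ideal ordering. Finding that ideal ordering exactly is hard in general, so it is typically approximated greedily; that step is omitted here.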
  • Intent-Aware Measures
  • Dataset
      • ClueWeb09 Dataset
          • http://boston.lti.cs.cmu.edu/Data/clueweb09/
          • Used in TREC 2009
          • 1,040,809,705 web pages, in 10 languages
          • 5 TB compressed (25 TB uncompressed)
          • Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
          • Total outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
  • Concluding Remarks
      • Diversification has the potential to solve many IR problems
          • Query ambiguity
              • Polysemy
              • Synonymy
          • Variations in user intents
          • Variations in user requirements
  • References
      • Jaime G. Carbonell and Jade Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998, pp. 335-336.
      • J. Wang and J. Zhu. Portfolio Theory of Information Retrieval. SIGIR 2009.
      • Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning Diverse Rankings with Multi-Armed Bandits. Proceedings of the International Conference on Machine Learning (ICML), 2008.
      • Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying Search Results. WSDM 2009, pp. 5-14.
      • Rodrygo Santos, Craig Macdonald, and Iadh Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010 (to appear).
      • C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and Diversity in Information Retrieval Evaluation. Proceedings of the 31st ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 659-666, July 2008.
      • F. Radlinski, Paul N. Bennett, Ben Carterette, and Thorsten Joachims. Redundancy, Diversity and Interdependent Document Relevance. SIGIR Forum, Vol. 43, No. 2, pp. 46-52, December 2009.
  • Questions?
  • Appendix: Diminishing Marginal Returns!