
SIGIR 2012 Tutorial: Query Performance Prediction for IR



  1. Query Performance Prediction for IR
     David Carmel, IBM Haifa Research Lab
     Oren Kurland, Technion
     SIGIR Tutorial, Portland, Oregon, August 12, 2012
     IBM Haifa Research Lab © 2010 IBM Corporation
  2. Instructors
     Dr. David Carmel: Research Staff Member, Information Retrieval group, IBM Haifa Research Lab. Ph.D. in Computer Science, Technion, Israel, 1997. Research interests: search in the enterprise, query performance prediction, social search, and text mining.
     Dr. Oren Kurland: Senior Lecturer, Technion --- Israel Institute of Technology. Ph.D. in Computer Science, Cornell University, 2006. Research interests: information retrieval.
     SIGIR 2012 Tutorial: Query Performance Prediction
  3. This tutorial is based in part on the book Estimating the Query Difficulty for Information Retrieval, Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers. The tutorial presents the opinions of the presenters only and does not necessarily reflect the views of IBM or the Technion. Algorithms, techniques, features, etc. mentioned here might or might not be in use by IBM.
  4. Query Difficulty Estimation - Main Challenge
     Estimating query difficulty is an attempt to quantify the quality of the search results retrieved for a query from a given collection of documents when no relevance feedback is given.
     Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor.
     Understanding why some queries are inherently more difficult than others is essential for IR; a good answer to this question will help search engines reduce the variance in performance.
  5. Estimating the Query Difficulty - Some Benefits
     Feedback to users: the IR system can provide users with an estimate of the expected quality of the results retrieved for their queries. Users can then rephrase "difficult" queries or resubmit them to alternative search resources.
     Feedback to the search engine: the IR system can invoke alternative retrieval strategies for different queries according to their estimated difficulty. For example, intensive query analysis procedures may be invoked selectively for difficult queries only.
     Feedback to the system administrator: for example, administrators can identify missing-content queries and then expand the collection of documents to better answer them.
     For IR applications: for example, a federated (distributed) search application can merge the results of queries submitted to different datasets, weighing the results returned from each dataset by its predicted difficulty.
  6. Contents
     - Introduction: The Robustness Problem of (Ad Hoc) Information Retrieval
     - Basic Concepts
     - Query Performance Prediction Methods: Pre-Retrieval Prediction Methods; Post-Retrieval Prediction Methods; Combining Predictors
     - A Unified Framework for Post-Retrieval Query-Performance Prediction: A General Model for Query Difficulty; A Probabilistic Framework for Query-Performance Prediction
     - Applications of Query Difficulty Estimation
     - Summary
     - Open Challenges
  7. Introduction - The Robustness Problem of Information Retrieval
  8. The Robustness Problem of IR
     Most IR systems suffer from a radical variance in retrieval performance when responding to users' queries. Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor, which may lead to user dissatisfaction.
     Variability in performance relates to various factors:
     - The query itself (e.g., term ambiguity: "Golf")
     - The vocabulary mismatch problem: the discrepancy between the query vocabulary and the document vocabulary
     - Missing-content queries: there is no relevant information in the corpus that can satisfy the information need.
  9. An example of a difficult query: "The Hubble Telescope achievements"
     Top results: Great eye sets sights sky high; Simple test would have found flaw in Hubble telescope; Nation in brief: Hubble space telescope placed aboard shuttle; Cause of Hubble telescope defect reportedly found; Flaw in Hubble telescope; Flawed mirror hampers Hubble space telescope; Touchy telescope torments controllers; NASA scrubs launch of Discovery; Hubble builders got award fees, magazine says.
     The retrieved results deal with issues related to the Hubble telescope project in general, but the gist of the query, achievements, is lost.
  10. The variance in performance across queries and systems (TREC-7)
     Queries are sorted in decreasing order of the average precision attained among all TREC participants (green bars). The performance of two different systems per query is shown by the two curves.
  11. The Reliable Information Access (RIA) workshop
     The first attempt to rigorously investigate the reasons for performance variability across queries and systems: extensive failure analysis of the results of 6 IR systems over 45 TREC topics.
     The main reason for failures was the systems' inability to identify all important aspects of the query - the failure to emphasize one aspect of a query over another, or to emphasize one aspect and neglect others. For "What disasters have occurred in tunnels used for transportation?", emphasizing only one of the terms deteriorates performance, because each term on its own does not fully reflect the information need.
     If systems could estimate which failure categories a query may belong to, they could apply specific automated techniques corresponding to the failure mode in order to improve performance.
  12. (figure slide: no recoverable text content)
  13. Instability in Retrieval - TREC's Robust Tracks
     The diversity in performance among topics and systems led to the TREC Robust tracks (2003-2005), encouraging systems to decrease variance in query performance by focusing on poorly performing topics.
     Systems were challenged with 50 old TREC topics found to be "difficult" for most systems over the years; a topic is considered difficult when the median of the average precision scores of all participants for that topic is below a given threshold.
     A new measure, GMAP, uses the geometric mean instead of the arithmetic mean when averaging precision values over topics. It emphasizes the lowest-performing topics, and is thus a useful measure that can attest to the robustness of a system's performance.
  14. The Robust Tracks - Decreasing Variability across Topics
     Several approaches to improving the poor effectiveness for some topics were tested: selective query-processing strategies based on performance prediction; post-retrieval reordering; selective weighting functions; selective query expansion.
     None of these approaches was able to show consistent improvement over traditional non-selective approaches. Apparently, expanding the query with appropriate terms extracted from an external collection (the Web) improves the effectiveness for many queries, including poorly performing ones.
  15. The Robust Tracks - Query Performance Prediction
     As a second challenge, systems were asked to predict their performance for each of the test topics. The TREC topics were then ranked, first by their predicted performance value and second by their actual performance value; evaluation measured the similarity between the predicted performance-based ranking and the actual performance-based ranking.
     Most systems failed to exhibit reasonable prediction capability; 14 runs had a negative correlation between the predicted and actual topic rankings, demonstrating that performance prediction is intrinsically difficult.
     On the positive side, the difficulty in developing reliable prediction methods raised the awareness of the IR community to this challenge.
  16. How difficult is the performance prediction task for human experts?
     TREC-6 experiment: estimating whether human experts can predict query difficulty. A group of experts was asked to classify a set of TREC topics into three degrees of difficulty (easy, middle, hard) based on the query expression only. The manual judgments were compared to the median of the average precision scores, as determined after evaluating the performance of all participating systems.
     Results: the Pearson correlation between the expert judgments and the "true" values was very low (0.26). The agreement between experts, as measured by the correlation between their judgments, was very low too (0.39).
     The low correlation illustrates how difficult this task is and how little is known about what makes a query difficult.
  17. How difficult is the performance prediction task for human experts? (contd.)
     Hauff et al. (CIKM 2010) found:
     - A low level of agreement between humans with regard to which queries are more difficult than others (median kappa = 0.36); there was high variance in the ability of humans to estimate query difficulty, although they shared "similar" backgrounds.
     - A low correlation between true performance and humans' estimates of query difficulty (performed in a pre-retrieval fashion): the median Kendall tau was 0.31, quite lower than that posted by the best-performing pre-retrieval predictors. However, overall the humans did manage to differentiate between "good" and "bad" queries.
     - A low correlation between humans' predictions and those of query-performance predictors, with some exceptions.
     These findings further demonstrate how difficult the query-performance prediction task is and how little is known about what makes a query difficult.
  18. Are queries found to be difficult in one collection still considered difficult in another?
     Robust track 2005: difficult topics in the ROBUST collection were tested against another collection (AQUAINT). The median average precision over the ROBUST collection is 0.126, compared to 0.185 for the same topics over the AQUAINT collection. Apparently, the AQUAINT collection is "easier" than the ROBUST collection, probably due to collection size, many more relevant documents per topic in AQUAINT, and document features such as structure and coherence.
     However, the relative difficulty of the topics is preserved over the two datasets: the Pearson correlation between topics' performance on the two datasets is 0.463, which illustrates some dependency between the topics' median scores on both collections. Even when topics are somewhat easier in one collection than another, the relative difficulty among topics is preserved, at least to some extent.
  19. Basic Concepts
  20. The retrieval task
     Given a document set D (the corpus) and a query q, retrieve Dq (the result list): a ranked list of documents from D that are most likely to be relevant to q.
     Some widely used retrieval methods:
     - Vector-space tf-idf based ranking, which estimates relevance by the similarity between the query and a document in the vector space
     - The probabilistic Okapi BM25 method, which estimates the probability that the document is relevant to the query
     - Language-model-based approaches, which estimate the probability that the query was generated by a language model induced from the document
     - And more: divergence from randomness (DFR) approaches, inference networks, Markov random fields (MRF), ...
  21. Text REtrieval Conference (TREC)
     A series of workshops for large-scale evaluation of (mostly) text retrieval technology, with realistic test collections and uniform, appropriate scoring procedures.
     A TREC task usually comprises a document collection (corpus), a list of topics (information needs), and a list of relevant documents for each topic (QRELs).
     Example topic - Title: African Civilian Deaths. Description: How many civilian non-combatants have been killed in the various civil wars in Africa? Narrative: A relevant document will contain specific casualty information for a given area, country, or region. It will cite numbers of civilian deaths caused directly or indirectly by armed conflict.
  22. Precision measures
     Precision at k (P@k): the fraction of relevant documents among the top-k results.
     Average precision: AP is the average of the precision values computed at the ranks of each of the relevant documents in the ranked list:
        AP(q) = (1/|Rq|) * Σ_{r ∈ Rq} P@rank(r)
     where Rq is the set of documents in the corpus that are relevant to q. Average precision is usually computed using a ranking truncated at some position (typically 1000 in TREC).
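The AP formula above can be sketched in a few lines of Python (the document IDs below are toy values for illustration; relevant documents missing from the truncated list contribute zero, following the TREC convention mentioned on the slide):

```python
def average_precision(ranked, relevant):
    """AP(q): mean of P@rank(r) over all relevant documents r in Rq."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k   # P@k at this relevant doc's rank
    # Divide by |Rq|, so unretrieved relevant docs count as zero.
    return precision_sum / len(relevant) if relevant else 0.0

# Toy example: relevant docs d1 and d3 retrieved at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})  # (1/1 + 2/3) / 2
```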
  23. Prediction quality measures
     Given a query q and a result list Dq that contains n documents, the goal is to estimate the retrieval effectiveness of Dq in terms of satisfying Iq, the information need behind q. Specifically, the prediction task is to predict AP(q) when no relevance information (Rq) is given.
     In practice: estimate, for example, the expected average precision for q. The quality of a performance predictor can be measured by the correlation between the predicted average precision values and the corresponding actual precision values.
  24. Measures of correlation
     Linear correlation (Pearson) considers the true AP values as well as the predicted AP values of queries.
     Rank correlation (Kendall's tau, Spearman's rho) considers only the ranking of queries by their true AP values and by their predicted AP values.
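The two correlation families can be sketched in pure Python (a minimal version for intuition; in practice one would use a statistics library such as scipy.stats):

```python
from math import sqrt

def pearson(xs, ys):
    """Linear correlation between predicted and true AP values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def kendall_tau(xs, ys):
    """Rank correlation: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```

Note that Kendall's tau ignores the magnitudes entirely: swapping the two best-ranked queries costs the same whether their AP values differ by 0.01 or by 0.5.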
  25. Evaluating a prediction system
     (scatter plot: predicted AP vs. actual AP for queries Q1-Q5)
     Different correlation metrics measure different things. If there is no a priori reason to use Pearson, prefer Spearman or Kendall's tau. In practice, the difference is not large, because there are enough queries and because of the distribution of AP.
  26. Evaluating a prediction system (contd.)
     It is important to note that state-of-the-art query-performance predictors might not be correlated (at all) with measures of users' performance (e.g., the time it takes to reach the first relevant document); see Turpin and Hersh (ADC '04) and Zhao and Scholer (ADC '07).
     However, this finding might be attributed, as suggested by Turpin and Hersh (ADC '04), to the fact that standard evaluation measures (e.g., average precision) and users' performance are not always strongly correlated (Hersh et al. '00, Turpin and Hersh '01, Turpin and Scholer '06, Smucker and Parkash Jethani '10).
  27. Query Performance Prediction Methods
  28. Taxonomy of performance prediction methods
     - Pre-Retrieval Methods
       - Linguistics: morphologic, syntactic, semantic
       - Statistics: specificity, similarity, coherency, relatedness
     - Post-Retrieval Methods
       - Clarity
       - Robustness: query perturbation, document perturbation, retrieval perturbation; cluster hypothesis
       - Score Analysis: top score, avg. score, variance of scores
  29. Pre-Retrieval Prediction Methods
     Pre-retrieval prediction approaches estimate the quality of the search results before the search takes place. They provide an effective instantiation of query performance prediction for search applications that must respond efficiently to search requests: only the query terms, associated with some pre-defined statistics gathered at indexing time, can be used for prediction.
     Pre-retrieval methods can be split into linguistic and statistical methods. Linguistic methods apply natural language processing (NLP) techniques and use external linguistic resources to identify ambiguity and polysemy in the query. Statistical methods analyze the distribution of the query terms within the collection.
  30. Linguistic approaches (Mothe & Tanguy, SIGIR 2005 QD workshop; Hauff 2010)
     Most linguistic features do not correlate well with system performance. Features include, for example, morphological ones (avg. # of morphemes per query term), syntactic link span (which relates to the average distance between query words in the parse tree), and semantic ones (polysemy: the avg. # of synsets per word in the WordNet dictionary).
     Only the syntactic link span and the polysemy value were shown to have some (low) correlation. This is quite surprising, as intuitively poor performance can be expected for ambiguous queries. Apparently, term ambiguity should be measured using corpus-based approaches, since a term that might be ambiguous with respect to the general vocabulary may have only a single interpretation in the corpus.
  31. Pre-retrieval statistical methods
     Analyze the distribution of the query term frequencies within the collection. Two frequently used term statistics:
     - Inverse document frequency: idf(t) = log(N/Nt)
     - Inverse collection term frequency: ictf(t) = log(|D|/tf(t,D))
     Specificity-based predictors measure the query terms' distribution over the collection:
     - avgIDF, avgICTF: queries composed of infrequent terms are easier to satisfy
     - maxIDF, maxICTF: similarly
     - varIDF, varICTF: low variance reflects the lack of dominant terms in the query
     A query composed of non-specific terms is deemed more difficult ("Who and Whom").
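The idf-based specificity predictors are straightforward to sketch (the toy corpus and query terms below are invented for illustration):

```python
from math import log

def idf(term, docs):
    """idf(t) = log(N / Nt), where Nt is the number of docs containing t."""
    nt = sum(1 for d in docs if term in d)
    return log(len(docs) / nt) if nt else 0.0

def avg_max_idf(query_terms, docs):
    """avgIDF and maxIDF over the query terms."""
    vals = [idf(t, docs) for t in query_terms]
    return sum(vals) / len(vals), max(vals)

# "golf" is frequent (low idf); "chromodynamics" is rare (high idf).
docs = [{"golf", "club"}, {"golf", "car"}, {"quantum", "golf"}, {"chromodynamics"}]
avg, mx = avg_max_idf(["golf", "chromodynamics"], docs)
```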
  32. Specificity-based predictors
     Query Scope (He & Ounis 2004): the percentage of documents in the corpus that contain at least one query term; a high value indicates many candidates for retrieval, and thereby a difficult query. Query scope shows marginal prediction quality for short queries only, while for long queries its quality drops significantly. QS is not a "pure" pre-retrieval predictor, as it requires finding the documents containing the query terms (consider dynamic corpora).
     Simplified Clarity Score (He & Ounis 2004): the Kullback-Leibler (KL) divergence between the (simplified) query language model and the corpus language model:
        SCS(q) = Σ_{t ∈ q} p(t|q) log( p(t|q) / p(t|D) )
     SCS is strongly related to the avgICTF predictor; assuming each term appears only once in the query: SCS(q) = log(1/|q|) + avgICTF(q).
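The SCS definition and its avgICTF identity can be checked numerically; the collection statistics below are made-up values, not from the tutorial:

```python
from math import log

def scs(query, collection_tf, collection_len):
    """Simplified Clarity Score: KL(p(.|q) || p(.|D)), p(t|q) = qtf/|q|."""
    q_len = len(query)
    score = 0.0
    for t in set(query):
        p_tq = query.count(t) / q_len
        p_td = collection_tf[t] / collection_len
        score += p_tq * log(p_tq / p_td)
    return score

def avg_ictf(query, collection_tf, collection_len):
    """avgICTF(q): mean of log(|D| / tf(t,D)) over query terms."""
    return sum(log(collection_len / collection_tf[t]) for t in query) / len(query)

tf = {"hubble": 20, "telescope": 50, "achievements": 5}   # invented counts
q = ["hubble", "telescope", "achievements"]
# With each term occurring once, SCS(q) = log(1/|q|) + avgICTF(q).
lhs = scs(q, tf, 10_000)
rhs = log(1 / len(q)) + avg_ictf(q, tf, 10_000)
```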
  33. Similarity, coherency, and variance-based predictors
     Similarity - SCQ (Zhao et al. 2008): high similarity to the corpus indicates effective retrieval:
        SCQ(t) = (1 + log(tf(t,D))) * idf(t)
     maxSCQ(q)/avgSCQ(q) are the maximum/average over the query terms. This contradicts(?) the specificity idea (e.g., SCS).
     Coherency - CS (He et al. 2008): the average inter-document similarity between documents containing a query term, averaged over query terms. This is a pre-retrieval concept reminiscent of the post-retrieval autocorrelation approach (Diaz 2007) discussed later. It demands heavy computation, requiring the construction of a pointwise similarity matrix over all pairs of documents in the index.
     Variance (Zhao et al. 2008): var(t) is the variance of term t's weights (e.g., tf.idf) over the documents containing it; maxVar and sumVar are the maximum/sum over query terms. Hypothesis: low variance implies a difficult query due to low discriminative power.
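A minimal sketch of the SCQ predictor over a toy tokenized corpus (the documents and terms are invented; real systems would read tf and df from the index):

```python
from math import log

def scq(term, docs):
    """SCQ(t) = (1 + log tf(t,D)) * idf(t); rewards terms that are
    frequent overall yet concentrated in few documents."""
    tf = sum(d.count(term) for d in docs)       # collection term frequency
    nt = sum(1 for d in docs if term in d)      # document frequency
    if tf == 0:
        return 0.0
    return (1 + log(tf)) * log(len(docs) / nt)

docs = [["golf", "golf", "swing"], ["golf", "club"], ["market"], ["market", "stock"]]
# maxSCQ over the query terms, as on the slide.
max_scq = max(scq(t, docs) for t in ["golf", "swing"])
```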
  34. Term Relatedness (Hauff 2010)
     Hypothesis: if the query terms co-occur frequently in the collection, we expect good performance.
     Pointwise mutual information (PMI) is a popular measure of the co-occurrence statistics of two terms in the collection:
        PMI(t1,t2) = log( p(t1,t2|D) / (p(t1|D) p(t2|D)) )
     It requires efficient tools for gathering collocation statistics from the corpus, to allow dynamic usage at query run-time. avgPMI(q)/maxPMI(q) measure the average and the maximum PMI over all pairs of terms in the query.
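Document-level PMI can be sketched as follows (the toy corpus is invented; the slide does not fix the co-occurrence window, so whole-document co-occurrence is assumed here):

```python
from math import log

def pmi(t1, t2, docs):
    """PMI(t1,t2) = log( p(t1,t2|D) / (p(t1|D) * p(t2|D)) ),
    with probabilities estimated from document-level occurrence."""
    n = len(docs)
    p1 = sum(1 for d in docs if t1 in d) / n
    p2 = sum(1 for d in docs if t2 in d) / n
    p12 = sum(1 for d in docs if t1 in d and t2 in d) / n
    return log(p12 / (p1 * p2)) if p12 else float("-inf")

docs = [{"prime", "lending", "rate"}, {"prime", "lending"},
        {"prime", "number"}, {"stock", "market"}]
# "prime" and "lending" co-occur more often than independence predicts.
score = pmi("prime", "lending", docs)
```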
  35. Evaluating pre-retrieval methods (Hauff 2010) - Insights
     maxVar and maxSCQ dominate the other predictors and are the most stable over the collections and topic sets; however, their performance drops significantly for one of the query sets (topics 301-350).
     Prediction is harder over the Web collections (WT10G and GOV2) than over the news collection (ROBUST), probably due to the higher heterogeneity of the data.
  36. Post-Retrieval Predictors
  37. Post-Retrieval Predictors
     Post-retrieval methods analyze the search results in addition to the query. They are usually more complex, as the top results are retrieved and analyzed, and prediction quality depends on the retrieval process, since different results are expected for the same query under different retrieval methods. In contrast to pre-retrieval methods, the search results may depend on query-independent factors such as document authority scores, search personalization, etc.
     Post-retrieval methods can be categorized into three main paradigms:
     - Clarity-based methods directly measure the coherence of the search results.
     - Robustness-based methods evaluate how robust the results are to perturbations in the query, the result list, and the retrieval method.
     - Score-distribution-based methods analyze the score distribution of the search results.
  38. Clarity (Cronen-Townsend et al. SIGIR 2002)
     Clarity measures the coherence (clarity) of the result list with respect to the corpus; good results are expected to be focused on the query's topic.
     Clarity considers the discrepancy between the likelihood of the words most frequently used in the retrieved documents and their likelihood in the whole corpus. For good results, the language of the retrieved documents should be distinct from the general language of the whole corpus; for bad results, the language of the retrieved documents tends to be more similar to the general language.
     Accordingly, clarity measures the KL divergence between a language model induced from the result list and one induced from the corpus.
  39. Clarity computation
     Clarity(q) = KL( p(·|Dq) || p(·|D) ) = Σ_{t ∈ V(D)} p(t|Dq) log( p(t|Dq) / pMLE(t|D) )   (KL divergence between Dq and D)
     pMLE(t|X) = tf(t,X) / |X|                                      (X's unsmoothed LM, MLE)
     p(t|d) = λ pMLE(t|d) + (1-λ) p(t|D)                            (d's smoothed LM)
     p(t|Dq) = Σ_{d ∈ Dq} p(t|d) p(d|q)                             (Dq's LM, RM1)
     p(d|q) = p(q|d) p(d) / Σ_{d' ∈ Dq} p(q|d') p(d') ∝ p(q|d)      (d's "relevance" to q)
     p(q|d) = Π_{t ∈ q} p(t|d)                                      (query likelihood model)
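The pipeline above can be sketched given precomputed smoothed document language models and normalized query-likelihood weights (the tiny four-word vocabulary and all probabilities below are invented for illustration):

```python
from math import log

def clarity(result_doc_lms, doc_weights, corpus_lm):
    """Clarity(q) = KL( p(.|Dq) || p(.|D) ).

    result_doc_lms: smoothed LM p(t|d) for each retrieved doc,
    doc_weights:    p(d|q), normalized query-likelihood scores,
    corpus_lm:      p(t|D), the corpus MLE model.
    """
    # RM1: p(t|Dq) = sum_d p(t|d) * p(d|q)
    p_t_dq = {}
    for lm, w in zip(result_doc_lms, doc_weights):
        for t, p in lm.items():
            p_t_dq[t] = p_t_dq.get(t, 0.0) + p * w
    return sum(p * log(p / corpus_lm[t]) for t, p in p_t_dq.items() if p > 0)

corpus_lm = {"hubble": 0.01, "telescope": 0.02, "the": 0.5, "of": 0.47}
focused = [{"hubble": 0.4, "telescope": 0.4, "the": 0.1, "of": 0.1}]
diffuse = [{"hubble": 0.01, "telescope": 0.02, "the": 0.5, "of": 0.47}]
# A focused result list diverges from the corpus LM; a diffuse one does not.
hi = clarity(focused, [1.0], corpus_lm)
lo = clarity(diffuse, [1.0], corpus_lm)
```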
  40. Example
     Consider two query variants for TREC topic 56 (from the TREC query track):
     Query A: Show me any predictions for changes in the prime lending rate and any changes made in the prime lending rates.
     Query B: What adjustments should be made once federal action occurs?
  41. The clarity of (1) relevant results, (2) non-relevant results, (3) random documents
     (figure: query clarity score (0-10) per topic number (300-450), for relevant, non-relevant, and collection-wide document sets)
  42. A novel interpretation of clarity - Hummel et al. 2012 (see the Clarity Revisited poster!)
     Clarity(q) = KL( p(·|Dq) || p(·|D) ) = CE( p(·|Dq) || p(·|D) ) - H( p(·|Dq) )
     (cross entropy minus entropy)
     Clarity(q) = Distance(q) - Diversity(q)
  43. Clarity variants
     - Divergence from randomness approach (Amati et al. '02)
     - Emphasize query terms (Cronen-Townsend et al. '04, Hauff et al. '08)
     - Only consider terms that appear in a very low percentage of all documents in the corpus (Hauff et al. '08); beneficial for noisy Web settings
  44. Robustness
     Robustness can be measured with respect to perturbations of:
     - The query: the robustness of the result list to small modifications of the query (e.g., perturbation of term weights).
     - The documents: small random perturbations of the document representations are unlikely to result in major changes to documents' retrieval scores. If the scores are spread over a wide range, these perturbations are unlikely to result in significant changes to the ranking.
     - The retrieval method: different retrieval methods tend to retrieve different results for the same query over the same document collection. A high overlap in the results retrieved by different methods may be related to high agreement on the (usually sparse) set of relevant results for the query; a low overlap may indicate no agreement on the relevant results, and hence query difficulty.
  45. Query perturbations
     Overlap between the query and its sub-queries (Yom-Tov et al. SIGIR 2005). Observation: some query terms have little or no influence on the retrieved documents, especially in difficult queries.
     The query feedback (QF) method (Zhou and Croft SIGIR 2007) models retrieval as a communication channel problem: the input is the query, the channel is the search system, and the set of results is the noisy output of the channel. A new query is generated from the list of results, using the terms with maximal contribution to the Clarity score, and a second list of results is retrieved for that query. The overlap between the two lists is used as a robustness score.
     (diagram: original query q → search engine → results R; generated query q' → search engine → new results R'; overlap(R, R') predicts AP(q))
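The final robustness score of the QF method is just the overlap of two ranked lists; a sketch (the document IDs and cutoff are invented, and generating the feedback query itself is omitted):

```python
def overlap_at_k(list1, list2, k=50):
    """Fraction of shared documents among the top-k of two ranked lists,
    used as the QF robustness score (rank order within top-k is ignored)."""
    return len(set(list1[:k]) & set(list2[:k])) / k

original = ["d1", "d2", "d3", "d4"]    # results for the original query q
requeried = ["d2", "d1", "d9", "d8"]   # results for the feedback query q'
score = overlap_at_k(original, requeried, k=4)   # 2 shared docs out of 4
```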
  46. Document perturbation (Zhou & Croft CIKM 2006)
     (A conceptually similar approach, using a different technique, was presented by Vinay et al. in SIGIR 2006.)
     How stable is the ranking in the presence of uncertainty in the ranked documents? Compare a ranked list from the original collection to the corresponding ranked list from a corrupted collection, using the same query and ranking function.
     (diagram: each document d is corrupted into d' via its language model, yielding collection D'; the lists d1..dk and d'1..d'k retrieved for q are compared)
  47. Cohesion of the result list - clustering tendency (Vinay et al. SIGIR 2006)
     The cohesion of the result list can be measured by its clustering patterns, following the "cluster hypothesis", which implies that documents relevant to a given query are likely to be similar to one another. A good retrieval returns a single, tight cluster, while a poor retrieval returns a loosely related set of documents covering many topics.
     The "clustering tendency" of the result set corresponds to the Cox-Lewis statistic, which measures the "randomness" level of the result list. It is measured by the distance between a randomly selected document and its nearest neighbor in the result list: when the list contains "inherent" clusters, the distance between the random document and its closest neighbor is likely to be much larger than the distance between this neighbor and its own nearest neighbor in the list.
  48. Retrieval method perturbation (Aslam & Pavlu, ECIR 2007)
     Query difficulty is predicted by submitting the query to different retrieval methods and measuring the diversity of the retrieved ranked lists: each ranking is mapped to a distribution over the document collection, and the Jensen-Shannon divergence (JSD) is used to measure the diversity of these distributions.
     Evaluation: the submissions of all participants to several TREC tracks were analyzed. The agreement between submissions correlates highly with query difficulty, as measured by the median performance (AP) of all participants; the more submissions are analyzed, the better the prediction quality.
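A sketch of the two steps: map each ranked list to a distribution, then measure JSD. The 1/rank weighting below is an assumption for illustration (Aslam & Pavlu derive their rank-to-probability mapping differently), and the three-document corpus is invented:

```python
from math import log

def kl(p, q):
    """KL divergence over aligned probability vectors (0 log 0 = 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, finite even with zeros."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_distribution(ranked, all_docs):
    """Assumed mapping: weight 1/rank for ranked docs, normalized."""
    w = {d: 1 / (r + 1) for r, d in enumerate(ranked)}
    z = sum(w.values())
    return [w.get(d, 0.0) / z for d in all_docs]

docs = ["d1", "d2", "d3"]
agree = jsd(rank_distribution(["d1", "d2", "d3"], docs),
            rank_distribution(["d1", "d2", "d3"], docs))
differ = jsd(rank_distribution(["d1", "d2", "d3"], docs),
             rank_distribution(["d3", "d2", "d1"], docs))
```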
  49. Score distribution analysis
     Often, the retrieval scores reflect the similarity of documents to the query; hence, the distribution of the retrieval scores can potentially help predict query performance.
     Naive predictors:
     - The highest retrieval score or the mean of the top scores (Tomlinson 2004)
     - The difference between query-independent scores and query-dependent scores, which reflects the "discriminative power" of the query (Bernstein et al. 2005)
  50. Spatial autocorrelation (Diaz SIGIR 2007)
     Query performance is correlated with the extent to which the result list "respects" the cluster hypothesis, i.e., the extent to which similar documents receive similar retrieval scores. In contrast, a difficult query might be detected when similar documents are scored differently.
     A document's "regularized" retrieval score is determined by the weighted sum of the scores of its most similar documents:
        ScoreReg(q,d) = Σ_{d'} Sim(d,d') * Score(q,d')
     The linear correlation of the regularized scores with the original scores is used for query-performance prediction.
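A minimal sketch of the autocorrelation idea (the similarity matrix and scores are toy values; here the weighted sum is normalized into a weighted mean over the other documents, a simplification of Diaz's formulation):

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def autocorrelation(scores, sim):
    """Correlate each doc's score with the similarity-weighted mean
    score of the other retrieved docs (its 'regularized' score)."""
    n = len(scores)
    reg = []
    for i in range(n):
        w = [sim[i][j] for j in range(n) if j != i]
        s = [scores[j] for j in range(n) if j != i]
        reg.append(sum(wi * si for wi, si in zip(w, s)) / sum(w))
    return pearson(scores, reg)

# Docs 0/1 and 2/3 form two similar pairs; scores respect the pairs.
scores = [3.0, 2.9, 1.0, 0.9]
sim = [[1, .9, .1, .1], [.9, 1, .1, .1], [.1, .1, 1, .9], [.1, .1, .9, 1]]
corr = autocorrelation(scores, sim)   # high: similar docs, similar scores
```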
  51. Weighted Information Gain - WIG (Zhou & Croft 2007)
      - WIG measures the divergence between the mean retrieval score of the top-ranked documents and that of the entire corpus. The more similar these documents are to the query, relative to the corpus, the more effective the retrieval; the corpus represents a general non-relevant document.
          WIG(q) = (1/k) Σ_{d ∈ Dq^k} Σ_{t ∈ q} λt · log( p(t|d) / p(t|D) )
        Equivalently, when scores are (weighted) log query likelihoods, this is the average top-k score minus the corpus score: avg(Score(d)) − Score(D).
      - Dq^k is the list of the k highest-ranked documents; λt reflects the weight of the term's type. When all query terms are simple keywords, λt collapses to 1/sqrt(|q|).
      - WIG was originally proposed and employed in the MRF framework. However, for a bag-of-words representation, MRF reduces to the query-likelihood model, and WIG remains effective (Shtok et al. '07, Zhou '07).
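The bag-of-words form of WIG can be sketched directly from the formula. The per-document and corpus language models are assumed to be precomputed (and smoothed) elsewhere; the 1e-9 floor is a defensive assumption, not part of the original method.

```python
import math

def wig(query_terms, top_doc_lms, corpus_lm):
    """WIG (after Zhou & Croft '07), bag-of-words form:
    WIG(q) = (1/k) * sum_{d in top-k} sum_{t in q}
             (1/sqrt(|q|)) * log( p(t|d) / p(t|D) ).
    top_doc_lms: list of term->probability dicts, one per top-ranked
    document (smoothed, assumed precomputed).
    corpus_lm: corpus term->probability dict."""
    k = len(top_doc_lms)
    lam = 1.0 / math.sqrt(len(query_terms))  # keyword-only case
    total = 0.0
    for lm in top_doc_lms:
        for t in query_terms:
            total += lam * math.log(lm.get(t, 1e-9) / corpus_lm.get(t, 1e-9))
    return total / k
```

An "easy" query, whose top documents give the query terms much higher probability than the corpus does, yields a larger WIG than a query whose top documents look like the corpus.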
  52. Normalized Query Commitment - NQC (Shtok et al. ICTIR 2009 and TOIS 2012)
      - NQC estimates the presumed amount of query drift in the list Dq of top-retrieved documents.
      - Query expansion often uses a centroid of Dq as an expanded query model, and the centroid usually manifests query drift (Mitra et al. '98). The centroid can be viewed as a prototypical misleader: it exhibits (some) similarity to the query, but this similarity is dominated by non-query-related aspects that lead to drift.
      - Shtok et al. showed that, for several retrieval methods, the mean retrieval score of documents in the result list corresponds to the retrieval score of some centroid-based representation of Dq. Thus, the mean score represents the score of a prototypical misleader.
      - The standard deviation of the scores, which reflects their dispersion around the mean, represents the divergence of the documents' retrieval scores from that of a non-relevant document that exhibits high query similarity (the centroid):
          NQC(q) = sqrt( (1/k) Σ_{d ∈ Dq^k} (Score(d) − μ)² ) / |Score(D)|
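NQC reduces to a few lines once the top-k scores and the corpus score are available (both assumed precomputed here):

```python
import math

def nqc(top_scores, corpus_score):
    """NQC (after Shtok et al.): the standard deviation of the top-k
    retrieval scores, normalized by the magnitude of the corpus score.
    Higher NQC -> less presumed query drift -> better predicted
    effectiveness."""
    k = len(top_scores)
    mu = sum(top_scores) / k
    std = math.sqrt(sum((s - mu) ** 2 for s in top_scores) / k)
    return std / abs(corpus_score)
```

A flat score curve (all top documents scored like the centroid/misleader) gives NQC of zero; a dispersed one gives a positive value.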
  53. A Geometric Interpretation of NQC
      [Figure: geometric interpretation of NQC]
  54. Evaluating Post-Retrieval Methods (Shtok et al. TOIS 2012)
      - Prediction quality is measured by Pearson correlation and Kendall's tau.
      - As with pre-retrieval methods, there is no clear winner; QF and NQC exhibit comparable results.
      - NQC performs well over most collections but not on GOV2; QF performs well over some collections but is inferior to other predictors on ROBUST.
  55. Additional Predictors that Analyze the Retrieval-Score Distribution
      - Computing the standard deviation of retrieval scores at a query-dependent cutoff (Pérez-Iglesias & Araujo, SPIRE 2010; Cummins et al., SIGIR 2011)
      - Computing expected ranks for documents (Vinay et al., CIKM 2008)
      - Inferring AP directly from the retrieval-score distribution (Cummins, AIRS 2011)
  56. Combining Predictors
  57. Combining Post-Retrieval Predictors
      - Several efforts have integrated a few predictors using linear regression:
        - Yom-Tov et al. (SIGIR '05) combined avgIDF with the Overlap predictor
        - Zhou and Croft (SIGIR '07) integrated WIG and QF using a simple linear combination
        - Diaz (SIGIR '07) incorporated the spatial-autocorrelation predictor with Clarity and with the document-perturbation-based predictor
      - In all these trials, the combined predictor performed much better than the single predictors, suggesting that these predictors measure (at least partly) complementary properties of the retrieved results.
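A minimal sketch of such a regression-based combination: learn weights that map two predictor values to the observed AP on training queries. This is plain least squares by gradient descent with no external dependencies; in practice any off-the-shelf regressor would do, and feature values are assumed roughly normalized.

```python
def fit_linear(X, y, lr=0.1, iters=5000):
    """Least-squares fit of y ~ w.x + b by batch gradient descent.
    X: list of per-query feature rows (e.g., [WIG, QF] values);
    y: list of observed AP values for training queries."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

Given a new query's predictor values, the combined prediction is simply `sum(w_j * x_j) + b`.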
  58. Utility Estimation Framework (UEF) for Query-Performance Prediction (Shtok et al. SIGIR 2010)
      - Suppose we have a true model of relevance, R_Iq, for the information need Iq represented by the query q. Then the ranking π(Dq, R_Iq), induced over the given result list using R_Iq, is the most effective ranking for these documents.
      - Accordingly, the utility (with respect to the information need) provided by the given ranking, which can be thought of as reflecting query performance, can be defined as:
          U(Dq | Iq) = Similarity( Dq, π(Dq, R_Iq) )
      - In practice, we have no explicit knowledge of the underlying information need or of R_Iq. Using statistical decision theory principles, we can approximate the utility by estimating R_Iq:
          U(Dq | Iq) ≈ ∫ Similarity( Dq, π(Dq, R̂q) ) · p(R̂q | Iq) dR̂q
  59. Instantiating Predictors
          U(Dq | Iq) ≈ ∫ Similarity( Dq, π(Dq, R̂q) ) · p(R̂q | Iq) dR̂q
      - Relevance-model estimates (R̂q): relevance language models constructed from documents sampled from the highest ranks of some initial ranking.
      - Estimating the relevance model's presumed "representativeness" of the information need, p(R̂q | Iq): apply previously proposed predictors (Clarity, WIG, QF, NQC) to the sampled documents from which the relevance model is constructed.
      - Inter-list similarity measures: Pearson, Kendall's tau, Spearman.
      - A specific, highly effective instantiation:
        - Construct a single relevance model from the given result list, Dq
        - Use a previously proposed predictor on Dq to estimate the relevance model's effectiveness
        - Use Pearson's correlation between retrieval scores as the similarity measure
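The single-relevance-model instantiation above can be sketched in one line once its inputs exist: the original scores of Dq, the scores the relevance model assigns to the same documents (the re-ranking), and a base predictor's estimate of the model's quality. All three inputs are assumed precomputed; the function names are illustrative.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def uef_predictor(orig_scores, rm_scores, rm_quality):
    """UEF sketch: similarity (Pearson) between the original ranking's
    scores and the relevance-model scores for the same documents,
    weighted by rm_quality -- the output of any base predictor
    (Clarity, WIG, QF, NQC) applied to the documents the relevance
    model was built from."""
    return rm_quality * pearson(orig_scores, rm_scores)
```

When the relevance model agrees with the original ranking (and is itself deemed good), predicted performance is high; disagreement or a poor model drives the prediction down.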
  60. The UEF Framework - A Flow Diagram
      [Flow diagram: the query q is run by the search engine to produce Dq; documents are sampled to build a relevance model R̂q; Dq is re-ranked using the model, yielding π(Dq, R̂q); a performance predictor estimates the relevance model's quality; the similarity between the original and re-ranked lists, weighted by that estimate, yields the predicted AP(q).]
  61. Prediction Quality of UEF (Pearson correlation with true AP)
      [Bar charts over TREC4, TREC5, WT10G, ROBUST, and GOV2, comparing each base predictor with its UEF version:]
      - Clarity → UEF(Clarity): +31.4%
      - WIG → UEF(WIG): +27.8%
      - NQC → UEF(NQC): +17.7%
      - QF → UEF(QF): +24.7%
  62. Prediction Quality for ClueWeb (Category A)*

                                        TREC 2009           TREC 2010
                                        LM    LM+SpamRm     LM    LM+SpamRm
      SumVar (Zhao et al. '08)          .465  .526          .302  .312
      SumIDF                            .463  .524          .294  .292
      Clarity                           .017  -.178         -.385 -.111
      UEF(Clarity)                      .124  .473          .303  .295
      ImpClarity (Hauff et al. '08)     .348  .133          .072  -.008
      UEF(ImpClarity)                   .221  .580          .340  .366
      WIG                               .423  .542          .269  .349
      UEF(WIG)                          .236  .651          .375  .414
      NQC                               .083  .430          .269  .214
      UEF(NQC)                          .154  .633          .342  .460
      QF                                .494  .630          .368  .617
      UEF(QF)                           .637  .708          .358  .649

      * Thanks to Fiana Raiber for producing the results
  63. Are the various post-retrieval predictors that different from each other? A unified framework for explaining post-retrieval predictors (Kurland et al. ICTIR 2011)
  64. A Unified Post-Retrieval Prediction Framework
      - We want to predict the effectiveness of a corpus ranking π_M(q; D) induced by retrieval method M in response to query q.
      - Assume a true model of relevance, R_Iq, that can be used for retrieval; the resulting ranking, π_opt(q; D), is of optimal utility:
          Utility( π_M(q; D); Iq ) = sim( π_M(q; D), π_opt(q; D) )
      - Use as reference comparisons a pseudo-effective (PE) ranking and a pseudo-ineffective (PIE) ranking (cf. Rocchio '71):
          Utility^( π_M(q; D); Iq ) = α(q) · sim( π_M(q; D), π_PE(q; D) ) − β(q) · sim( π_M(q; D), π_PIE(q; D) )
  65. Instantiating Predictors
      - Focus on the result lists of the documents most highly ranked by each ranking:
          U^( π_M(q; D); Iq ) = α(q) · sim( Dq, L_PE ) − β(q) · sim( Dq, L_PIE )
      - Deriving predictors:
        1. "Guess" a PE result list (L_PE) and/or a PIE result list (L_PIE)
        2. Select the weights α(q) and β(q)
        3. Select an inter-list (ranking) similarity measure: Pearson's r, Kendall's tau, Spearman's rho
  66. U^( π_M(q; D); Iq ) = α(q) · sim( L_M, L_PE ) − β(q) · sim( L_M, L_PIE )
      - Basic idea: the PIE result list is composed of k copies of a pseudo-ineffective document.

      Predictor | Pseudo-ineffective document | Predictor's description | Sim. measure | α(q) | β(q)
      Clarity (Cronen-Townsend '02, '04) | Corpus | Estimates the focus of the result list with respect to the corpus | −KL divergence between language models | 0 | 1
      WIG (Weighted Information Gain; Zhou & Croft '07) | Corpus | Measures the difference between the retrieval scores of documents in the result list and the score of the corpus | −L1 distance of retrieval scores | 0 | 1
      NQC (Normalized Query Commitment; Shtok et al. '09) | Result-list centroid | Measures the standard deviation of the retrieval scores of documents in the result list | −L2 distance of retrieval scores | 0 | 1
  67. U^( π_M(q; D); Iq ) = α(q) · sim( L_M, L_PE ) − β(q) · sim( L_M, L_PIE )

      Predictor | Pseudo-effective result list | Predictor's description | Sim. measure | α(q) | β(q)
      QF (Query Feedback; Zhou & Croft '07) | Retrieval over the corpus using a relevance model | Measures the "amount of noise" (non-query-related aspects) in the result list | Overlap at top ranks | 1 | 0
      UEF (Utility Estimation Framework; Shtok et al. '10) | Re-rank the given result list using a relevance model | Estimates the potential utility of the result list using relevance models | Rank/score correlation | Presumed representativeness of the relevance model | 0
      Autocorrelation (Diaz '07) | 1. Score regularization; 2. Fusion | 1. The degree to which retrieval scores "respect the cluster hypothesis"; 2. Similarity with a fusion-based result list | Pearson correlation between scores | 1 | 0
  68. A General Model for Query Difficulty
  69. A Model of Query Difficulty (Carmel et al., SIGIR 2006)
      - A user with a given information need (a topic) submits a query to a search engine and judges the search results according to their relevance to this information need.
      - Thus, the queries (Q) and the Qrels (R) are two sides of the same information need; the Qrels also depend on the existing collection (C).
      - Define: Topic = (Q, R | C)
  70. A Theoretical Model of Topic Difficulty
      - Main hypothesis: topic difficulty is induced from the distances between the model parts.
  71. Model Validation: the Pearson correlation between Average Precision and the model distances (see the paper for estimates of the various distances)
      - Based on the .gov2 collection (25M documents) and 100 topics of the Terabyte tracks '04/'05.
      - [Diagram: correlations for the individual Topic/Queries/Documents distances were -0.06, 0.15, 0.17, and 0.32; combining all distances yields 0.45.]
  72. A Probabilistic Framework for QPP (Kurland et al. CIKM 2012, to appear)
  73. Post-Retrieval Prediction (Kurland et al. 2012)
      [Equation slide: the derived prediction expression includes query-independent result-list properties (e.g., cohesion/dispersion); WIG and Clarity are derived from this expression.]
  74. Using Reference Lists (Kurland et al. 2012)
      [Equation slide: the expression includes the presumed quality of the reference list — a prediction task in its own right.]
  75. Applications of Query Difficulty Estimation
  76. Main Applications for QDE
      - Feedback to the user and the system
      - Federation and metasearch
      - Content enhancement using missing-content analysis
      - Selective query expansion / query selection
      - Others
  77. Feedback to the User and the System
      - Direct feedback to the user: "We were unable to find relevant documents for your query"
      - Estimating the value of terms for query refinement: suggest which terms to add, and estimate their value
      - Personalization: which queries would benefit from personalization? (Teevan et al. '08)
  78. Federation and Metasearch: Problem Definition
      - Given several databases that might contain information relevant to a given question, how do we construct a good unified list of answers from all these datasets?
      - Similarly, given a set of search engines employed for the same query, how do we merge their results in an optimal manner?
      [Diagram: a federator merges the result lists of several search engines, each over its own collection, into one merged ranking.]
  79. Prediction-Based Federation & Metasearch
      - Weight each result set by its predicted precision (Yom-Tov et al., 2005; Sheldon et al., '11), or
      - Select the best-predicted search engine (White et al., 2008; Berger & Savoy, 2007)
  80. Metasearch and Federation Using Query Prediction
      - Train a predictor for each engine/collection pair.
      - For a given query:
        - Predict its difficulty for each engine/collection pair
        - Weight the results retrieved from each engine/collection pair accordingly, and generate the federated list
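The weighting step above can be sketched as prediction-weighted rank fusion. This is a simplified scheme (reciprocal-rank scores scaled by each engine's predicted quality), not the exact method of Yom-Tov et al.; function names are illustrative.

```python
def weighted_fusion(result_lists, predicted_quality):
    """Prediction-based federation sketch: each engine's ranked list
    contributes a reciprocal-rank score weighted by that engine's
    predicted performance for the query; documents are merged by
    total weighted score.
    result_lists: one ranked list of doc ids per engine.
    predicted_quality: one predicted-quality value per engine."""
    fused = {}
    for ranked, quality in zip(result_lists, predicted_quality):
        for r, doc in enumerate(ranked):
            fused[doc] = fused.get(doc, 0.0) + quality / (r + 1)
    return sorted(fused, key=fused.get, reverse=True)
```

An engine predicted to perform poorly on the query contributes little to the merged ranking, even for its top-ranked documents.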
  81. Metasearch Experiment
      - The LA Times collection was indexed by four search engines, and a predictor was trained for each.
      - For each query, the result list from each search engine was weighted using the prediction for that engine; the final ranking is a ranking of the union of the result lists, weighted by the predictions.

                                          P@10    %no
      Single search   SE 1                0.139   47.8
      engine          SE 2                0.153   43.4
                      SE 3                0.094   55.2
                      SE 4                0.171   37.4
      Metasearch      Round-robin         0.164   45.0
                      MetaCrawler         0.163   34.9
                      Prediction-based    0.183   31.7
  82. Federation for the TREC 2004 Terabyte Track
      - GOV2 collection: 426 GB, 25 million documents; 50 topics.
      - For federation, the collection was divided into 10 partitions of roughly equal size.

                                       P@10    MAP
      One collection                   0.522   0.292
      Prediction-based federation      0.550   0.264
      Score-based federation           0.498   0.257

      * 10-fold cross-validation
  83. Content Enhancement Using Missing-Content Analysis
      [Diagram: information needs flow as queries to the search engine over a repository; missing-content estimation analyzes the queries; a data-gathering component fetches external content for the missing topics, subject to quality estimation, and adds it to the repository.]
  84. Story Line
      - A helpdesk system in which system administrators try to find relevant solutions in a database.
      - The system administrators' queries are logged and analyzed to find topics of interest that are not covered in the current repository.
      - Additional content for the lacking topics is obtained from the internet and added to the local repository.
  85. Adding Content Where It Is Missing Improves Precision
      - Cluster the user queries; add content according to: lacking content, cluster size, or at random.
      - Adding content where it is missing brings the best improvement in precision.
  86. Selective Query Expansion
      - Automatic query expansion: improving average quality by adding search terms.
      - Pseudo-relevance feedback (PRF) considers the (few) top-ranked documents to be relevant and uses them to expand the query. PRF works well on average, but can significantly degrade retrieval performance for some queries.
      - PRF fails because of query drift: non-relevant documents infiltrate the top results, or relevant results contain aspects irrelevant to the query's original intention.
      - Rocchio's method: the expanded query combines the original query and the centroid of the top results, adding new terms to the query and reweighting the original terms:
          Qm = α · Q_original + β · Σ_{Dj ∈ D_relevant} Dj − γ · Σ_{Dk ∈ D_irrelevant} Dk
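Rocchio's update can be sketched over sparse term-weight vectors. Using centroids (means) of the relevant and irrelevant sets and dropping negative weights are common practical choices, assumed here rather than dictated by the slide's formula.

```python
def rocchio(query_vec, relevant_vecs, irrelevant_vecs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio expansion over sparse term->weight dicts:
    Q_m = alpha*Q + beta*centroid(relevant) - gamma*centroid(irrelevant).
    The alpha/beta/gamma defaults are conventional, not prescribed."""
    def centroid(vecs):
        c = {}
        for v in vecs:
            for t, w in v.items():
                c[t] = c.get(t, 0.0) + w / len(vecs)
        return c

    qm = {t: alpha * w for t, w in query_vec.items()}
    for t, w in centroid(relevant_vecs).items():
        qm[t] = qm.get(t, 0.0) + beta * w
    if irrelevant_vecs:
        for t, w in centroid(irrelevant_vecs).items():
            qm[t] = qm.get(t, 0.0) - gamma * w
    # drop terms pushed to non-positive weight
    return {t: w for t, w in qm.items() if w > 0}
```

In PRF, `relevant_vecs` would be the top-ranked documents and `irrelevant_vecs` would typically be empty.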
  87. Some Work on Selective Query Expansion
      - Expand only "easy" queries: predict which queries are easy and expand only them (Amati et al. 2004; Yom-Tov et al. 2005)
      - Estimate query drift (Cronen-Townsend et al., 2006): compare the expanded list with the unexpanded list to estimate whether expansion added too much noise
      - Select a specific query-expansion form from a list of candidates (Winaver et al. 2007)
      - Integrate various query-expansion forms by weighting them using query-performance predictors (Soskin et al., 2009)
      - Estimate how much to expand (Lv and Zhai, 2009): predict the best α in the Rocchio formula using features of the query, the documents, and the relationship between them
      - But see Azzopardi and Hauff 2009
  88. Other Uses of Query Difficulty Estimation
      - Collaborative filtering (Bellogín & Castells, 2009): measure the lack of ambiguity in a user's preferences; an "easy" user is one whose preferences are clear-cut, and who thus contributes much as a neighbor
      - Term selection (Kumaran & Carvalho, 2009): identify irrelevant query terms; queries with fewer irrelevant terms tend to have better results
      - Reducing long queries (Cummins et al., 2011)
  89. Summary & Conclusions
  90. Summary
      - In this tutorial, we surveyed the current state-of-the-art research on query-difficulty estimation for IR.
      - We discussed the reasons that cause search engines to fail for some of the queries, and that bring about high variability in performance among queries as well as among systems.
      - We summarized several approaches for query-performance prediction: pre-retrieval methods, post-retrieval methods, and combining predictors.
      - We reviewed evaluation metrics for prediction quality and the results of various evaluation studies conducted over several TREC benchmarks. These results show that existing state-of-the-art predictors are able to identify difficult queries with reasonable prediction quality.
      - However, prediction quality is still moderate and should be substantially improved in order to be widely used in IR tasks.
  91. Summary - Main Results
      - Current linguistic-based predictors do not exhibit meaningful correlation with query performance. This is quite surprising, as one would intuitively expect poor performance for ambiguous queries.
      - In contrast, statistical pre-retrieval predictors such as SumSCQ and maxVAR have relatively significant predictive ability. These pre-retrieval predictors, and a few others, perform comparably to post-retrieval methods such as Clarity, WIG, NQC, and QF over large-scale Web collections. This is counter-intuitive, as post-retrieval methods are exposed to much more information than pre-retrieval methods.
      - However, current state-of-the-art predictors still suffer from low robustness in prediction quality. This robustness problem, along with the moderate prediction quality of existing predictors, is among the greatest challenges in query-difficulty prediction and should be further explored in the future.
  92. Summary (cont.)
      - We tested whether combining several predictors may improve prediction quality, especially when the predictors are independent and measure different aspects of the query and the search results.
      - One combination method is linear regression: the regression task is to learn how to optimally combine the predicted performance values so as to best fit the actual performance values. Results were moderate, probably due to the sparseness of the training data, which over-represents the lower end of the performance values.
      - We discussed three frameworks for query-performance prediction:
        - The Utility Estimation Framework (UEF; Shtok et al. 2010): state-of-the-art prediction quality
        - A unified framework for post-retrieval prediction that sets common grounds for various previously proposed predictors (Kurland et al. 2011)
        - A fundamental framework for estimating query difficulty (Carmel et al. 2006)
  93. Summary (cont.)
      - We discussed a few applications that utilize query-difficulty estimators:
        - Handling each query individually based on its estimated difficulty
        - Finding the best terms for query refinement by measuring the expected gain in performance for each candidate term
        - Expanding the query, or not, based on the predicted performance of the expanded query
        - Personalizing the query selectively, only in cases where personalization is expected to bring value
        - Collection enhancement guided by the identification of missing-content queries
        - Fusion of search results from several sources based on their predicted quality
  94. What's Next?
      - Predicting the performance of other query types: navigational queries, XML queries (XQuery, XPath), domain-specific queries (e.g., healthcare)
      - Considering other factors that may affect query difficulty: who is the person behind the query, and in what context? Geo-spatial features, temporal aspects, personal parameters
      - Query difficulty in other search paradigms: multifaceted search, exploratory search
  95. Concluding Remarks
      - Research on query-difficulty estimation began only ten years ago, with the pioneering work on the Clarity predictor (2002). Since then, this subfield has found its place at the center of IR research; these studies have revealed alternative prediction approaches, new evaluation methodologies, and novel applications.
      - In this tutorial we covered existing performance-prediction methods, some evaluation studies, potential applications, and some anticipated future directions in the field.
      - While the progress we see is already substantial, performance prediction is still challenging and far from solved. Much more accurate predictors are required in order to be widely adopted by IR tasks.
      - We hope that this tutorial will contribute to increasing the interest in query-difficulty estimation.
  96. Thank You!