
Stochastic Simulation of Test Collections: Evaluation Scores


Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The workaround usually consists of resampling topics from an existing collection and approximating the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample. However, this methodology is clearly limited by the availability of data, the impossibility of controlling the properties of these data, and the fact that we do not really measure what we intend to. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems but on random new topics. Our ability to simulate this kind of data not only removes the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.



  1. SIGIR 2018 · July 11th · Ann Arbor. Picture by qiyang.
  2. MOTIVATION
  3-5. Experiments in IR. Core research asks "how well?": its input is the IR systems, a test collection, and the experimental conditions, and its output is effectiveness scores such as AP and P@10. Evaluation research asks "what if?": its input is those scores (AP, P@10), the number of topics, the statistical significance tests, and the conditions, and its output is p-values, Kendall τ, etc.
  6-9. Current Limitations and How We Make Do.
     1. Finite data: we are limited to dozens of systems and topics from past evaluations like TREC, so we artificially create other collections by resampling from the existing data.
     2. Unknown truth: we do not know true properties of systems, such as their mean or variance over topics, so we split the topics into two sets and consider the results on one subset as the truth.
     3. Lack of control: we cannot control properties of systems, such as the true mean; systems are how they are, so we resort to artificial modifications of effectiveness scores that lead to invalid data.
  10. PROPOSAL
  11-13. Stochastic Simulation. Build a generative model of the joint distribution of system scores and simulate scores by the same systems on new, random topics; this addresses the lack of data, the unknown truth, and the lack of control. The model is fitted to existing data to make it realistic, so it needs to be flexible enough to model real data. (Diagram: the model sits between the core-research outputs, AP and P@10 scores, and the evaluation-research inputs.)
  14. Model of the joint distribution of system scores. We use copula models, which separate (1) the marginal distributions of the individual systems from (2) the dependence structure among systems, making them easy to customize: plug and play, then simulate (see the decomposition below).
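In copula terms, for two systems with scores X and Y this separation is Sklar's decomposition of the joint distribution, where the margins F_X, F_Y and the copula C can be chosen or fitted independently:

    $F_{X,Y}(x, y) = C\big(F_X(x),\, F_Y(y)\big)$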
  15-21. Fitting and simulating the model (shown for two systems X and Y; it generalizes to several systems). Fit the model to observed scores $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$: (1) fit the margins $F_X$ and $F_Y$; (2) turn the scores into pseudo-observations $U_i = F_X(X_i)$ and $V_i = F_Y(Y_i)$; (3) fit the copula to the pseudo-observations. Alternatively, instantiate the margins and copula at will. To simulate from the model: (1) generate pseudo-observations $(U, V)$ from the copula; (2) turn them into effectiveness scores $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(V)$. (A base-R sketch of this loop follows.)
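A minimal base-R sketch of this fit-and-simulate loop for two systems, using Normal margins and a Gaussian copula purely for illustration; the actual method selects among the margin families and R-vine pair-copulas described in the next slides, and the simIReff package (slide 50) implements it in full. The toy scores and variable names are made up for the example.

    # Toy per-topic scores for two systems over n topics (illustrative data only)
    set.seed(42)
    n <- 50
    z <- rnorm(n)
    x <- pmin(pmax(0.40 + 0.15 * z + rnorm(n, 0, 0.10), 0), 1)
    y <- pmin(pmax(0.35 + 0.15 * z + rnorm(n, 0, 0.10), 0), 1)

    # 1. Fit the margins (Normal here; truncation to [0,1] ignored for simplicity)
    mx <- mean(x); sx <- sd(x)
    my <- mean(y); sy <- sd(y)

    # 2. Turn scores into pseudo-observations U_i = F_X(X_i), V_i = F_Y(Y_i)
    u <- pnorm(x, mx, sx)
    v <- pnorm(y, my, sy)

    # 3. Fit the copula (Gaussian: a single correlation on the normal scores)
    rho <- cor(qnorm(u), qnorm(v))

    # Simulate m new topics: draw pseudo-observations from the copula...
    m <- 1000
    z1 <- rnorm(m)
    z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(m)
    u.new <- pnorm(z1)
    v.new <- pnorm(z2)

    # ...and map them back through the inverse margins X = F_X^{-1}(U), Y = F_Y^{-1}(V)
    x.new <- qnorm(u.new, mx, sx)
    y.new <- qnorm(v.new, my, sy)

The simulated x.new and y.new preserve each system's score distribution as well as the correlation between the two systems across topics.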
  22. Modeling Dependencies. Gaussian copulas capture only correlation and only symmetric dependence. R-vine copulas allow tail dependence and asymmetric dependence, and are built from bivariate pair-copulas, e.g. F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), …, with roughly 40 alternatives based on 12 different families (see the factorization below).
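For example, for three systems the joint density factorizes into margins and (conditional) pair-copula densities, each of which can come from a different family; this is the standard vine construction, written here at the density level, and the F(S4,S2|S1) terms on the slide are the analogous pieces for more systems:

    $f(s_1, s_2, s_3) = f_1(s_1)\, f_2(s_2)\, f_3(s_3)\; c_{12}\big(F_1(s_1), F_2(s_2)\big)\; c_{23}\big(F_2(s_2), F_3(s_3)\big)\; c_{13|2}\big(F_{1|2}(s_1 \mid s_2),\, F_{3|2}(s_3 \mid s_2)\big)$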
  23. Modeling Margins. All effectiveness measures have discrete distributions, but for some we can fairly assume they are continuous (AP, nDCG). For others this assumption is clearly wrong, so we must preserve the support (P@10: 1, 0.9, 0.8, …; RR: 1, 1/2, 1/3, …).
  24. Margin families. Continuous: (truncated) Normal, Beta, (truncated) Normal kernel smoothing, and Beta kernel smoothing. Discrete: Beta-Binomial, discrete kernel smoothing, and discrete kernel smoothing with controlled smoothness. (A toy contrast of the two cases follows.)
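As a rough illustration of the continuous vs. discrete distinction, here is a method-of-moments Beta fit for a continuous measure and an empirical pmf on the P@10 support for a discrete one; these simple fits only stand in for the maximum-likelihood and kernel-smoothing estimators the method actually uses, and the scores are made up.

    # Continuous measure (e.g. AP): fit a Beta margin by the method of moments
    ap <- c(0.21, 0.34, 0.18, 0.45, 0.29, 0.38, 0.12, 0.27)   # toy scores
    m <- mean(ap); v <- var(ap)
    a <- m * (m * (1 - m) / v - 1)          # shape1
    b <- (1 - m) * (m * (1 - m) / v - 1)    # shape2
    curve(dbeta(x, a, b), 0, 1)             # fitted continuous density

    # Discrete measure (P@10): preserve the support {0, 0.1, ..., 1}
    p10 <- c(0.3, 0.5, 0.3, 0.7, 0.4, 0.5, 0.2, 0.6)          # toy scores
    supp <- seq(0, 1, by = 0.1)
    k <- round(p10 * 10)                                      # relevant docs in the top 10
    pmf <- tabulate(k + 1, nbins = 11) / length(k)            # empirical pmf over supp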
  25. Transform to a Predefined Mean. Problem: given a margin $F$, transform it into $\tilde{F}$ such that its mean $\tilde{\mu}$ equals a target $\mu^*$ while preserving the support. Solution: transform with a specific Beta distribution, that is, find $\alpha, \beta > 1$ such that $\tilde{\mu} = \mu^*$, where $\tilde{F}(x) = F_{Beta}(F(x); \alpha, \beta)$. (A numeric sketch follows.)
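A numeric sketch of that transform for a discrete margin, reusing the P@10 support from the previous example: with β fixed, the transformed mean is monotone in α, so a simple root search finds it. Fixing β and solving for α is just one convenient choice for this sketch; the paper's exact parametrization of (α, β) may differ.

    # Mean of the transformed distribution Ftilde(x) = F_Beta(F(x); alpha, beta)
    transformed_mean <- function(alpha, beta, supp, pmf) {
      Fx <- cumsum(pmf)              # original cdf on the support
      Ft <- pbeta(Fx, alpha, beta)   # transformed cdf (same support points)
      pt <- diff(c(0, Ft))           # transformed pmf
      sum(supp * pt)
    }

    supp <- seq(0, 1, by = 0.1)                     # P@10 support
    pmf  <- c(0, 0, 1, 2, 1, 2, 1, 1, 0, 0, 0) / 8  # an illustrative fitted pmf
    target <- 0.55                                  # desired true mean mu*
    beta0  <- 2                                     # fix beta > 1 and solve for alpha > 1
    alpha0 <- uniroot(function(a) transformed_mean(a, beta0, supp, pmf) - target,
                      interval = c(1 + 1e-6, 100))$root

With (alpha0, beta0), the transformed margin keeps the same support but has mean equal to the target.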
  26. RESULTS
  27. Data. TREC Web Track ad hoc runs from 2010 to 2014: 50 topics and 30-88 systems each, for 12,924 system-topic pairs in total. Continuous measures: AP, nDCG@20, ERR@20. Discrete measures: P@10, P@20, RR. Points of interest: (1) margins, (2) copulas, (3) simulated scores.
  28. Margins. 1,572 system-measure pairs and 5,425 successfully fitted models. By log-likelihood, kernel smoothing is selected most often (especially the discrete variants), with Normal and Beta selected in 25% of cases. By AIC and BIC, Normal and Beta are selected in 67% of cases, and Beta-Binomial in 50% of the P@k cases. After transforming all margins to the mean in the given data and selecting again, kernel smoothing is selected nearly always.
  29. Dependence. 39,627 system pairs; pair-copulas are fitted and selected by log-likelihood. There is a wide diversity of selected families, Gaussian copulas are rarely selected (correlation is not enough), and more complex models are preferred.
  30. Simulation: scores. Simulate 1,000 new topics, record the deviations from the model, $\mu - \bar{X}$ and $\sigma^2 - s^2$, and repeat 1,000 times; full knowledge of the truth is encoded in the model, so $\mu$ and $\sigma^2$ are known exactly.
  31. Simulation: dependencies. Web 2010, nDCG@20; simulate 500 new topics. The dependence among systems is captured in the model.
  32. SAMPLE APPLICATIONS
  33-36. 1. Type I Errors? Following [Voorhees, 2009]: split the topic set into two disjoint halves $D_1$ and $D_2$, compute the p-value between systems S1 and S2 on each half, and check whether the two results conflict. Limitations: limited data, unknown truth (is H0 actually true?), and no control over H0, so Type I error rates cannot be measured directly. Observed conflict rates at α = 5%: 2.8% for AP and 10.9% for P@10. (A sketch of this protocol follows.)
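A sketch of that topic-splitting protocol for a single system pair, with toy scores; the conflict criterion used here (both halves significant but in opposite directions) is one common reading of this setup, and the choice of test and split size is only illustrative.

    # Toy per-topic scores for systems S1 and S2 on 50 topics
    set.seed(1)
    x <- rbeta(50, 2, 5)
    y <- pmin(pmax(x + rnorm(50, 0, 0.1), 0), 1)

    conflict_once <- function(x, y, alpha = 0.05) {
      n <- length(x)
      half <- sample(n, n / 2)           # random split into two disjoint topic sets D1, D2
      d1 <- x[half] - y[half]
      d2 <- x[-half] - y[-half]
      p1 <- t.test(d1)$p.value           # p-value on D1
      p2 <- t.test(d2)$p.value           # p-value on D2
      # Conflict: both halves significant, but in opposite directions
      p1 < alpha && p2 < alpha && sign(mean(d1)) != sign(mean(d2))
    }

    mean(replicate(1000, conflict_once(x, y)))   # estimated conflict rate for this pair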
  37-45. 1. Type I Errors, with simulation. Because the true system distributions are known, H0 can be made true by construction and Type I error rates measured directly from the p-values. With the same margins for both systems, the Type I error rates at α = 5% and 1% are 4.9% and 0.9% for AP, and 4.9% and 1% for P@10. With margins transformed to a common mean, they are 4.9% and 0.9% for AP, and 5% and 1% for P@10. (A minimal stand-in for this experiment follows.)
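A minimal stand-in for this experiment: both systems share the same Beta margin, so the null hypothesis of equal means is true by construction, and dependence across topics comes from a Gaussian copula. The paper instead simulates from the vine-copula models fitted to real TREC data; all parameter values here are illustrative.

    set.seed(7)
    rho <- 0.6        # across-topic dependence between the two systems
    n.topics <- 50
    a <- 2; b <- 5    # shared Beta margin, so both systems have the same true mean

    type1_once <- function(alpha = 0.05) {
      z1 <- rnorm(n.topics)
      z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n.topics)
      x <- qbeta(pnorm(z1), a, b)    # system 1 scores on the simulated topics
      y <- qbeta(pnorm(z2), a, b)    # system 2 scores, same margin
      t.test(x, y, paired = TRUE)$p.value < alpha
    }

    mean(replicate(10000, type1_once()))   # should sit close to the nominal 5%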
  46. 2. Sneak Peek: Statistical Power and σ. [Webber et al., 2008] show empirical evidence of the problem of sequential testing, but are again limited by finite data and unknown truth (the true σ).
  47. TAKE HOME
  48. Today. Part of evaluation research has data-related limitations: lack of data, no knowledge of the truth, and no control; how valid are our results? We propose a methodology for stochastic simulation that eliminates these limitations: it is flexible, realistic, and highly customizable, and it allows us to study new problems directly.
  49. Tomorrow. Even more flexibility: simulate new systems for given topics; add third factors (fixed: already possible; random: we'll see); simulate full runs (document scores and relevance).
  50. simIReff. All results are fully reproducible; we developed a full R package for simulation: https://github.com/julian-urbano/simIReff

    library(simIReff)
    # data: existing effectiveness scores (topics in rows, systems in columns)
    effs <- effDiscFitAndSelect(data, support("p20"))  # fit and select discrete margins on the P@20 support
    cop <- effcopFit(data, effs)                       # fit the copula given those margins
    y <- reffcop(1000, cop)                            # simulate scores on 1,000 new random topics
