Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- The AI Rush by Jean-Baptiste Dumont 280360 views
- AI and Machine Learning Demystified... by Carol Smith 3335823 views
- 10 facts about jobs in the future by Pew Research Cent... 523206 views
- 2017 holiday survey: An annual anal... by Deloitte United S... 855927 views
- Harry Surden - Artificial Intellige... by Harry Surden 482204 views
- Inside Google's Numbers in 2017 by Rand Fishkin 1064856 views

301 views

Published on

Published in:
Science

No Downloads

Total views

301

On SlideShare

0

From Embeds

0

Number of Embeds

22

Shares

0

Downloads

3

Comments

0

Likes

1

No embeds

No notes for slide

- 1. SIGIR 2018 · July 11th · Ann ArborPicture by qiyang
- 2. MOTIVATION
- 3. Experiments in IR 3 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
- 4. Experiments in IR 4 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
- 5. Experiments in IR 5 Core research “how well?” input IR systems Evaluation research “what if?” output test collection AP P@10 conditions input no. topics stat. signif. tests output AP P@10 p-values Kendall τ conditions
- 6. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control 6 How we Make Do
- 7. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Artificially create other collections by resampling from the existing data 7 How we Make Do Limited to dozens of systems and topics from past evaluations like TREC s t ? ?
- 8. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Split in two topic sets and consider results with one subset as the truth 8 How we Make Do Don’t know true properties of systems, such as mean or variance over topics s t 𝑿 𝑿 ? 𝑿 𝑿 ?
- 9. Current Limitations 1.Finite data 2.Unknown Truth 3.Lack of Control Artificial modifications of effectiveness scores that lead to invalid data 9 How we Make Do Can’t control properties of systems, such as true mean. Systems are how they are ? -1 10 -1 10
- 10. PROPOSAL 10
- 11. Stochastic Simulation 11 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model AP P@10
- 12. • Build a generative model of the joint distribution of system scores • Simulate scores on new, random topics • Lack of data • Unknown truth • Lack of control Stochastic Simulation 12 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model
- 13. • Build a generative model of the joint distribution of system scores • Simulate scores on new, random topics • Lack of data • Unknown truth • Lack of control • Fit the model to existing data to make it realistic • Needs to be flexible to model real data Stochastic Simulation 13 Core research “how well?” IR systems Evaluation research “what if?” test collection AP P@10 no. topics stat. signif. tests AP P@10 p-values Kendall τ Model
- 14. Model • …of the joint distribution of system scores • We use copula models, which separate: 1.Marginal distributions, of individual systems 2.Dependence structure, among systems • Easy to customize: plug and play simulate 14
- 15. • Fit the model: Model 15 𝑌1, … , 𝑌𝑛 𝑋1,…,𝑋n *generalizes to several systems
- 16. • Fit the model: 1. Fit the margins Model 15 𝑌1, … , 𝑌𝑛 𝑋1,…,𝑋n *generalizes to several systems
- 17. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations Model 15 𝑌1, … , 𝑌𝑛 𝑉𝑖 = 𝐹𝑌 𝑌𝑖 𝑋1,…,𝑋n 𝑈𝑖=𝐹𝑋𝑋𝑖 *generalizes to several systems
- 18. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula Model 15 𝑌1, … , 𝑌𝑛 𝑉𝑖 = 𝐹𝑌 𝑌𝑖 𝑋1,…,𝑋n 𝑈𝑖=𝐹𝑋𝑋𝑖 *generalizes to several systems
- 19. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will Model 16 *generalizes to several systems
- 20. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will • Simulate from the model: 1. Generate pseudo- observations Model 16 𝑉 𝑈 *generalizes to several systems
- 21. • Fit the model: 1. Fit the margins 2. Turn to pseudo- observations 3. Fit the copula • Or instantiate at will • Simulate from the model: 1. Generate pseudo- observations 2. Turn into effectiveness scores Model 16 𝑌 = 𝐹𝑌 −1 𝑉 𝑉 𝑋=𝐹𝑋 −1 𝑈 𝑈 *generalizes to several systems
- 22. Modeling Dependences • Gaussian copulas – Only correlation – Only symmetric • R-Vine copulas – Allows tail dependence – Allows asymmetricity – Built from pair-copulas (bivariate) – Eg: F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), … – ~40 alternatives based on 12 different families 17
- 23. Modeling Margins • All effectiveness measures have discrete distributions, but for some we can fairly assume they’re continuous –AP, nDCG • For some others, this assumption is clearly wrong, so we must preserve the support –P@10: 1, 0.9, 0.8, … –RR: 1, 1/2, 1/3, … 18
- 24. Modeling Margins Continuous • (Truncated) Normal • Beta • (Truncated) Normal Kernel Smoothing • Beta Kernel Smoothing Discrete • Beta-Binomial • Discrete Kernel Smoothing • Discrete Kernel Smoothing w/ controlled smoothness 19
- 25. Transform to Predefined Mean Problem: given 𝐹, transform to 𝐹 such that 𝝁 = 𝝁∗ and preserving the support Solution: transform with a specific Beta find 𝛼, 𝛽 > 1 such that 𝜇 = 𝜇∗ where 𝐹 𝑥 = 𝐹𝐵𝑒𝑡𝑎 𝐹 𝑥 ; 𝛼, 𝛽 20
- 26. RESULTS 21
- 27. Data • TREC Web Ad hoc runs 2010-2014 – 50 topics and 30-88 systems each – 12924 total system-topic pairs • Continuous measures: AP, nDCG@20, ERR@20 • Discrete measures: P@10, P@20, RR • Points of Interest 1. Margins 2. Copulas 3. Simulated scores 22
- 28. • 1572 system-measure pairs • 5425 models successfully fitted • Log-Likelihood: • Kernel Smoothing (esp. discrete) • Normal & Beta 25% of cases • AIC and BIC: • Normal & Beta 67% of cases • Beta-Binomial 50% of P@k • Transform all to the mean in the given data and select again: • Kernel Smoothing nearly always 1. Margins 23
- 29. • 39627 system pairs • Fit pair-copulas and select according to Log-Likelihood • Wide diversity • Gaussian copulas rarely selected; correlation is not enough • Complex models are preferred 2. Dependence 24
- 30. • Simulate 1000 new topics and record deviations from the model • 𝜇 − 𝑋 and 𝜎2 − 𝑠2 • Repeat 1000 times • Full knowledge of truth encoded in the model 3. Simulation: Scores 25
- 31. • Web 2010, nDCG@20 • Simulate 500 new topics • Dependence captured in the model 3. Simulation: Dependencies 26
- 32. SAMPLE APPLICATIONS
- 33. [Voorhees, 2009] 1. Type I Errors? 28 S1 S2
- 34. [Voorhees, 2009] 1. Type I Errors? 28 S1 S2
- 35. [Voorhees, 2009] 1. Type I Errors? 28 𝑫 𝟐 + p-value𝑫 𝟏 + p-value conflict? S1 S2
- 36. [Voorhees, 2009] • Limited data • Unknown truth (is H0 true?) • No control over H0 • Cannot measure Type I error rates directly • Conflict rates at α=5%: • AP: 2.8% • P@10: 10.9% 1. Type I Errors? 28 𝑫 𝟐 + p-value𝑫 𝟏 + p-value conflict? S1 S2
- 37. [With simulation] Same margins 1. Type I Errors 29
- 38. [With simulation] Same margins 1. Type I Errors 29
- 39. [With simulation] Same margins 1. Type I Errors 29
- 40. [With simulation] Same margins 1. Type I Errors 29
- 41. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% 1. Type I Errors 29 p-value Type I error?
- 42. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
- 43. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
- 44. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins 1. Type I Errors 30
- 45. [With simulation] Same margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 4.9% and 1% Transformed margins • Type I errors at α=5% and 1%: • AP: 4.9% and 0.9% • P@10: 5% and 1% 1. Type I Errors 30 p-value Type I error?
- 46. 2. Sneak Peak: Statistical Power and σ 31 [Webber et al, 2008] • Show empirical evidence of the problem of sequential testing • Limited data • Unknown truth (true σ)
- 47. TAKE HOME
- 48. Today • Part of Evaluation Research has data-related limitations –Lack of data, no knowledge of truth, no control –How valid are our results? • We propose a methodology for stochastic simulation to eliminate these limitations –Flexible, realistic, highly customizable –Allows us to study new problems, directly 33
- 49. Tomorrow • Even more flexibility • Simulate new systems for given topics • Add third factors –Fixed: already possible –Random: we’ll see • Simulate full runs (doc scores & relevance) 34
- 50. simIReff • All results fully reproducible • Developed a full R-package for simulation https://github.com/julian-urbano/simIReff effs <- effDiscFitAndSelect(data, support("p20")) cop <- effcopFit(data, effs) y <- reffcop(1000, cop) 35

No public clipboards found for this slide

Be the first to comment