The Probability That
Your Hypothesis Is Correct,
Credible Intervals, and
Effect Sizes for IR Evaluation
Tetsuya Sakai
Waseda University
August 8, 2017@SIGIR2017, Tokyo.
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
The ASA’s statement on p-values [Wasserstein+16]:
1. P-values can indicate how incompatible the data are with a
specified statistical model.
2. P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone.
3. Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.
P-value = P(D+|H)
Probability of observing the observed data D or
something more extreme UNDER Hypothesis H.
Problems with classical significance testing
• Statistical significance ≠ practical significance
• Dichotomous thinking: statistically significant (p<0.05) or not
• P-values and CIs are often misunderstood (see the ASA statements)
• Even if the p-value is reported, p-value = f(effect_size, sample_size)
large effect_size (magnitude of difference) ⇒ small p-value
large sample_size (e.g. #topics) ⇒ small p-value.
So effect sizes should be reported [Sakai14SIGIRforum].
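The dependence of the p-value on both quantities can be illustrated with a one-sample (equivalently, paired) t-test, where the t statistic is the standardised sample effect size multiplied by the square root of the sample size. A minimal sketch in R; the helper name p_from_effect and the example values are hypothetical and not from the talk.

```r
# Two-sided p-value of a one-sample t-test, written as a function of the
# standardised sample effect size d = mean / sd and the sample size n.
# p_from_effect is a hypothetical helper used only for this illustration.
p_from_effect <- function(d, n) {
  t_stat <- d * sqrt(n)              # t = mean / (sd / sqrt(n))
  2 * pt(-abs(t_stat), df = n - 1)   # two-sided tail probability
}

# Same effect size, more topics: the p-value shrinks.
p_small_n <- p_from_effect(d = 0.2, n = 25)
p_large_n <- p_from_effect(d = 0.2, n = 400)

# Same number of topics, larger effect: the p-value also shrinks.
p_small_d <- p_from_effect(d = 0.1, n = 50)
p_large_d <- p_from_effect(d = 0.5, n = 50)

print(c(p_small_n = p_small_n, p_large_n = p_large_n,
        p_small_d = p_small_d, p_large_d = p_large_d))
```

The same small effect (d = 0.2) is not significant at the 0.05 level with 25 topics but is with 400, which is why the p-value alone says little about the magnitude of the difference.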
“I have learned and taught that the primary product of a research inquiry is
one or more measures of effect size, not p values” [Cohen1990]
2. Why go Bayesian
Statisticians are going Bayesian
According to [Toyoda15], over one-half of Biometrika papers published
in 2014 utilised Bayesian statistics.
William S. Gosset
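Bayes’ rule (x: data; θ: parameter, e.g. population mean)
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)}$$
• f(θ|x): posterior probability distribution of θ
• f(x|θ): likelihood of x given θ
• f(θ): prior probability distribution of θ
• f(x): normalising constant that ensures $\int f(\theta \mid x)\, d\theta = 1$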
Fisher hated the
Bayesian approach
Old criticisms of the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
On (1): STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
But classical tests also rely on assumptions.
Bayesian methods can discuss P(H|D) directly and can easily
handle various hypotheses. There really is no reason to reject them now.
[Kruschke13] Journal of Experimental Psychology
“Some people may wonder which approach, Bayesian
or NHST, is more often correct. This question has
limited applicability because in real research we never
know the ground truth: all we have is a sample of data.
[...] the relevant question is asking which method
provides the richest, most informative, and meaningful
results for any set of data. The answer is always
Bayesian estimation.”
NHST = Null Hypothesis Significance Testing
3. Bayesian basics
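f(θ|x) is governed by the kernel
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x)} \propto f(x \mid \theta)\, f(\theta)$$
• f(θ|x): posterior probability distribution of θ
• f(x): normalising constant that ensures $\int f(\theta \mid x)\, d\theta = 1$
• The property of f(θ|x) is governed by the kernel f(x|θ) f(θ).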
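f(θ|x) is now governed by the likelihood
$$f(\theta \mid x) \propto f(x \mid \theta)$$
• f(θ|x): posterior probability distribution of θ
• f(x|θ): likelihood of x given θ
We don’t know the prior, so we use a uniform distribution: f(θ) is then a
constant and drops out of the kernel.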
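Expected A Posteriori (EAP) estimate of θ
$$\hat{\theta}_{\mathrm{EAP}} = E[\theta \mid x] = \int \theta\, f(\theta \mid x)\, d\theta$$
With a uniform prior, … same as the Maximum Likelihood Estimate (MLE)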
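To make the uniform-prior case concrete, the posterior can be approximated on a grid of θ values: with a constant prior the kernel is just the likelihood, and the EAP is the posterior-weighted average of the grid. A minimal sketch in R, assuming normally distributed scores with a known standard deviation; the data values, the grid and the known-sigma assumption are all hypothetical.

```r
# Hypothetical per-topic scores and an assumed known standard deviation.
x <- c(0.31, 0.42, 0.28, 0.51, 0.39)
sigma <- 0.1

# Grid of candidate values for theta (the population mean).
theta_grid <- seq(0, 1, by = 0.001)
step <- theta_grid[2] - theta_grid[1]

# Uniform prior: the kernel of the posterior is the likelihood f(x | theta).
log_lik <- sapply(theta_grid, function(th) {
  sum(dnorm(x, mean = th, sd = sigma, log = TRUE))
})
kernel <- exp(log_lik - max(log_lik))        # rescaled for numerical stability

# Divide by the normalising constant so that the density integrates to 1.
posterior <- kernel / (sum(kernel) * step)

eap <- sum(theta_grid * posterior) * step    # posterior mean (EAP)
map <- theta_grid[which.max(posterior)]      # posterior mode on the grid

print(c(eap = eap, map = map, sample_mean = mean(x)))
```

For this symmetric posterior the EAP and the posterior mode both coincide with the sample mean, which is the maximum likelihood estimate of a normal mean.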
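Posterior variance and credible intervals
Posterior variance: how θ moves around
$$V[\theta \mid x] = \int \left(\theta - E[\theta \mid x]\right)^2 f(\theta \mid x)\, d\theta$$
Credible interval:
[Figure: posterior density of θ with area α/2 in each tail and 1 – α in the middle; the middle region is the 100(1-α)% credible interval for θ]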
Frequentist vs. Bayesian
Frequentist:
• θ is a constant!
• 95% confidence interval (CI): construct 100 CIs using 100
different samples. On average, 95 of the 100 CIs will actually contain θ.
[Figure: 100 CIs]
Bayesian:
• θ is a random variable!
• The probability that θ lies within
the 95% credible interval is 95%.
[Figure: posterior density with area α/2 in each tail and 1 – α in the middle, marking the 100(1-α)% credible interval for θ]
Old criticisms of the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
→ STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
→ Yes, but we use uniform priors,
so our EAP estimates are also MLE estimates
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
→ MCMC (Markov Chain Monte Carlo), in particular
HMC (Hamiltonian Monte Carlo) and a variant implemented in Stan, now makes this feasible.
MCMC (Markov Chain Monte Carlo)
• Methods for sampling θ repeatedly according to f(θ|x).
• Construct Markov Chains of θ values so that, after a burn-in period (B
values), all of the values obey f(θ|x).
• Collect T’ values sequentially, throw away the initial B values, to
obtain T = T’ – B realisations of θ.
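The burn-in bookkeeping above can be sketched with a random-walk Metropolis sampler, one of the simpler MCMC methods. This is only an illustration and not the sampler used in the talk; the target density, starting value and proposal width are hypothetical choices.

```r
set.seed(42)

# Hypothetical target: log of a kernel proportional to f(theta | x).
log_kernel <- function(theta) dnorm(theta, mean = 1, sd = 0.5, log = TRUE)

n_total <- 20000   # T' in the text: all values collected sequentially
burn_in <- 2000    # B: initial values to throw away

chain <- numeric(n_total)
current <- 0       # arbitrary starting value
for (t in seq_len(n_total)) {
  proposal <- current + rnorm(1, sd = 0.5)   # random-walk proposal
  # Accept with probability min(1, kernel(proposal) / kernel(current)).
  if (log(runif(1)) < log_kernel(proposal) - log_kernel(current)) {
    current <- proposal
  }
  chain[t] <- current
}

# Discard the burn-in period to obtain T = T' - B realisations.
draws <- chain[(burn_in + 1):n_total]
print(c(T = length(draws), mean = mean(draws), sd = sd(draws)))
```

Only the ratio of kernel values is needed, so the normalising constant never has to be computed.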
Things that we can obtain from the T realisations
EAP for θ: Just take the average over T realisations of θ.
Credible interval for θ: Just take the middle 100(1-α)% of the T
realisations of θ.
The above methods apply to quantities other than θ itself, including
effect sizes.
Probability that hypothesis H is correct: Just count the number of
realisations that satisfy H and divide by T!
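The three recipes above amount to a mean, a pair of quantiles and a proportion. A minimal sketch in R, using simulated numbers as a stand-in for the T realisations that a real MCMC run would produce; the distribution of the stand-in draws and the hypothesis H: θ > 0 are hypothetical.

```r
set.seed(1)

# Stand-in for T realisations of theta; real draws would come from an MCMC run.
n_draws <- 100000
theta <- rnorm(n_draws, mean = 0.05, sd = 0.03)

alpha <- 0.05

# EAP: the average over the T realisations.
eap <- mean(theta)

# Credible interval: the middle 100(1 - alpha)% of the realisations.
cred_int <- quantile(theta, probs = c(alpha / 2, 1 - alpha / 2))

# P(H | D) for H: theta > 0: count the realisations satisfying H, divide by T.
p_h <- mean(theta > 0)

print(list(eap = eap, cred_int = cred_int, p_h = p_h))
```

Because the same operations work on any quantity computed per realisation, an effect size can be summarised by first transforming each draw and then applying mean() and quantile().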
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (1)
• Given a curved surface in an ideal physical world, any object would
move on the surface while keeping the Hamiltonian (= potential
energy + kinetic energy) constant.
• The path of the object is governed by Hamiltonian’s equations of
motion, solved by the leap-frog method with parameters ε (stepsize)
and L (leapfrog steps).
• To sample from f(θ|x) (our “curved surface”), we put an object on the
surface and give it a push; after L units of time, record its position and
give it another push...
In practice, ε can be set automatically;
a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
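The push-and-record procedure can be sketched for a one-dimensional target. The following is a minimal R illustration of HMC with the leap-frog method, not Stan's implementation; the standard normal target, ε = 0.2 and L = 10 are hypothetical choices, and ε and L are fixed by hand rather than tuned automatically.

```r
set.seed(7)

# Hypothetical target: a standard normal "posterior" for a single parameter.
log_post      <- function(theta) -0.5 * theta^2   # log kernel
grad_log_post <- function(theta) -theta           # its derivative

hmc_draws <- function(n_draws, eps, L, theta0 = 0) {
  draws <- numeric(n_draws)
  theta <- theta0
  for (t in seq_len(n_draws)) {
    p_start <- rnorm(1)          # "give it a push": draw a random momentum
    theta_new <- theta
    # Leap-frog: half step for momentum, L full steps for position,
    # full momentum steps in between, and a final half step for momentum.
    p <- p_start + 0.5 * eps * grad_log_post(theta_new)
    for (l in seq_len(L)) {
      theta_new <- theta_new + eps * p
      if (l < L) p <- p + eps * grad_log_post(theta_new)
    }
    p <- p + 0.5 * eps * grad_log_post(theta_new)
    # Hamiltonian = potential energy (-log kernel) + kinetic energy.
    h_start <- -log_post(theta) + 0.5 * p_start^2
    h_end   <- -log_post(theta_new) + 0.5 * p^2
    # Accept with probability min(1, exp(h_start - h_end)); this is close
    # to 1 when the leap-frog steps nearly conserve the Hamiltonian.
    if (log(runif(1)) < h_start - h_end) theta <- theta_new
    draws[t] <- theta
  }
  draws
}

draws <- hmc_draws(n_draws = 5000, eps = 0.2, L = 10)
print(c(mean = mean(draws), sd = sd(draws)))
```

The recorded positions should have mean close to 0 and standard deviation close to 1, matching the target.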
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (2)
With HMC, the acceptance ratio r of each proposed move is often close to 1
⇒ few rejections = efficient sampling
Stan’s criteria for checking the chains
• Rhat checks convergence by comparing the variance across multiple
chains with that within the chains.
• n_eff quantifies sampling efficiency: “you have obtained a sample of
size T, but that’s worth a sample of size n_eff when there is zero
correlation within the chain.”
In practice, we let T=100,000, which is much more than enough, so we do
not have to worry about convergence.
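The comparison of across-chain and within-chain variance is the idea behind the potential scale reduction statistic, usually written R-hat. A minimal sketch of its basic form in R; Stan reports a more refined variant, so its values may differ slightly, and the simulated chains below are hypothetical.

```r
# chains: numeric matrix with one column per chain and one row per retained draw.
rhat <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  b <- n * var(chain_means)             # between-chain variance
  w <- mean(apply(chains, 2, var))      # average within-chain variance
  var_plus <- (n - 1) / n * w + b / n   # pooled estimate of the posterior variance
  sqrt(var_plus / w)
}

set.seed(3)
# Four hypothetical chains that all sample the same distribution ...
good <- matrix(rnorm(4 * 2000), ncol = 4)
# ... and a variant in which one chain is stuck in a different region.
bad <- good
bad[, 1] <- bad[, 1] + 3

print(c(good = rhat(good), bad = rhat(bad)))
```

Values close to 1 indicate that the chains agree with one another; a value well above 1, as in the second case, signals that they have not converged to the same distribution.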
HMC/Stan vs. other MCMC methods
[Hoffman+14] HMC’s features “allow it to converge to high-dimensional
target distributions much more quickly than simpler methods such as
random walk Metropolis[-Hastings] or Gibbs sampling”
[Kruschke15] “HMC can be more effective than the various samplers in JAGS
and BUGS, especially for large complex models. [...] However, Stan is not
universally faster or better (at this stage in its development).”
In IR,
[Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs
Sampler), an open-source implementation of BUGS (Bayesian inference
Using Gibbs Sampling);
[Zhang+16] used Metropolis-Hastings.
4. Proposals
Consider the problem of comparing two means
Paired test / Unpaired (=Two-sample) test
[Figure: S1’s scores and S2’s scores, shown for the paired and the unpaired setting]
Classical one-sided test:
H0 : S1 = S2 H1 : S1 > S2
p-value = P(D+|S1 = S2)
Probability of observing the observed data D
or something more extreme under H0
But what we really want is...
With the Bayesian approach, we can easily obtain
P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by
simply counting the number of realisations that satisfy S1>S2
and dividing by T!
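Statistical models (classical and Bayesian)
Paired test
• Classical paired t-test: score differences obey a normal distribution
• Bayesian: scores obey a bivariate normal distribution
Unpaired (=Two-sample) test
• S1’s scores obey a normal distribution N(μ1, σ1²)
• S2’s scores obey a normal distribution N(μ2, σ2²)
• Classical test: Welch’s t-test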
Stan code
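For reference, the classical counterparts named above can be run with base R's t.test. A minimal sketch; the per-topic scores are hypothetical and only serve to show the two calls side by side.

```r
# Hypothetical per-topic scores for two runs over the same 8 topics.
s1 <- c(0.42, 0.35, 0.51, 0.28, 0.66, 0.39, 0.47, 0.58)
s2 <- c(0.38, 0.30, 0.49, 0.25, 0.60, 0.41, 0.40, 0.52)

# Paired one-sided test of H1: S1 > S2 (works on the per-topic differences).
paired <- t.test(s1, s2, alternative = "greater", paired = TRUE)

# Unpaired one-sided test; var.equal = FALSE selects Welch's t-test.
welch <- t.test(s1, s2, alternative = "greater", var.equal = FALSE)

print(c(paired_p = paired$p.value, welch_p = welch$p.value))
```

With these scores the paired test yields a much smaller p-value than Welch's test, because the per-topic differences vary far less than the scores themselves.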
Effect size (Glass’s Δ) [Okubo+12]
• In the context of classical significance testing, [Sakai14SIGIRforum] stresses
the importance of reporting sample effect sizes and confidence intervals
• Given your system S1 and a well-known baseline S2 (e.g. BM25), let’s
consider the difference between S1 and S2 standardised by an “ordinary”
standard deviation (i.e., that of the baseline):
$$\Delta = \frac{\mu_1 - \mu_2}{\sigma_2}$$
• With classical tests, this can simply be estimated from the sample.
• With Bayesian tests, an EAP and a credible interval for Δ can easily be
obtained from the T realisations.
Unlike Cohen’s d/Hedges’ g, free from the
homoscedasticity assumption
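The classical sample estimate divides the difference in sample means by the baseline's sample standard deviation. A minimal sketch in R; the function name glass_delta and the scores are hypothetical.

```r
# Sample Glass's delta: difference in means standardised by the standard
# deviation of the baseline (second argument) only.
glass_delta <- function(s1, s2) {
  (mean(s1) - mean(s2)) / sd(s2)
}

# Hypothetical per-topic scores: s1 is the proposed system, s2 the baseline.
s1 <- c(0.42, 0.35, 0.51, 0.28, 0.66, 0.39, 0.47, 0.58)
s2 <- c(0.38, 0.30, 0.49, 0.25, 0.60, 0.41, 0.40, 0.52)

print(glass_delta(s1, s2))
```

Only the baseline's standard deviation enters the denominator, so no pooling of the two variances is needed.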
Proposal summary
• Classical paired and unpaired t-tests can easily be replaced by
Bayesian tests.
• Report the following in papers:
- EAP for the difference in population means;
- 95% credible interval for the above difference;
- P(S1>S2|D), i.e., probability that H: S1>S2 is correct;
- EAP for Glass’s Δ;
- 95% credible interval for the above Δ.
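The five quantities listed above can all be computed from the same set of realisations. A minimal sketch in R, assuming vectors of realisations for the two population means and for the baseline's standard deviation; the function name bayes_report and the simulated stand-in draws are hypothetical.

```r
# mu1, mu2, sigma2: vectors of T realisations, aligned draw by draw.
bayes_report <- function(mu1, mu2, sigma2, alpha = 0.05) {
  diff  <- mu1 - mu2        # difference in population means, per realisation
  glass <- diff / sigma2    # Glass's delta, per realisation
  probs <- c(alpha / 2, 1 - alpha / 2)
  list(
    eap_diff   = mean(diff),
    ci_diff    = unname(quantile(diff, probs)),
    p_s1_gt_s2 = mean(diff > 0),   # P(S1 > S2 | D)
    eap_glass  = mean(glass),
    ci_glass   = unname(quantile(glass, probs))
  )
}

set.seed(11)
# Stand-in draws; real ones would come from the fitted model.
n_draws <- 50000
mu1    <- rnorm(n_draws, mean = 0.45, sd = 0.02)
mu2    <- rnorm(n_draws, mean = 0.40, sd = 0.02)
sigma2 <- abs(rnorm(n_draws, mean = 0.12, sd = 0.01))

report <- bayes_report(mu1, mu2, sigma2)
str(report)
```

The difference and Δ are computed per realisation before summarising, which preserves any dependence between the parameters in real posterior draws.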
5. Experiments: Bayesian vs. Frequentist
Purpose of the experiments
For both paired and unpaired tests:
• How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis)
differ from the p-value P(D+|S1=S2)?
• How does the Bayesian credible interval differ from the classical CI?
• How does the Bayesian EAP for Glass’s Δ differ from the classical
sample-based Δ?
Will the Bayesian approach turn the IR literature upside down?
20 runs = 190 run pairs were compared
(w/o considering the familywise error rate)
P(S1<S2|D) vs p-value for paired tests
Bottom line:
p-value can be regarded as a
reasonable approximation
of P(S1<S2|D) which is what we
really want!
More results in paper
Credible vs Confidence intervals for paired tests
Bottom line: CI can be
regarded as a reasonable
approximation of
credible interval which
is what we really want!
More results in paper
EAP Δ vs sample Δ for paired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
More results in paper
EAP Δ vs sample Δ for paired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score range of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
P(S1<S2|D) vs p-value for unpaired tests
Bottom line:
p-value can be regarded as a
reasonable approximation
of P(S1<S2|D) which is what we
really want!
More results in paper
Credible vs Confidence intervals for unpaired tests
Bottom line: CI can be
regarded as a reasonable
approximation of
credible interval which
is what we really want!
More results in paper
EAP Δ vs sample Δ for unpaired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
same sample Δ from the paired test experiments
EAP Δ vs sample Δ for unpaired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score range of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
6. How to go Bayesian, with R and stan
Install R, rstan and Rtools (for Windows)
1. Install R from
2. In R, install a package called rstan, e.g. with install.packages("rstan")
3. Install Rtools from
(check “EDIT the system PATH”)
Install my sample scripts
1. Download
2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R .
3. Move sample score files run1_vs_run2.paired.R and
run1_vs_run2.unpaired.R also to C:/work/R .
Try running the scripts (paired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan) # provides stan()
scr <- "C:/work/R/modelPaired2.stan" # model written in Stan (see sample file)
par <- c("mu", "Sigma", "rho", "delta", "glass") # mean, variances, correlation, delta, Glass's delta
war <- 1000 # burn-in (warm-up) iterations per chain
ite <- 21000 # iterations per chain, including burn-in
see <- 1234 # random seed
dig <- 3 # significant digits in the printed summary
cha <- 5 # chains (number of realisations T = (ite - war) * cha)
source("C:/work/R/run1_vs_run2.paired.R") # per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" # base name of the output files
data <- list(N = N, x = x) # sample size and per-topic scores
# Compile the Stan file and execute sampling (HMC may be used instead of NUTS)
fit <- stan(file = scr, model_name = scr, data = data, pars = par, verbose = FALSE,
            seed = see, algorithm = "NUTS", chains = cha, warmup = war, iter = ite,
            sample_file = outfile)
print(fit, pars = par, digits_summary = dig)
Five output csv files:
Try running the scripts (paired test) (2)
Try running the scripts (paired test) (3)
Switch to a UNIX-like environment and process the output csv files
using this shell script.
Script arguments: start line number after burn-in; T; #chains (= #csv files)
The script outputs:
• EAP/credible interval for the difference
• EAP/credible interval for the Δ using S2’s standard deviation
• P(S1>S2|D) and P(S1<S2|D)
• EAP/credible interval for the correlation coefficient
The thresholds for the probabilities can be modified by editing the script
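As an alternative to a shell script, the same kind of post-processing can be sketched in R. This is not the author's script: the helper read_chain_csv is hypothetical, and it assumes that each output csv file marks comment lines with "#", stores the warm-up draws first, and has a column named delta holding the difference between the two means.

```r
# Read one chain's csv file and drop the first n_warmup draws.
# read_chain_csv is a hypothetical helper, not part of rstan.
read_chain_csv <- function(path, n_warmup) {
  draws <- read.csv(path, comment.char = "#")   # skip '#' comment lines
  if (n_warmup > 0) draws <- draws[-seq_len(n_warmup), , drop = FALSE]
  draws
}

# Self-contained demonstration with a tiny fake chain file
# (2 warm-up draws followed by 3 retained draws).
demo_path <- tempfile(fileext = ".csv")
writeLines(c("# fake header comment",
             "lp__,delta,glass",
             "-1.0,0.10,0.50",
             "-1.1,0.20,0.60",
             "# fake mid-file comment",
             "-0.9,0.03,0.15",
             "-0.8,-0.01,-0.05",
             "-0.7,0.04,0.20"), demo_path)

chain <- read_chain_csv(demo_path, n_warmup = 2)
eap_delta  <- mean(chain$delta)       # EAP for the difference
p_s1_gt_s2 <- mean(chain$delta > 0)   # P(S1 > S2 | D)
print(c(eap_delta = eap_delta, p_s1_gt_s2 = p_s1_gt_s2))
```

With several chains, the per-chain data frames can be stacked, for example with do.call(rbind, lapply(paths, read_chain_csv, n_warmup = 1000)), before computing the summaries; paths stands for a vector of the csv file names.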
Try running the scripts (unpaired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan) # provides stan()
scr <- "C:/work/R/modelUnpaired2.stan" # model written in Stan (see sample file)
par <- c("mu1", "mu2", "sigma1", "sigma2", "delta", "glass") # means, s.d.'s, delta, Glass's delta
war <- 1000 # burn-in (warm-up) iterations per chain
ite <- 21000 # iterations per chain, including burn-in
see <- 1234 # random seed
dig <- 3 # significant digits in the printed summary
cha <- 5 # chains (number of realisations T = (ite - war) * cha)
source("C:/work/R/run1_vs_run2.unpaired.R") # per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" # base name of the output files
data <- list(N1 = N1, N2 = N2, x1 = x1, x2 = x2) # sample sizes and per-topic scores
# Compile the Stan file and execute sampling (HMC may be used instead of NUTS)
fit <- stan(file = scr, model_name = scr, data = data, pars = par, verbose = FALSE,
            seed = see, algorithm = "NUTS", chains = cha, warmup = war, iter = ite,
            sample_file = outfile)
print(fit, pars = par, digits_summary = dig)
Five output csv files:
Try running the scripts (unpaired test) (2)
Try running the scripts (unpaired test) (3)
Switch to a UNIX-like environment and process the output csv files
using this shell script.
Script arguments: start line number after burn-in; T; #chains (= #csv files)
The script outputs:
• EAP/credible interval for the difference
• EAP/credible interval for the Δ using S2’s standard deviation
• P(S1>S2|D) and P(S1<S2|D)
The thresholds for the probabilities can be modified by editing the script
Just as an unpaired t-test p-value is larger than the corresponding paired t-test p-value,
P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger than that with the
paired one (0.006). The unpaired Bayesian test also overestimates Δ (0.222 vs. 0.189).
7. Takeaways again
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
Acknowledgements: I thank
• Professor Hideki Toyoda (Waseda University)
For letting me tweak his R code and for his wonderful books on
Bayesian tests (in Japanese).
Professor Toyoda’s original code is available at
(with comments in Japanese)
• Dr. Matthew Ekstrand-Abueg (Google)
For letting me play with the TREC Temporal Summarisation results!
References (1)
[Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016.
[Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances.
Philosophical Transactions of the Royal Society of London, 53, 1763.
[Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
[Carterette15] Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR 2015.
[Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
[Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd, 1970.
[Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental
Psychology: General, 142(2), 2013.
[Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
References (2)
[Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
[Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain
Monte Carlo, Chapman & Hall, 2011.
[Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence
Interval, and Power. Keiso Shobo, 2012.
[Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
[Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by
Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
[Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose.
The American Statistician, 2016.
[Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
[Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs,
Justice, and Lives. The University of Michigan Press, 2008.

The Future of Platform Engineering
Jemma Hussein Allen
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™


  • 1. The Probability That Your Hypothesis Is Correct, Credible Intervals, and Effect Sizes for IR Evaluation Tetsuya Sakai Waseda University @tetsuyasakai August 8, 2017@SIGIR2017, Tokyo.
  • 2. Takeaways • Many statisticians now use Bayesian statistics to discuss P(H|D) instead of the classical p-values (P(D+|H)). Likewise, the IR community should not shy away from the Bayesian approach, since it enables us to easily discuss P(H|D) for virtually any Hypothesis H. • Results in the IR literature which relied on classical significance tests are not necessarily wrong, since P(H|D) and credible intervals are actually quite similar to p-values and confidence intervals. • Starting today, report the right statistics, including the effect sizes, using Bayesian statistics (perhaps along with classical ones). Simple tools are available from:
  • 3. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 4. 1. P-values can indicate how incompatible the data are with a specified statistical model. 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. [Wasserstein+16]
  • 5. 4. Proper inference requires full reporting and transparency. 5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. P-value = P(D+|H) Probability of observing the observed data D or something more extreme UNDER Hypothesis H. [Wasserstein+16]
  • 6. Problems with classical significance testing • Statistical significance ≠ practical significance • Dichotomous thinking: statistically significant (p<0.05) or not • P-values and CIs are often misunderstood (see the ASA statements) • Even if the p-value is reported, p-value = f(effect_size, sample_size): a large effect_size (magnitude of difference) ⇒ a small p-value, and a large sample_size (e.g. #topics) ⇒ a small p-value. So effect sizes should be reported [Sakai14SIGIRforum]. “I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p values” [Cohen1990]
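The dependence of the p-value on both effect size and sample size can be checked numerically. The following is a minimal sketch, assuming a one-sided paired t-test in which the t statistic equals the standardised mean difference times the square root of the number of topics; the function name `p_value` and the chosen effect sizes and topic counts are illustrative and do not come from the slides.

```r
# One-sided paired t-test p-value as a function of effect size and sample size.
# d: standardised mean difference (mean of the differences / s.d. of the differences)
# n: number of topics
p_value <- function(d, n) {
  t_stat <- d * sqrt(n)                       # t = dbar / (s / sqrt(n))
  pt(t_stat, df = n - 1, lower.tail = FALSE)  # upper-tail probability under H0
}

# Same (small) effect size, increasing number of topics
p_fixed_effect <- p_value(0.2, c(10, 50, 200))

# Same number of topics, increasing effect size
p_fixed_n <- p_value(c(0.1, 0.3, 0.5), 50)

print(p_fixed_effect)
print(p_fixed_n)
```

In this sketch the same effect size of 0.2 is not significant at the 0.05 level with 10 topics but is with 200, which is why the p-value alone says little about the magnitude of the difference.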
  • 8. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 9. Statisticians are going Bayesian According to [Toyoda15], over one-half of Biometrika papers published in 2014 utilised Bayesian statistics. William S. Gosset
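  • 10. Bayes’ rule (x: data; θ: parameter, e.g. population mean): $f(\theta|x) = \frac{f(x|\theta)\,f(\theta)}{f(x)}$, where $f(\theta|x)$ is the posterior probability distribution of θ, $f(x|\theta)$ is the likelihood of x given θ, $f(\theta)$ is the prior probability distribution of θ, and $f(x) = \int f(x|\theta)\,f(\theta)\,d\theta$ is the normalising constant that ensures $\int f(\theta|x)\,d\theta = 1$ [Bayes1763]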
  • 12. Old criticisms on the Bayesian approach (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x).
  • 13. Old criticisms on the Bayesian approach
      (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. STILL A VALID CRITICISM: even a noninformative prior (e.g. uniform distribution) is a subjective choice. But classical tests also rely on assumptions.
      (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x). NO LONGER VALID: RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
      Bayesian methods can discuss P(H|D) directly and can easily handle various hypotheses. There really is no reason to reject them now.
  • 14. [Kruschke13] Journal of Experimental Psychology “Some people may wonder which approach, Bayesian or NHST, is more often correct. This question has limited applicability because in real research we never know the ground truth: all we have is a sample of data. [...] the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” NHST = Null Hypothesis Significance Testing
  • 15. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
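  • 16. f(θ|x) is governed by the kernel: $f(\theta|x) = \frac{f(x|\theta)\,f(\theta)}{f(x)} \propto f(x|\theta)\,f(\theta)$. The normalising constant $f(x)$ only ensures $\int f(\theta|x)\,d\theta = 1$ and does not depend on θ, so the properties of the posterior probability distribution of θ are governed by the kernel $f(x|\theta)\,f(\theta)$.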
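  • 17. f(θ|x) is now governed by the likelihood: we don’t know the prior, so we use a uniform distribution, which gives $f(\theta|x) \propto f(x|\theta)$, i.e. the posterior probability distribution of θ is proportional to the likelihood of x given θ. The Expected A Posteriori (EAP) estimate of θ is $\hat{\theta}_{\mathrm{EAP}} = \int \theta\, f(\theta|x)\,d\theta$. With a uniform prior, this is the same as the Maximum Likelihood Estimate (MLE).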
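  • 18. Posterior variance and credible intervals. Posterior variance (how θ moves around): $V[\theta|x] = \int (\theta - \hat{\theta}_{\mathrm{EAP}})^2 f(\theta|x)\,d\theta$, where $\hat{\theta}_{\mathrm{EAP}}$ is the EAP estimate of θ. Credible interval: the 100(1-α)% credible interval for θ is the central region of f(θ|x) that holds probability 1 – α, leaving α/2 in each tail.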
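For a single parameter, the posterior, the EAP, the posterior variance and a credible interval can all be approximated on a grid without any sampling. The sketch below is an illustration only: it assumes normally distributed scores with a known standard deviation `sigma`, a uniform prior on [0, 1], and simulated scores `x`; none of these choices come from the slides.

```r
set.seed(1)
x <- rnorm(20, mean = 0.3, sd = 0.1)     # hypothetical per-topic scores
sigma <- 0.1                              # assumed known, for illustration only

theta <- seq(0, 1, length.out = 10001)   # grid over the parameter; uniform prior on [0, 1]
step <- theta[2] - theta[1]

# Log-likelihood of the data at each grid point
log_lik <- sapply(theta, function(t) sum(dnorm(x, mean = t, sd = sigma, log = TRUE)))

# Kernel = likelihood x (constant) prior, rescaled to avoid underflow
kernel <- exp(log_lik - max(log_lik))
post <- kernel / sum(kernel * step)       # dividing by the normalising constant

eap <- sum(theta * post * step)                  # posterior mean (EAP)
post_var <- sum((theta - eap)^2 * post * step)   # posterior variance

# 95% credible interval: cut alpha/2 from each tail of the posterior
alpha <- 0.05
cdf <- cumsum(post * step)
ci <- c(theta[which(cdf >= alpha / 2)[1]], theta[which(cdf >= 1 - alpha / 2)[1]])

mle <- theta[which.max(log_lik)]          # grid approximation of the MLE
print(c(eap = eap, mle = mle, var = post_var, lower = ci[1], upper = ci[2]))
```

In this symmetric example the EAP and the MLE agree up to the grid resolution. Grids stop being practical once there are several parameters, which is where sampling methods come in.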
  • 19. Frequentist vs. Bayesian
      Frequentist: • θ is a constant! • 95% confidence interval (CI) means: construct 100 CIs using 100 different samples; about 95 of the 100 CIs will actually contain θ, the constant.
      Bayesian: • θ is a random variable! • The probability that θ lies within the 95% credible interval is 95%.
  • 20. Old criticisms on the Bayesian approach
      (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. STILL A VALID CRITICISM: even a noninformative prior (e.g. uniform distribution) is a subjective choice. Yes, but we use uniform priors, so our EAP estimates are also MLE estimates.
      (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x). NO LONGER VALID: RELIABLE SAMPLING-BASED SOLUTIONS EXIST! MCMC (Markov Chain Monte Carlo), in particular HMC (Hamiltonian Monte Carlo) and a variant implemented in stan.
  • 21. MCMC (Markov Chain Monte Carlo) • Methods for sampling θ repeatedly according to f(θ|x). • Construct Markov Chains of θ values so that, after a burn-in period (B values), all of the values obey f(θ|x). • Collect T’ values sequentially, throw away the initial B values, to obtain T = T’ – B realisations of θ.
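The collect-then-discard procedure above can be illustrated with the simplest MCMC method, random-walk Metropolis. This is a deliberately simpler sampler than the HMC/NUTS used in the talk, shown only to make the burn-in idea concrete. The simulated scores, the known standard deviation of 0.2, the uniform prior, and the names `log_kernel` and `metropolis` are all assumptions made for this sketch.

```r
set.seed(42)
x <- rnorm(30, mean = 0.4, sd = 0.2)   # hypothetical scores

# Log of the posterior kernel: log-likelihood plus a constant (uniform prior)
log_kernel <- function(theta) sum(dnorm(x, mean = theta, sd = 0.2, log = TRUE))

metropolis <- function(n_total, burn_in, start, step_sd) {
  chain <- numeric(n_total)
  current <- start
  current_lk <- log_kernel(current)
  for (i in seq_len(n_total)) {
    proposal <- current + rnorm(1, sd = step_sd)   # random-walk proposal
    proposal_lk <- log_kernel(proposal)
    # Accept with probability min(1, kernel(proposal) / kernel(current))
    if (log(runif(1)) < proposal_lk - current_lk) {
      current <- proposal
      current_lk <- proposal_lk
    }
    chain[i] <- current
  }
  chain[(burn_in + 1):n_total]   # throw away the initial B values
}

# T' = 21000 values collected, B = 1000 discarded, T = 20000 realisations kept
draws <- metropolis(n_total = 21000, burn_in = 1000, start = 0, step_sd = 0.1)
print(c(mean = mean(draws), sd = sd(draws)))
```

The chain starts far from the bulk of the posterior (at 0), which is exactly the situation the burn-in period is meant to absorb.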
  • 22. Things that we can obtain from the T realisations EAP for θ: Just take the average over T realisations of θ. Credible interval for θ: Just take the middle 100(1-α)% of the T realisations of θ. The above methods apply to quantities other than θ itself, including effect sizes. Probability that hypothesis H is correct: Just count the number of realisations that satisfy H and divide by T!
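Each of these summaries is a one-liner once the realisations are available. In the sketch below, `theta` is a stand-in for real MCMC output (it is simply drawn from a normal distribution so that the example runs on its own), and the hypotheses tested are arbitrary illustrations.

```r
set.seed(7)
# Stand-in for T = 100,000 MCMC realisations of theta (NOT real sampler output)
theta <- rnorm(100000, mean = 0.05, sd = 0.03)

# EAP: the average over the T realisations
eap <- mean(theta)

# 95% credible interval: the middle 95% of the realisations
ci <- quantile(theta, probs = c(0.025, 0.975))

# P(H|D): count the realisations that satisfy H and divide by T
p_positive <- mean(theta > 0)       # H: theta > 0
p_practical <- mean(theta > 0.02)   # H: theta exceeds a (hypothetical) practical threshold

print(c(eap = eap, ci, p_positive = p_positive, p_practical = p_practical))
```

The last line shows why the approach handles "virtually any hypothesis": changing H only changes the condition inside `mean()`.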
  • 23. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]: a state-of-the-art MCMC method (1) • Given a curved surface in an ideal physical world, any object would move on the surface while keeping the Hamiltonian (= potential energy + kinetic energy) constant. • The path of the object is governed by Hamilton’s equations of motion, solved by the leap-frog method with parameters ε (stepsize) and L (number of leapfrog steps). • To sample from f(θ|x) (our “curved surface”), we put an object on the surface and give it a push; after L units of time, record its position and give it another push... In practice, ε can be set automatically; a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
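  • 24. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]: a state-of-the-art MCMC method (2) Each recorded position is a proposal that is accepted with probability min(1, r), where r compares the Hamiltonian before and after the leap-frog path. With HMC, r is often close to 1 ⇒ few rejections = efficient sampling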
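The push, leap-frog and accept steps of HMC can be sketched for a one-dimensional toy target. This is an illustration under stated assumptions, not Stan's implementation: the target is a standard normal (so the potential energy is θ²/2), the step size `eps` and the number of steps `L` are fixed by hand rather than tuned, and the function names are hypothetical.

```r
set.seed(123)
U <- function(theta) theta^2 / 2     # potential energy = -log f(theta|x) up to a constant
grad_U <- function(theta) theta      # its gradient

# Leap-frog integration of Hamilton's equations: L steps of size eps
leapfrog <- function(theta, p, eps, L) {
  p <- p - eps / 2 * grad_U(theta)             # initial half step for the momentum
  for (l in seq_len(L)) {
    theta <- theta + eps * p                   # full step for the position
    if (l < L) p <- p - eps * grad_U(theta)    # full step for the momentum
  }
  p <- p - eps / 2 * grad_U(theta)             # final half step for the momentum
  list(theta = theta, p = p)
}

hmc <- function(n, eps, L, start = 0) {
  draws <- numeric(n)
  accepted <- 0
  theta <- start
  for (i in seq_len(n)) {
    p0 <- rnorm(1)                             # the random "push"
    new <- leapfrog(theta, p0, eps, L)
    h_old <- U(theta) + p0^2 / 2               # Hamiltonian before the path
    h_new <- U(new$theta) + new$p^2 / 2        # Hamiltonian after the path
    r <- exp(h_old - h_new)                    # close to 1 when energy is conserved
    if (runif(1) < r) {
      theta <- new$theta
      accepted <- accepted + 1
    }
    draws[i] <- theta
  }
  list(draws = draws, accept_rate = accepted / n)
}

out <- hmc(n = 5000, eps = 0.1, L = 20)
print(c(accept_rate = out$accept_rate, mean = mean(out$draws), sd = sd(out$draws)))
```

Because the leap-frog method nearly conserves the Hamiltonian, almost every proposal is accepted here, in contrast with a random-walk sampler that typically rejects a large share of its proposals.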
  • 25. Stan’s criteria for checking the chains • Rhat checks convergence by comparing the variance across multiple chains with that within the chains. • n_eff (effective sample size) quantifies sampling efficiency: “you have obtained a sample of size T, but that’s worth a sample of size n_eff when there is zero correlation within the chain.” In practice, we let T=100,000, which is more than enough that convergence is not a concern.
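The two chain checks, commonly reported as Rhat and n_eff, can be approximated in a few lines. The sketch below uses the classic between/within-chain variance ratio and a crude autocorrelation-based effective sample size; Stan's own estimators are more refined, so these functions (`rhat`, `n_eff`) are simplified stand-ins, and the chains are simulated rather than sampled from a real posterior.

```r
set.seed(2017)

# Classic potential scale reduction factor; chains: matrix with one column per chain
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # average within-chain variance
  B <- n * var(colMeans(chains))      # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

# Crude effective sample size: T / (1 + 2 * sum of the initial positive autocorrelations)
n_eff <- function(chain, max_lag = 200) {
  rho <- as.numeric(acf(chain, lag.max = max_lag, plot = FALSE)$acf)[-1]
  cut <- which(rho <= 0)[1]                         # first non-positive autocorrelation
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(chain) / (1 + 2 * sum(rho))
}

good <- matrix(rnorm(4 * 2000), ncol = 4)   # four chains that agree with each other
bad <- good
bad[, 1] <- bad[, 1] + 3                    # one chain stuck in a different region

sticky <- as.numeric(arima.sim(list(ar = 0.9), n = 5000))   # strongly autocorrelated chain

print(c(rhat_good = rhat(good), rhat_bad = rhat(bad)))
print(c(n_eff_independent = n_eff(good[, 1]), n_eff_sticky = n_eff(sticky)))
```

Values of Rhat close to 1 indicate that the chains agree; the autocorrelated chain of 5,000 values is worth far fewer independent values.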
  • 26. HMC/Stan vs. other MCMC methods [Hoffman+14] HMC’s features “allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis[-Hastings] or Gibbs sampling” [Kruschke15] “HMC can be more effective than the various samplers in JAGS and BUGS, especially for large complex models. [...] However, Stan is not universally faster or better (at this stage in its development).” In IR, [Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs Sampler), an open-source implementation of BUGS (Bayesian inference Using Gibbs Sampling); [Zhang+16] used Metropolis-Hastings.
  • 27. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 28. Consider the problem of comparing two means (S1’s scores vs. S2’s scores), with either a paired test or an unpaired (= two-sample) test. Classical one-sided test: H0: S1 = S2; H1: S1 > S2; p-value = P(D+|S1 = S2), the probability of observing the observed data D or something more extreme under H0.
  • 29. But what we really want is... For the same paired and unpaired (= two-sample) settings, the classical one-sided test (H0: S1 = S2; H1: S1 > S2) gives p-value = P(D+|S1 = S2). With the Bayesian approach, we can easily obtain P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by simply counting the number of realisations that satisfy S1>S2 and dividing by T!
  • 30. Statistical models (classical and Bayesian)
      Paired test: the classical paired t-test assumes that the score differences obey $N(\mu, \sigma^2)$; the Bayesian test assumes that the score pairs obey a bivariate normal distribution.
      Unpaired (= two-sample) test: S1’s scores obey $N(\mu_1, \sigma_1^2)$ and S2’s scores obey $N(\mu_2, \sigma_2^2)$. Classical test: Welch’s t-test. Bayesian: Stan code [Toyoda15]
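For reference, the two classical tests named above are available through base R's `t.test`. The per-topic scores below are simulated for illustration (50 topics, with S1 built as a noisy improvement over S2), so the resulting numbers mean nothing beyond showing the calls.

```r
set.seed(10)
s2 <- runif(50, 0.1, 0.6)                             # hypothetical baseline scores
s1 <- pmin(1, pmax(0, s2 + rnorm(50, 0.03, 0.08)))    # hypothetical scores of the new system

# Classical paired t-test (one-sided, H1: S1 > S2): models the per-topic differences
paired <- t.test(s1, s2, paired = TRUE, alternative = "greater")

# Welch's t-test (one-sided): treats the two score sets as independent samples
# with possibly unequal variances
welch <- t.test(s1, s2, var.equal = FALSE, alternative = "greater")

print(c(paired_p = paired$p.value, welch_p = welch$p.value))
```

The paired test uses 49 degrees of freedom (number of topics minus one), whereas Welch's test estimates its degrees of freedom from the two sample variances.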
  • 31. Effect size (Glass’s Δ) [Okubo+12] • In the context of classical significance testing, [Sakai14SIGIRforum] stresses the importance of reporting sample effect sizes and confidence intervals (CIs). • Given your system S1 and a well-known baseline S2 (e.g. BM25), let’s consider the difference between S1 and S2 standardised by an “ordinary” standard deviation (i.e., that of the baseline): $\Delta = \frac{\mu_1 - \mu_2}{\sigma_2}$. • With classical tests, this can simply be estimated from the sample. • With Bayesian tests, an EAP and a credible interval for Δ can easily be obtained from the T realisations. Unlike Cohen’s d/Hedges’ g, Glass’s Δ is free from the homoscedasticity assumption.
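The classical sample-based estimate of Glass's Δ replaces the population quantities with the sample means and the baseline's sample standard deviation. A minimal sketch, with a hypothetical function name `glass_delta` and made-up scores for three topics:

```r
# Sample Glass's delta: difference in means standardised by the baseline's s.d.
glass_delta <- function(s1, s2) {
  (mean(s1) - mean(s2)) / sd(s2)
}

s1 <- c(0.3, 0.5, 0.7)   # hypothetical scores of your system
s2 <- c(0.2, 0.4, 0.6)   # hypothetical scores of the baseline

d <- glass_delta(s1, s2)
print(d)
```

Only the baseline's standard deviation appears in the denominator, so the estimate does not require the two systems to have equal variances.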
  • 32. Proposal summary • Classical paired and unpaired t-tests can easily be replaced by Bayesian tests. • Report the following in papers: - EAP for the difference in population means; - 95% credible interval for the above difference; - P(S1>S2|D), i.e., probability that H: S1>S2 is correct; - EAP for Glass’s Δ; - 95% credible interval for the above Δ. Raw difference Standardised difference
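The five quantities in the proposal can be produced from the posterior realisations in a few lines. In this sketch the realisations of the two population means and of the baseline's standard deviation (`mu1`, `mu2`, `sigma2`) are stand-ins drawn from normal distributions so that the example runs on its own; with real data they would come from the sampler's output instead.

```r
set.seed(99)
T_draws <- 100000

# Stand-ins for MCMC realisations (NOT real sampler output)
mu1 <- rnorm(T_draws, 0.35, 0.02)            # population mean of S1
mu2 <- rnorm(T_draws, 0.30, 0.02)            # population mean of S2 (baseline)
sigma2 <- abs(rnorm(T_draws, 0.15, 0.01))    # population s.d. of S2

delta <- mu1 - mu2        # raw difference, one value per realisation
glass <- delta / sigma2   # Glass's delta, one value per realisation

report <- list(
  eap_diff   = mean(delta),                       # EAP for the difference in means
  ci_diff    = quantile(delta, c(0.025, 0.975)),  # 95% credible interval for the difference
  p_s1_gt_s2 = mean(delta > 0),                   # P(S1>S2|D)
  eap_glass  = mean(glass),                       # EAP for Glass's delta
  ci_glass   = quantile(glass, c(0.025, 0.975))   # 95% credible interval for Glass's delta
)
print(report)
```

Note that Glass's Δ is computed per realisation and then summarised, rather than by dividing the EAP of the difference by the EAP of the standard deviation.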
  • 33. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 34. Purpose of the experiments For both paired and unpaired tests: • How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis) differ from the p-value P(D+|S1=S2)? • How does the Bayesian credible interval differ from the classical CI? • How does the Bayesian EAP for Glass’s Δ differ from the classical sample-based Δ? Will the Bayesian approach turn the IR literature upside down?
  • 35. Data 20 runs = 190 run pairs were compared (w/o considering the familywise error rate)
  • 36. P(S1<S2|D) vs p-value for paired tests Bottom line: p-value can be regarded as a reasonable approximation of P(S1<S2|D) which is what we really want! More results in paper
  • 37. Credible vs Confidence intervals for paired tests Bottom line: CI can be regarded as a reasonable approximation of credible interval which is what we really want! More results in paper
  • 38. EAP Δ vs sample Δ for paired tests The classical sample Δ probably underestimates small effect sizes. “the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” [Kruschke13] More results in paper
  • 39. EAP Δ vs sample Δ for paired tests – an anomalous result • This measure from the TREC Temporal Summarisation Track [Aslam+16] is like a nugget-based F-measure over a timeline. • The actual score distribution of this measure is [0, 0.4021] rather than [0,1], with very low standard deviations (and hence very high effect sizes). • The measure is not as well-understood as the others and deserves an investigation.
  • 40. P(S1<S2|D) vs p-value for unpaired tests Bottom line: p-value can be regarded as a reasonable approximation of P(S1<S2|D) which is what we really want! More results in paper
  • 41. Credible vs Confidence intervals for unpaired tests Bottom line: CI can be regarded as a reasonable approximation of credible interval which is what we really want! More results in paper
  • 42. EAP Δ vs sample Δ for unpaired tests The classical sample Δ probably underestimates small effect sizes. “the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” [Kruschke13] (The sample Δ values are the same as in the paired test experiments.)
  • 43. EAP Δ vs sample Δ for unpaired tests – an anomalous result • This measure from the TREC Temporal Summarisation Track [Aslam+16] is like a nugget-based F-measure over a timeline. • The actual score distribution of this measure is [0, 0.4021] rather than [0,1], with very low standard deviations (and hence very high effect sizes). • The measure is not as well-understood as the others and deserves an investigation.
  • 44. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 45. Install R, rstan and Rtools (for Windows) 1. Install R from 2. On R, install a package called rstan 3. Install Rtools from (check “EDIT the system PATH”)
  • 46. Install my sample scripts 1. Download 2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R . 3. Move sample score files run1_vs_run2.paired.R and run1_vs_run2.unpaired.R also to C:/work/R .
  • 47. Try running the scripts (paired test) (1) Try this (BayesPaired-sample.R) on the R interface:
      library(rstan)
      scr <- "C:/work/R/modelPaired2.stan"   # model written in Stan (see sample file)
      par <- c( "mu", "Sigma", "rho", "delta", "glass" )   # mean, variances, correlation, delta, Glass's delta
      war <- 1000    # burn-in
      ite <- 21000   # iterations including burn-in
      see <- 1234    # seed
      dig <- 3       # significant digits
      cha <- 5       # chains (number of trials = (ite-war)*cha)
      source( "C:/work/R/run1_vs_run2.paired.R" )   # per-topic scores (see sample file)
      outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv"   # output files
      data <- list( N=N, x=x )   # sample size and per-topic scores
      # compile the Stan file and execute sampling (HMC may be used instead of NUTS)
      fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=FALSE, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
      print( fit, pars=par, digits_summary=dig )
      Five output csv files: C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
  • 48. Try running the scripts (paired test) (2)
  • 49. Try running the scripts (paired test) (3) Switch to a UNIX-like environment and process the output csv files using this shell script. The callouts on the slide mark the script’s arguments (start line number after burn-in; T; #chains = #csvfiles) and its outputs: the EAP/credible interval for the difference, the EAP/credible interval for the Δ using S2’s standard deviation, P(S1>S2|D) and P(S1<S2|D), and the EAP/credible interval for the correlation coefficient. The thresholds for the probabilities can be modified by editing the script.
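For readers who prefer to stay in R, the per-chain csv files can also be post-processed there. The sketch below is not the author's shell script. It assumes that each csv file holds comment lines starting with `#`, one header row, and one row per iteration with the burn-in rows first, and that the columns `delta` and `glass` match the parameter names used in the R script. The helper name `read_draws` and the tiny stand-in files are hypothetical.

```r
# Read several per-chain csv files, drop the burn-in rows of each, and stack the rest
read_draws <- function(files, warmup) {
  per_chain <- lapply(files, function(f) {
    d <- read.csv(f, comment.char = "#")              # skip comment lines
    d[seq_len(nrow(d)) > warmup, , drop = FALSE]      # keep post-burn-in rows only
  })
  do.call(rbind, per_chain)
}

# Demo on two tiny stand-in files (real output has many more rows and columns)
set.seed(5)
files <- replicate(2, tempfile(fileext = ".csv"))
for (f in files) {
  demo <- data.frame(delta = rnorm(30, 0.05, 0.03), glass = rnorm(30, 0.3, 0.1))
  writeLines(c("# stand-in for the comment lines in a sampler output file",
               "delta,glass",
               paste(demo$delta, demo$glass, sep = ",")), f)
}

draws <- read_draws(files, warmup = 10)   # 2 chains x (30 - 10) rows

summary_out <- list(
  eap_diff   = mean(draws$delta),
  ci_diff    = quantile(draws$delta, c(0.025, 0.975)),
  p_s1_gt_s2 = mean(draws$delta > 0),
  eap_glass  = mean(draws$glass),
  ci_glass   = quantile(draws$glass, c(0.025, 0.975))
)
print(summary_out)
```

With the real files, `files` would list the five csv files written by the sampler and `warmup` would equal the burn-in setting of the R script; both should be checked against the actual file contents before relying on the numbers.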
  • 50. Try running the scripts (unpaired test) (1) Try this (BayesPaired-sample.R) on the R interface:
      library(rstan)
      scr <- "C:/work/R/modelUnpaired2.stan"   # model written in Stan (see sample file)
      par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" )   # means, s.d.'s, delta, Glass's delta
      war <- 1000    # burn-in
      ite <- 21000   # iterations including burn-in
      see <- 1234    # seed
      dig <- 3       # significant digits
      cha <- 5       # chains (number of trials = (ite-war)*cha)
      source( "C:/work/R/run1_vs_run2.unpaired.R" )   # per-topic scores (see sample file)
      outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv"   # output files
      data <- list( N1=N1, N2=N2, x1=x1, x2=x2 )   # sample sizes and per-topic scores
      # compile the Stan file and execute sampling (HMC may be used instead of NUTS)
      fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=FALSE, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
      print( fit, pars=par, digits_summary=dig )
      Five output csv files: C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
  • 51. Try running the scripts (unpaired test) (2)
  • 52. Try running the scripts (unpaired test) (3) Switch to a UNIX-like environment and process the output csv files using this shell script. The callouts on the slide mark the script’s arguments (start line number after burn-in; T; #chains = #csvfiles) and its outputs: the EAP/credible interval for the difference, the EAP/credible interval for the Δ using S2’s standard deviation, and P(S1>S2|D) and P(S1<S2|D). The thresholds for the probabilities can be modified by editing the script. Just as an unpaired t-test p-value is larger than the corresponding paired t-test p-value, P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger than that with the paired one (0.006). The unpaired Bayesian test also overestimates Δ (0.222 vs. 0.189).
  • 53. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 54. Takeaways • Many statisticians now use Bayesian statistics to discuss P(H|D) instead of the classical p-values (P(D+|H)). Likewise, the IR community should not shy away from the Bayesian approach, since it enables us to easily discuss P(H|D) for virtually any Hypothesis H. • Results in the IR literature which relied on classical significance tests are not necessarily wrong, since P(H|D) and credible intervals are actually quite similar to p-values and confidence intervals. • Starting today, report the right statistics, including the effect sizes, using Bayesian statistics (perhaps along with classical ones). Simple tools are available from:
  • 55. Acknowledgements: I thank • Professor Hideki Toyoda (Waseda University) For letting me tweak his R code and for his wonderful books on Bayesian tests (in Japanese). Professor Toyoda’s original code is available at (with comments in Japanese) • Dr. Matthew Ekstrand-Abueg (Google) For letting me play with the TREC Temporal Summarisation results!
  • 56. References (1)
      [Aslam+16] TREC 2015 Temporal Summarization Track. TREC 2015, 2016.
      [Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 1763.
      [Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
      [Carterette15] Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR 2015.
      [Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
      [Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd, 1970.
      [Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
      [Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental Psychology: General, 142(2), 2013.
  • 57. References (2)
      [Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
      [Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain Monte Carlo, Chapman & Hall, 2011.
      [Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval, and Power. Keiso Shobo, 2012.
      [Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
      [Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
      [Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose. The American Statistician, 2016.
      [Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
      [Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, 2008.