1. The Probability That
Your Hypothesis Is Correct,
Credible Intervals, and
Effect Sizes for IR Evaluation
Tetsuya Sakai
Waseda University
tetsuyasakai@acm.org
@tetsuyasakai
August 8, 2017 @ SIGIR 2017, Tokyo.
2. Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature that relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
3. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
4. 1. P-values can indicate how incompatible the data are with a
specified statistical model.
2. P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone.
3. Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.
[Wasserstein+16]
5. 4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.
P-value = P(D+|H)
Probability of observing the observed data D or
something more extreme UNDER Hypothesis H.
[Wasserstein+16]
6. Problems with classical significance testing
• Statistical significance ≠ practical significance
• Dichotomous thinking: statistically significant (p<0.05) or not
• P-values and CIs are often misunderstood (see the ASA statements)
• Even if the p-value is reported, p-value = f(effect_size, sample_size)
large effect_size (magnitude of difference) ⇒ small p-value
large sample_size (e.g. #topics) ⇒ small p-value.
So effect sizes should be reported [Sakai14SIGIRforum].
“I have learned and taught that the primary product of a research inquiry is
one or more measures of effect size, not p values” [Cohen1990]
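To see the p-value = f(effect_size, sample_size) dependence concretely, here is a small illustrative Python sketch (not part of the SIGIR2017PACK tools): with the standardised effect held fixed, the paired t statistic grows like √n, so a larger topic set alone pushes the p-value down. The mean difference, standard deviation, and topic-set sizes below are made up for illustration.

```python
import math

def paired_t_stat(mean_diff, sd_diff, n):
    """t statistic for a paired t-test: t = d_bar / (s_d / sqrt(n))."""
    return mean_diff / (sd_diff / math.sqrt(n))

# Same effect size (mean difference 0.02, s.d. 0.1, i.e. a standardised
# difference of 0.2), evaluated with 25 topics vs. 100 topics:
t_small = paired_t_stat(0.02, 0.1, 25)    # t = 1.0 -> far from significant
t_large = paired_t_stat(0.02, 0.1, 100)   # t = 2.0 -> p < 0.05 (two-sided, df=99)

print(t_small, t_large)
```

The magnitude of the difference is identical in both cases; only the sample size changed, which is exactly why the effect size should be reported alongside (or instead of) the p-value.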
8. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
9. Statisticians are going Bayesian
According to [Toyoda15], over one-half of the Biometrika papers published
in 2014 utilised Bayesian statistics.
[Photo: William S. Gosset ("Student"), who devised the t-test]
10. Bayes’ rule (x: data; θ: parameter, e.g. population mean)
f(θ|x) = f(x|θ) f(θ) / ∫ f(x|θ) f(θ) dθ
where f(θ|x) is the posterior probability distribution of θ,
f(x|θ) is the likelihood of x given θ,
f(θ) is the prior probability distribution of θ,
and the denominator is a normalising constant that ensures ∫ f(θ|x) dθ = 1.
[Bayes1763]
12. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
13. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
Bayesian methods can discuss P(H|D) directly and can easily
handle various hypotheses. There really is no reason to reject them now.
But classical tests also rely on assumptions
14. [Kruschke13] Journal of Experimental Psychology
“Some people may wonder which approach, Bayesian
or NHST, is more often correct. This question has
limited applicability because in real research we never
know the ground truth: all we have is a sample of data.
[...] the relevant question is asking which method
provides the richest, most informative, and meaningful
results for any set of data. The answer is always
Bayesian estimation.”
NHST = Null Hypothesis Significance Testing
15. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
16. f(θ|x) is governed by the kernel
f(θ|x) = f(x|θ) f(θ) / ∫ f(x|θ) f(θ) dθ ∝ f(x|θ) f(θ)
The denominator is a normalising constant that ensures ∫ f(θ|x) dθ = 1; it does not
depend on θ, so the shape of the posterior probability distribution f(θ|x)
is governed by the kernel f(x|θ) f(θ).
17. f(θ|x) is now governed by the likelihood
We don’t know the prior, so use a uniform distribution: f(θ) ∝ const.
Then f(θ|x) ∝ f(x|θ), the likelihood of x given θ, and the posterior is
shaped entirely by the likelihood.
Expected A Posteriori (EAP) estimate of θ: the posterior mean,
θ_EAP = ∫ θ f(θ|x) dθ.
With a uniform prior, the posterior mode coincides with the
Maximum Likelihood Estimate (MLE).
18. Posterior variance and credible intervals
Posterior variance (how θ moves around): var(θ|x) = ∫ (θ − θ_EAP)² f(θ|x) dθ.
100(1−α)% credible interval for θ: an interval [θ_L, θ_U] satisfying
P(θ_L ≤ θ ≤ θ_U | x) = 1 − α, leaving probability α/2 in each tail of f(θ|x).
19. Frequentist vs. Bayesian
Frequentist view:
• θ is a constant!
• A 95% confidence interval (CI) means: construct 100 CIs using 100
different samples; about 95 of the 100 CIs will actually contain θ, the
constant.
Bayesian view:
• θ is a random variable!
• The probability that θ lies within the 95% credible interval is 95%.
[Figures: 100 CIs around the constant θ; the posterior f(θ|x) with α/2 in
each tail of the 100(1−α)% credible interval for θ]
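The frequentist interpretation above can be checked with a small simulation. This is an illustrative Python sketch (the true mean θ, σ, and sample size are hypothetical, and σ is assumed known so the CI is simply x̄ ± 1.96 σ/√n): construct 100 CIs from 100 different samples and count how many contain the constant θ.

```python
import random

random.seed(0)
theta, sigma, n = 0.30, 0.10, 25   # true (constant!) mean, known s.d., sample size

def ci95(sample):
    """95% CI for the mean with known sigma: x_bar +/- 1.96 * sigma / sqrt(n)."""
    xbar = sum(sample) / len(sample)
    half = 1.96 * sigma / len(sample) ** 0.5
    return xbar - half, xbar + half

hits = 0
for _ in range(100):                       # construct 100 CIs from 100 samples
    sample = [random.gauss(theta, sigma) for _ in range(n)]
    lo, hi = ci95(sample)
    hits += (lo <= theta <= hi)

print(hits, "of the 100 CIs contain theta")   # around 95, as the theory says
```

Note that each individual CI either contains θ or it does not; only the long-run proportion is 95%, which is exactly the contrast with the Bayesian credible interval.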
20. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
MCMC (Markov Chain Monte Carlo), in particular,
HMC (Hamiltonian Monte Carlo) and a variant implemented in stan
Yes, but we use uniform priors, so the posterior is simply the
normalised likelihood (and the posterior mode coincides with the MLE)
21. MCMC (Markov Chain Monte Carlo)
• A family of methods for sampling θ repeatedly according to f(θ|x).
• Construct Markov chains of θ values so that, after a burn-in period (the
first B values), the values obey f(θ|x).
• Collect T’ values sequentially and throw away the initial B values to
obtain T = T’ – B realisations of θ.
22. Things that we can obtain from the T realisations
EAP for θ: Just take the average over T realisations of θ.
Credible interval for θ: Just take the middle 100(1-α)% of the T
realisations of θ.
The above methods apply to quantities other than θ itself, including
effect sizes.
Probability that hypothesis H is correct: Just count the number of
realisations that satisfy H and divide by T!
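The three recipes above can be sketched in a few lines. The following illustrative Python snippet uses simulated draws as a stand-in for the T realisations that Stan would output (the posterior N(0.02, 0.01) for a score difference δ is hypothetical):

```python
import random

random.seed(1)
# Stand-in for T MCMC realisations of delta = mu1 - mu2 drawn from f(delta|x);
# in practice these come out of Stan's output csv files.
T = 100_000
draws = sorted(random.gauss(0.02, 0.01) for _ in range(T))

# EAP: just the average over the T realisations.
eap = sum(draws) / T

# 95% credible interval: just the middle 95% of the sorted realisations.
alpha = 0.05
lo = draws[int(T * alpha / 2)]
hi = draws[int(T * (1 - alpha / 2)) - 1]

# P(H|D) for H: delta > 0 -- count the realisations satisfying H, divide by T.
p_h = sum(d > 0 for d in draws) / T

print(f"EAP={eap:.4f}  95% CrI=({lo:.4f}, {hi:.4f})  P(delta>0|D)={p_h:.3f}")
```

The same counting trick works for any derived quantity (e.g. Glass's Δ): transform each realisation first, then take averages, middle intervals, or counts.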
23. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (1)
• Given a curved surface in an ideal physical world, any object would
move on the surface while keeping the Hamiltonian (= potential
energy + kinetic energy) constant.
• The path of the object is governed by Hamilton’s equations of
motion, solved numerically by the leapfrog method with parameters ε (step
size) and L (number of leapfrog steps).
• To sample from f(θ|x) (our “curved surface”), we put an object on the
surface and give it a push; after L units of time, record its position and
give it another push...
In practice, ε can be set automatically;
a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
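The physical picture above can be made concrete with a toy HMC sampler for a one-dimensional standard normal target. This is a minimal Python illustration, not the Stan implementation: ε and L are fixed by hand rather than tuned automatically, and the target f(θ) ∝ exp(−θ²/2) gives the potential energy U(θ) = θ²/2.

```python
import math, random

random.seed(2)

def grad_U(theta):
    """Gradient of the potential U(theta) = theta**2 / 2
    (the target is a standard normal, f(theta) proportional to exp(-U))."""
    return theta

def hmc_step(theta, eps=0.1, L=20):
    p = random.gauss(0, 1)                 # give the object a random push
    th, q = theta, p
    q -= 0.5 * eps * grad_U(th)            # leapfrog: half step for momentum
    for i in range(L):
        th += eps * q                      # full step for position
        if i < L - 1:
            q -= eps * grad_U(th)          # full step for momentum
    q -= 0.5 * eps * grad_U(th)            # final half step for momentum
    # Hamiltonian = potential + kinetic energy; accept with prob min(1, e^-dH)
    h_old = 0.5 * theta ** 2 + 0.5 * p ** 2
    h_new = 0.5 * th ** 2 + 0.5 * q ** 2
    if random.random() < math.exp(min(0.0, h_old - h_new)):
        return th, True
    return theta, False

theta, draws, accepted = 0.0, [], 0
for _ in range(5000):
    theta, ok = hmc_step(theta)
    draws.append(theta)
    accepted += ok

draws = draws[500:]                        # discard a burn-in period
mean = sum(draws) / len(draws)
print(f"acceptance rate = {accepted / 5000:.2f}, posterior mean = {mean:.2f}")
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance rate comes out very high, which is the next slide's point about efficient sampling.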
24. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (2)
With HMC, the Metropolis acceptance probability r is often close to 1
⇒ few rejections = efficient sampling
25. Stan’s criteria for checking the chains
• R̂ (potential scale reduction factor) checks convergence by comparing the
variance across multiple chains with that within the chains.
• n_eff (effective sample size) quantifies sampling efficiency: “you have
obtained a sample of size T, but that’s worth a sample of size n_eff when
there is zero correlation within the chain.”
In practice, we let T = 100,000, which is much more than enough, so
convergence is not a worry.
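The convergence check can be illustrated with a simplified (unsplit) version of the potential scale reduction factor R̂. This Python sketch uses synthetic chains rather than real Stan output; the chain values and counts are made up for illustration.

```python
import random

random.seed(3)

def r_hat(chains):
    """Gelman-Rubin potential scale reduction factor (simplified, no splitting)."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within-chain
    var_hat = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return (var_hat / W) ** 0.5

# Five well-mixed chains drawn from the same posterior -> R_hat close to 1:
good = [[random.gauss(0, 1) for _ in range(4000)] for _ in range(5)]
# One chain stuck around a different value -> R_hat well above 1:
bad = good[:4] + [[random.gauss(3, 1) for _ in range(4000)]]

print(round(r_hat(good), 3), round(r_hat(bad), 3))
```

When the chains disagree, the between-chain variance B inflates the pooled estimate and R̂ rises above 1, flagging non-convergence.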
26. HMC/Stan vs. other MCMC methods
[Hoffman+14] HMC’s features “allow it to converge to high-dimensional
target distributions much more quickly than simpler methods such as
random walk Metropolis[-Hastings] or Gibbs sampling”
[Kruschke15] “HMC can be more effective than the various samplers in JAGS
and BUGS, especially for large complex models. [...] However, Stan is not
universally faster or better (at this stage in its development).”
In IR,
[Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs
Sampler), an open-source implementation of BUGS (Bayesian inference
Using Gibbs Sampling);
[Zhang+16] used Metropolis-Hastings.
27. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
28. Consider the problem of comparing two means
Paired test vs. unpaired (= two-sample) test.
Classical one-sided test:
H0: S1 = S2 (equal population means), H1: S1 > S2
p-value = P(D+|S1 = S2):
the probability of observing the observed data D
or something more extreme under H0.
[Figures: S1’s and S2’s per-topic scores in the paired and unpaired settings]
29. But what we really want is...
(The same paired / unpaired settings and classical hypotheses as above.)
With the Bayesian approach, we can easily obtain
P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by
simply counting the number of realisations that satisfy S1>S2
and dividing by T!
30. Statistical models (classical and Bayesian)
Paired test:
• Classical paired t-test: the score differences obey a normal distribution
N(μ_d, σ_d²).
• Bayesian: the score pairs obey a bivariate normal distribution.
Unpaired (= two-sample) test:
• S1’s scores obey N(μ1, σ1²) and S2’s scores obey N(μ2, σ2²).
• Classical test: Welch’s t-test.
• Bayesian: the same two-normal model, expressed in Stan code.
[Toyoda15]
31. Effect size (Glass’s Δ) [Okubo+12]
• In the context of classical significance testing, [Sakai14SIGIRforum] stresses
the importance of reporting sample effect sizes and confidence intervals
(CIs).
• Given your system S1 and a well-known baseline S2 (e.g. BM25), consider
the difference between S1 and S2 standardised by an “ordinary” standard
deviation, i.e., that of the baseline: Δ = (μ1 − μ2) / σ2.
• With classical tests, this can simply be estimated from the sample.
• With Bayesian tests, an EAP and a credible interval for Δ can easily be
obtained from the T realisations.
Unlike Cohen’s d / Hedges’ g, Glass’s Δ is free from the
homoscedasticity assumption
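A minimal Python sketch of the sample-based estimate, with hypothetical per-topic scores (the Bayesian EAP version simply applies the same formula to each of the T realisations and averages):

```python
def glass_delta(scores1, scores2):
    """Glass's Delta: difference in means standardised by the baseline's s.d."""
    n1, n2 = len(scores1), len(scores2)
    m1 = sum(scores1) / n1
    m2 = sum(scores2) / n2
    s2 = (sum((x - m2) ** 2 for x in scores2) / (n2 - 1)) ** 0.5
    return (m1 - m2) / s2

# Hypothetical per-topic nDCG scores for a new system vs. a BM25 baseline:
system = [0.42, 0.55, 0.63, 0.48, 0.51]
bm25   = [0.40, 0.45, 0.52, 0.41, 0.47]

print(round(glass_delta(system, bm25), 3))   # about 1.4
```

Note that only the baseline's standard deviation appears in the denominator, which is why no equal-variance (homoscedasticity) assumption is needed.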
32. Proposal summary
• Classical paired and unpaired t-tests can easily be replaced by
Bayesian tests.
• Report the following in papers:
- EAP for the difference in population means (the raw difference);
- 95% credible interval for the above difference;
- P(S1>S2|D), i.e., the probability that H: S1>S2 is correct;
- EAP for Glass’s Δ (the standardised difference);
- 95% credible interval for the above Δ.
33. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
34. Purpose of the experiments
For both paired and unpaired tests:
• How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis)
differ from the p-value P(D+|S1=S2)?
• How does the Bayesian credible interval differ from the classical CI?
• How does the Bayesian EAP for Glass’s Δ differ from the classical
sample-based Δ?
Will the Bayesian approach turn the IR literature upside down?
35. Data
20 runs = 190 run pairs were compared
(w/o considering the familywise error rate)
36. P(S1<S2|D) vs p-value for paired tests
Bottom line:
the p-value can be regarded as a
reasonable approximation
of P(S1<S2|D), which is what we
really want!
More results in paper
37. Credible vs Confidence intervals for paired tests
Bottom line: the CI can be
regarded as a reasonable
approximation of the
credible interval, which
is what we really want!
More results in paper
38. EAP Δ vs sample Δ for paired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke15]
More results in paper
39. EAP Δ vs sample Δ for paired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
40. P(S1<S2|D) vs p-value for unpaired tests
Bottom line:
the p-value can be regarded as a
reasonable approximation
of P(S1<S2|D), which is what we
really want!
More results in paper
41. Credible vs Confidence intervals for unpaired tests
Bottom line: the CI can be
regarded as a reasonable
approximation of the
credible interval, which
is what we really want!
More results in paper
42. EAP Δ vs sample Δ for unpaired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke15]
same sample Δ from the paired test experiments
43. EAP Δ vs sample Δ for unpaired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
44. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
45. Install R, rstan and Rtools (for Windows)
1. Install R from https://www.r-project.org/
2. On R, install a package called rstan
3. Install Rtools from https://cran.r-project.org/bin/windows/Rtools/
(check “EDIT the system PATH”)
46. Install my sample scripts
1. Download http://waseda.box.com/SIGIR2017PACK
2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R .
3. Move sample score files run1_vs_run2.paired.R and
run1_vs_run2.unpaired.R also to C:/work/R .
47. Try running the scripts (paired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelPaired2.stan" #model written in Stan (see sample file)
par <- c( "mu", "Sigma", "rho", "delta", "glass" ) #mean, variances, correlation, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.paired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" #output files
data <- list( N=N, x=x ) #sample size and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling (algorithm="HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
49. Try running the scripts (paired test) (3)
Switch to a UNIX-like environment and process the output csv files using
the provided shell script; its parameters are the start line number after
burn-in, T, and the number of chains (= the number of csv files).
The script reports:
• the EAP/credible interval for the difference;
• the EAP/credible interval for the Δ using S2’s standard deviation;
• P(S1>S2|D) and P(S1<S2|D);
• the EAP/credible interval for the correlation coefficient.
The thresholds for the probabilities can be modified by editing the script.
50. Try running the scripts (unpaired test) (1)
Try this (BayesUnpaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelUnpaired2.stan" #model written in Stan (see sample file)
par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" ) #means, s.d.'s, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.unpaired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" #output files
data <- list( N1=N1, N2=N2, x1=x1, x2=x2 ) #sample sizes and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling (algorithm="HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
52. Try running the scripts (unpaired test) (3)
Switch to a UNIX-like environment and process the output csv files using
the provided shell script (parameters: the start line number after burn-in,
T, and the number of chains = the number of csv files).
The script reports:
• the EAP/credible interval for the difference;
• the EAP/credible interval for the Δ using S2’s standard deviation;
• P(S1>S2|D) and P(S1<S2|D).
The thresholds for the probabilities can be modified by editing the script.
Just as an unpaired t-test p-value is larger than the corresponding paired
t-test p-value, P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger
than that with the paired one (0.006). The unpaired Bayesian test also
overestimates Δ (0.222 vs. 0.189).
53. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
54. Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature that relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
55. Acknowledgements: I thank
• Professor Hideki Toyoda (Waseda University)
For letting me tweak his R code and for his wonderful books on
Bayesian tests (in Japanese).
Professor Toyoda’s original code is available at
http://www.asakura.co.jp/G_27_2.php?id=200
(with comments in Japanese)
• Dr. Matthew Ekstrand-Abueg (Google)
For letting me play with the TREC Temporal Summarisation results!
56. References (1)
[Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016.
[Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances.
Philosophical Transactions of the Royal Society of London, 53, 1763.
[Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
[Carterette15] Bayesian Inference for Information Retrieval Evaluation, ACM ICTIR
2015.
[Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
[Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd,
1970.
[Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental
Psychology: General, 142(2), 2013.
[Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
57. References (2)
[Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
[Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain
Monte Carlo, Chapman & Hall, 2011.
[Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence
Interval, and Power. Keiso Shobo, 2012.
[Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
[Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by
Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
[Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose.
The American Statistician, 2016.
[Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
[Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs,
Justice, and Lives. The University of Michigan Press, 2008.