The Probability That
Your Hypothesis Is Correct,
Credible Intervals, and
Effect Sizes for IR Evaluation
Tetsuya Sakai
Waseda University
tetsuyasakai@acm.org
@tetsuyasakai
August 8, 2017 @ SIGIR 2017, Tokyo.
Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
The ASA statement on p-values [Wasserstein+16]:
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

P-value = P(D+|H): the probability of observing the observed data D or something more extreme UNDER hypothesis H. [Wasserstein+16]
Problems with classical significance testing
• Statistical significance ≠ practical significance
• Dichotomous thinking: statistically significant (p<0.05) or not
• P-values and CIs are often misunderstood (see the ASA statements)
• Even if the p-value is reported, p-value = f(effect_size, sample_size):
a large effect_size (magnitude of difference) ⇒ small p-value;
a large sample_size (e.g. #topics) ⇒ small p-value.
So effect sizes should be reported [Sakai14forum] (see the R sketch below).
“I have learned and taught that the primary product of a research inquiry is
one or more measures of effect size, not p values” [Cohen1990]
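To see the sample-size dependence concretely, here is a small illustrative R sketch (not from the talk; the scores are simulated and the mean difference 0.02 and sd 0.1 are made up): the same underlying effect gives very different paired t-test p-values as the number of topics grows.

# Illustrative sketch: p-value = f(effect_size, sample_size).
# Same underlying mean difference (0.02) and per-topic sd (0.1), hence the
# same effect size, but the paired t-test p-value shrinks as #topics n grows.
set.seed(1)
for (n in c(10, 50, 250)) {
  d <- rnorm(n, mean = 0.02, sd = 0.1)   # hypothetical per-topic score differences
  cat(sprintf("n = %3d   sample effect size = %.2f   p = %.4f\n",
              n, mean(d) / sd(d), t.test(d)$p.value))
}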
[Ziliak+08] (figure: “dichotomous thinking” vs. “practical significance”)
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Statisticians are going Bayesian
According to [Toyoda15], over one-half of Biometrika papers published
in 2014 utilised Bayesian statistics.
William S. Gosset
Bayes’ rule (x: data; θ: parameter, e.g. population mean) [Bayes1763]
f(θ|x) = f(x|θ) f(θ) / f(x)
• f(θ|x): posterior probability distribution of θ
• f(x|θ): likelihood of x given θ
• f(θ): prior probability distribution of θ
• f(x) = ∫ f(x|θ) f(θ) dθ: normalising constant that ensures ∫ f(θ|x) dθ = 1
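As a purely illustrative R sketch (not from the talk; the scores and the sd of 0.1 are made up), Bayes’ rule can be evaluated by grid approximation: multiply the likelihood by the prior and rescale so the result sums to one.

# Illustrative sketch: Bayes' rule by grid approximation for a normal mean theta
# with a known sd (0.1, made up) and a uniform prior on [0, 1].
x     <- c(0.31, 0.42, 0.38, 0.45, 0.29)   # hypothetical per-topic scores
theta <- seq(0, 1, by = 0.001)             # grid of candidate values for theta
prior <- rep(1, length(theta))             # uniform prior f(theta), unnormalised
lik   <- sapply(theta, function(t) prod(dnorm(x, mean = t, sd = 0.1)))  # f(x|theta)
post  <- lik * prior                       # kernel: likelihood x prior
post  <- post / sum(post)                  # the normalising constant makes it sum to 1
plot(theta, post, type = "l")              # posterior distribution f(theta|x)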
[Fisher1970]
Fisher hated the
Bayesian approach
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
Bayesian methods can discuss P(H|D) directly and can easily
handle various hypotheses. There really is no reason to reject them now.
But classical tests also rely on assumptions
[Kruschke13] Journal of Experimental Psychology
“Some people may wonder which approach, Bayesian
or NHST, is more often correct. This question has
limited applicability because in real research we never
know the ground truth: all we have is a sample of data.
[...] the relevant question is asking which method
provides the richest, most informative, and meaningful
results for any set of data. The answer is always
Bayesian estimation.”
NHST = Null Hypothesis Significance Testing
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
f(θ|x) is governed by the kernel
f(θ|x) = f(x|θ) f(θ) / f(x) ∝ f(x|θ) f(θ)
• f(θ|x): posterior probability distribution of θ
• f(x): normalising constant that ensures ∫ f(θ|x) dθ = 1
• The shape of f(θ|x) is governed by the kernel f(x|θ) f(θ); the constant f(x) does not depend on θ.
f(θ|x) is now governed by the likelihood
We don’t know the prior, so use a uniform distribution: f(θ) = const. Then
f(θ|x) ∝ f(x|θ) f(θ) ∝ f(x|θ),
i.e. the posterior probability distribution of θ is governed by the likelihood of x given θ.
Expected A Posteriori (EAP) estimate of θ:
EAP(θ) = E[θ|x] = ∫ θ f(θ|x) dθ
With a uniform prior, same as the Maximum Likelihood Estimate (MLE).
Posterior variance and credible intervals
Posterior variance (how θ moves around):
V[θ|x] = ∫ (θ − E[θ|x])² f(θ|x) dθ
Credible interval: the interval that leaves probability α/2 in each tail of f(θ|x), so that the central region carries probability 1 − α.
(Figure: f(θ|x) with tail areas α/2 on each side and the central 1 − α region marked as the 100(1−α)% credible interval for θ)
Frequentist vs. Bayesian
Frequentist:
• θ is a constant!
• A 95% confidence interval (CI) means: construct 100 CIs using 100 different samples; 95 of the 100 CIs will actually contain θ, the constant.
Bayesian:
• θ is a random variable!
• The probability that θ lies within the 95% credible interval is 95%.
(Figures: 100 CIs around the constant θ; f(θ|x) with tail areas α/2 and the central 1 − α region as the 100(1−α)% credible interval for θ)
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
MCMC (Markov Chain Monte Carlo), in particular,
HMC (Hamiltonian Monte Carlo) and a variant implemented in stan
Yes, but we use uniform priors,
so our EAP estimates are also MLE estimates
MCMC (Markov Chain Monte Carlo)
• Methods for sampling θ repeatedly according to f(θ|x).
• Construct Markov chains of θ values so that, after a burn-in period (the first B
values), all subsequent values obey f(θ|x).
• Collect T’ values sequentially, throw away the initial B values, to
obtain T = T’ – B realisations of θ.
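To make the burn-in period and the T retained realisations concrete, here is a toy random-walk Metropolis sampler in R. This is purely illustrative: it is NOT the HMC/NUTS sampler used in the talk, and the data and the flat prior are made up.

# Toy MCMC sketch (random-walk Metropolis, not the HMC/NUTS used in the talk).
# It samples theta repeatedly according to f(theta|x) and discards the burn-in.
set.seed(1)
x <- c(0.31, 0.42, 0.38, 0.45, 0.29)                # hypothetical per-topic scores
log_post <- function(th) sum(dnorm(x, mean = th, sd = 0.1, log = TRUE))  # flat prior
Tprime <- 20000; B <- 1000                          # T' draws in total, burn-in B
theta <- numeric(Tprime); theta[1] <- 0.5
for (t in 2:Tprime) {
  cand   <- theta[t - 1] + rnorm(1, 0, 0.05)        # propose a nearby value
  accept <- log(runif(1)) < log_post(cand) - log_post(theta[t - 1])
  theta[t] <- if (accept) cand else theta[t - 1]
}
draws <- theta[(B + 1):Tprime]                      # T = T' - B realisations of theta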
Things that we can obtain from the T realisations
EAP for θ: Just take the average over T realisations of θ.
Credible interval for θ: Just take the middle 100(1-α)% of the T
realisations of θ.
The above methods apply to quantities other than θ itself, including
effect sizes.
Probability that hypothesis H is correct: Just count the number of
realisations that satisfy H and divide by T!
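Each bullet above is literally a one-liner in R once the realisations are in hand. In this sketch the vector draws is simulated as a stand-in for the T realisations of, say, the difference in population means.

# Illustrative sketch: EAP, credible interval and P(H|D) from T realisations.
set.seed(1)
draws <- rnorm(100000, mean = 0.02, sd = 0.01)  # stand-in for the T realisations
eap   <- mean(draws)                            # EAP: average over the T realisations
ci95  <- quantile(draws, c(0.025, 0.975))       # middle 95% = 95% credible interval
p_h   <- mean(draws > 0)                        # P(H: theta > 0 | D): count and divide by T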
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (1)
• Given a curved surface in an ideal physical world, any object would
move on the surface while keeping the Hamiltonian (= potential
energy + kinetic energy) constant.
• The path of the object is governed by Hamilton’s equations of
motion, solved by the leapfrog method with parameters ε (stepsize)
and L (number of leapfrog steps).
• To sample from f(θ|x) (our “curved surface”), we put an object on the
surface and give it a push; after L units of time, record its position and
give it another push...
In practice, ε can be set automatically;
a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
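The physics analogy can be written out in a few lines of R. The sketch below is purely illustrative (Stan does all of this internally): one HMC transition for a one-dimensional standard-normal target, using the leapfrog method with stepsize ε and L steps.

# Minimal one-dimensional HMC sketch (illustrative only; Stan does this for us).
# Target: standard normal, i.e. potential energy U(th) = th^2 / 2.
hmc_step <- function(theta, eps = 0.1, L = 20) {
  U <- function(th) th^2 / 2             # potential energy
  gradU <- function(th) th               # its gradient
  p  <- rnorm(1)                         # give the object a random push (momentum)
  th <- theta
  H0 <- U(th) + p^2 / 2                  # Hamiltonian = potential + kinetic energy
  p  <- p - eps * gradU(th) / 2          # leapfrog: half step for the momentum
  for (i in 1:L) {
    th <- th + eps * p                   # full step for the position
    if (i < L) p <- p - eps * gradU(th)  # full step for the momentum
  }
  p  <- p - eps * gradU(th) / 2          # final half step for the momentum
  H1 <- U(th) + p^2 / 2                  # Hamiltonian after the move (close to H0)
  if (runif(1) < exp(H0 - H1)) th else theta   # accept with probability r, usually near 1
}
set.seed(1)
draws <- numeric(5000); draws[1] <- 0
for (t in 2:5000) draws[t] <- hmc_step(draws[t - 1])   # samples from N(0, 1)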
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (2)
With HMC, the acceptance probability r is often close to 1
⇒ few rejections = efficient sampling
Stan’s criteria for checking the chains
• R̂ (“R-hat”) checks convergence by comparing the variance across multiple
chains with that within the chains.
• n_eff (effective sample size) quantifies sampling efficiency: “you have obtained a sample of
size T, but that’s worth a sample of size n_eff when there is zero
correlation within the chain.”
In practice, we let T = 100,000, which is much more than enough, so we do not need
to worry about convergence.
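In rstan these diagnostics are reported by print(fit) for every monitored parameter (as in the scripts later in this deck) and can also be pulled out programmatically; a minimal sketch, assuming fit is the stanfit object returned by stan():

# Convergence/efficiency check (sketch); fit is the stanfit object from stan().
diag <- summary(fit)$summary[, c("n_eff", "Rhat")]
print(round(diag, 3))   # Rhat close to 1 and a large n_eff indicate healthy chains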
HMC/Stan vs. other MCMC methods
[Hoffman+14] HMC’s features “allow it to converge to high-dimensional
target distributions much more quickly than simpler methods such as
random walk Metropolis[-Hastings] or Gibbs sampling”
[Kruschke15] “HMC can be more effective than the various samplers in JAGS
and BUGS, especially for large complex models. [...] However, Stan is not
universally faster or better (at this stage in its development).”
In IR,
[Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs
Sampler), an open-source implementation of BUGS (Bayesian inference
Using Gibbs Sampling);
[Zhang+16] used Metropolis-Hastings.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Consider the problem of comparing two means
Paired test / Unpaired (= two-sample) test
(Tables on the slide: S1’s scores and S2’s scores for each setting)
Classical one-sided test:
H0: S1 = S2, H1: S1 > S2
p-value = P(D+|S1 = S2): the probability of observing the observed data D
or something more extreme under H0.
But what we really want is...
Paired test / Unpaired (= two-sample) test
(Same setting as before: S1’s scores and S2’s scores)
Classical one-sided test: H0: S1 = S2, H1: S1 > S2; p-value = P(D+|S1 = S2).
With the Bayesian approach, we can easily obtain
P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by
simply counting the number of realisations that satisfy S1>S2
and dividing by T!
Statistical models (classical and Bayesian)
Paired test
- Classical paired t-test: the per-topic score differences obey a normal distribution N(μ, σ²).
- Bayesian: the (S1, S2) score pairs obey a bivariate normal distribution.
Unpaired (= two-sample) test
- S1’s scores obey N(μ1, σ1²); S2’s scores obey N(μ2, σ2²).
- Classical test: Welch’s t-test.
- Bayesian: the same two-normal model, written as Stan code [Toyoda15].
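For reference, the classical counterparts named above are one-liners in base R; x1 and x2 below are hypothetical per-topic score vectors for S1 and S2.

# Classical counterparts of the two settings above (base R).
x1 <- c(0.45, 0.52, 0.39, 0.61, 0.48)   # hypothetical per-topic scores of S1
x2 <- c(0.41, 0.44, 0.35, 0.58, 0.40)   # hypothetical per-topic scores of S2
t.test(x1, x2, paired = TRUE,     alternative = "greater")   # classical paired t-test
t.test(x1, x2, var.equal = FALSE, alternative = "greater")   # Welch's two-sample t-test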
Effect size (Glass’s Δ) [Okubo+12]
• In the context of classical significance testing, [Sakai14forum] stresses
the importance of reporting sample effect sizes and confidence intervals (CIs).
• Given your system S1 and a well-known baseline S2 (e.g. BM25), consider
the difference between S1 and S2 standardised by an “ordinary”
standard deviation (i.e., that of the baseline): Δ = (μ1 − μ2) / σ2.
• With classical tests, this can simply be estimated from the sample.
• With Bayesian tests, an EAP and a credible interval for Δ can easily be
obtained from the T realisations.
Unlike Cohen’s d / Hedges’ g, Glass’s Δ is free from the
homoscedasticity assumption.
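A minimal sketch of both routes in R; x1 and x2 are hypothetical per-topic scores, and the parameter name glass follows the par vectors used in the scripts later in this deck.

# Glass's Delta: sample estimate and Bayesian EAP / credible interval (sketch).
x1 <- c(0.45, 0.52, 0.39, 0.61, 0.48)   # hypothetical per-topic scores of S1
x2 <- c(0.41, 0.44, 0.35, 0.58, 0.40)   # hypothetical per-topic scores of the baseline S2
sample_delta <- (mean(x1) - mean(x2)) / sd(x2)   # standardise by the baseline's sd
# Bayesian route: if "glass" is monitored in the Stan model (see the scripts below),
# its EAP and 95% credible interval come straight from the T realisations:
# g <- rstan::extract(fit)$glass
# c(EAP = mean(g), quantile(g, c(0.025, 0.975)))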
Proposal summary
• Classical paired and unpaired t-tests can easily be replaced by
Bayesian tests.
• Report the following in papers:
- EAP for the difference in population means (raw difference);
- 95% credible interval for the above difference;
- P(S1>S2|D), i.e., the probability that H: S1>S2 is correct;
- EAP for Glass’s Δ (standardised difference);
- 95% credible interval for the above Δ.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Purpose of the experiments
For both paired and unpaired tests:
• How does the Bayesian P(S1<S2|D) (i.e., the probability of the less likely
hypothesis) differ from the p-value P(D+|S1=S2)?
• How does the Bayesian credible interval differ from the classical CI?
• How does the Bayesian EAP for Glass’s Δ differ from the classical
sample-based Δ?
Will the Bayesian approach turn the IR literature upside down?
Data
20 runs = 190 run pairs (20 × 19 / 2) were compared
(without considering the familywise error rate)
P(S1<S2|D) vs p-value for paired tests
Bottom line: the p-value can be regarded as a reasonable approximation
of P(S1<S2|D), which is what we really want!
More results in paper
Credible vs Confidence intervals for paired tests
Bottom line: the CI can be regarded as a reasonable approximation
of the credible interval, which is what we really want!
More results in paper
EAP Δ vs sample Δ for paired tests
The classical sample Δ probably underestimates small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
More results in paper
EAP Δ vs sample Δ for paired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
P(S1<S2|D) vs p-value for unpaired tests
Bottom line: the p-value can be regarded as a reasonable approximation
of P(S1<S2|D), which is what we really want!
More results in paper
Credible vs Confidence intervals for unpaired tests
Bottom line: the CI can be regarded as a reasonable approximation
of the credible interval, which is what we really want!
More results in paper
EAP Δ vs sample Δ for unpaired tests
The classical sample Δ probably underestimates small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
same sample Δ from the paired test experiments
EAP Δ vs sample Δ for unpaired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Install R, rstan and Rtools (for Windows)
1. Install R from https://www.r-project.org/
2. In R, install the rstan package (see the snippet below)
3. Install Rtools from https://cran.r-project.org/bin/windows/Rtools/
(check “EDIT the system PATH”)
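For step 2, a plain CRAN install is enough (run in the R console; no special build options are assumed here):

install.packages("rstan", dependencies = TRUE)   # step 2: install rstan from CRAN
library(rstan)                                   # quick check that the package loads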
Install my sample scripts
1. Download http://waseda.box.com/SIGIR2017PACK
2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R .
3. Move sample score files run1_vs_run2.paired.R and
run1_vs_run2.unpaired.R also to C:/work/R .
Try running the scripts (paired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelPaired2.stan" #model written in Stan (see sample file)
par <- c( "mu", "Sigma", "rho", "delta", "glass" ) #mean, variances, correlation, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (total post-burn-in draws = (ite-war)*cha = 100,000)
source( "C:/work/R/run1_vs_run2.paired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" #output files
data <- list( N=N, x=x ) #sample size and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling ("HMC" may be used as the algorithm instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
Try running the scripts (paired test) (2)
Try running the scripts (paired test) (3)
Switch to a UNIX-like environment and process the output csv files using this
shell script (screenshot on the slide; its parameters are the start line number
after burn-in, T, and #chains = #csv files). The script reports:
- EAP/credible interval for the difference
- EAP/credible interval for the Δ using S2’s standard deviation
- P(S1>S2|D) and P(S1<S2|D)
- EAP/credible interval for the correlation coefficient
The thresholds for the probabilities can be modified by editing the script.
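If you would rather stay in R than switch to a shell, the same quantities can be read off the fit object directly. A sketch, assuming the par vector of BayesPaired-sample.R above, where delta is taken to be the difference in means and glass is Glass’s Δ:

# Staying in R instead of the shell script (illustrative sketch).
draws <- rstan::extract(fit)                       # permuted post-burn-in draws from all chains
d <- draws$delta                                   # realisations of the difference
c(EAP = mean(d), quantile(d, c(0.025, 0.975)))     # EAP and 95% credible interval
c(P_S1_gt_S2 = mean(d > 0), P_S1_lt_S2 = mean(d < 0))   # P(S1>S2|D) and P(S1<S2|D)
g <- draws$glass                                   # realisations of Glass's Delta
c(EAP = mean(g), quantile(g, c(0.025, 0.975)))     # EAP and 95% credible interval for Delta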
Try running the scripts (unpaired test) (1)
Try this (the unpaired counterpart of BayesPaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelUnpaired2.stan" #model written in Stan (see sample file)
par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" ) #means, s.d.'s, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (total post-burn-in draws = (ite-war)*cha = 100,000)
source( "C:/work/R/run1_vs_run2.unpaired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" #output files
data <- list( N1=N1, N2=N2, x1=x1, x2=x2 ) #sample sizes and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling ("HMC" may be used as the algorithm instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
Try running the scripts (unpaired test) (2)
Try running the scripts (unpaired test) (3)
Switch to a UNIX-like environment and process the output csv files using this
shell script (screenshot on the slide; its parameters are the start line number
after burn-in, T, and #chains = #csv files). The script reports:
- EAP/credible interval for the difference
- EAP/credible interval for the Δ using S2’s standard deviation
- P(S1>S2|D) and P(S1<S2|D)
The thresholds for the probabilities can be modified by editing the script.
Just as an unpaired t-test p-value is larger than the corresponding paired t-test p-value,
P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger than that with the
paired one (0.006). The unpaired Bayesian test also overestimates Δ (0.222 vs. 0.189).
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
Acknowledgements: I thank
• Professor Hideki Toyoda (Waseda University)
For letting me tweak his R code and for his wonderful books on
Bayesian tests (in Japanese).
Professor Toyoda’s original code is available at
http://www.asakura.co.jp/G_27_2.php?id=200
(with comments in Japanese)
• Dr. Matthew Ekstrand-Abueg (Google)
For letting me play with the TREC Temporal Summarisation results!
References (1)
[Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016.
[Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances.
Philosophical Transactions of the Royal Society of London, 53, 1763.
[Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
[Carterette15] Bayesian Inference for Information Retrieval Evaluation, ACM ICTIR
2015.
[Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
[Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd,
1970.
[Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental
Psychology: General, 142(2), 2013.
[Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
References (2)
[Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
[Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain
Monte Carlo, Chapman & Hall/CRC, 2011.
[Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence
Interval, and Power. Keiso Shobo, 2012.
[Sakai14forum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
[Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by
Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
[Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose.
The American Statistician, 2016.
[Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
[Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs,
Justice, and Lives. The University of Michigan Press, 2008.
