The Probability That
Your Hypothesis Is Correct,
Credible Intervals, and
Effect Sizes for IR Evaluation
Tetsuya Sakai
Waseda University
tetsuyasakai@acm.org
@tetsuyasakai
August 8, 2017 @ SIGIR 2017, Tokyo.
Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature that relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
1. P-values can indicate how incompatible the data are with a
specified statistical model.
2. P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone.
3. Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.
[Wasserstein+16]
P-value = P(D+|H)
Probability of observing the observed data D or
something more extreme UNDER Hypothesis H.
[Wasserstein+16]
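This definition can be made concrete with a tiny randomisation test. The sketch below (Python; the per-topic score differences are made up for illustration, not taken from the talk) estimates a one-sided p-value as the proportion of sign-flipped replicates whose mean difference is at least as large as the observed one, i.e. the probability of data "as or more extreme" UNDER H0:

```python
import random
import statistics

# Hypothetical per-topic score differences between two systems.
# Under H0 (no difference) the differences are symmetric about zero,
# so randomly flipping their signs simulates data generated UNDER H0.
random.seed(0)
diffs = [0.07, 0.06, -0.02, 0.09, 0.04, 0.05, -0.01, 0.08, 0.03, 0.02]
observed = statistics.fmean(diffs)

B = 20_000  # number of sign-flip replicates
extreme = 0
for _ in range(B):
    flipped = [d if random.random() < 0.5 else -d for d in diffs]
    if statistics.fmean(flipped) >= observed:  # "as or more extreme"
        extreme += 1
p_value = extreme / B  # one-sided p-value estimate = P(D+|H0)
print(p_value)
```

Note that this quantity conditions on H0 being true; it says nothing direct about P(H0|D).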
Problems with classical significance testing
• Statistical significance ≠ practical significance
• Dichotomous thinking: statistically significant (p<0.05) or not
• P-values and CIs are often misunderstood (see the ASA statements)
• Even if the p-value is reported, p-value = f(effect_size, sample_size):
a large effect_size (magnitude of the difference) ⇒ a small p-value;
a large sample_size (e.g. #topics) ⇒ a small p-value.
So effect sizes should be reported [Sakai14SIGIRforum].
“I have learned and taught that the primary product of a research inquiry is
one or more measures of effect size, not p values” [Cohen1990]
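The dependence on both quantities can be seen from the paired t statistic, t = mean_diff / (sd_diff / √n) = d·√n, where d is the standardised effect size and n is the number of topics. A toy calculation (not the talk's data) shows sample size alone driving t, and hence the p-value:

```python
import math

# For a paired t-test, t = d * sqrt(n), where d is the standardised
# effect size and n the sample size (#topics). Holding d fixed,
# a larger sample alone yields a larger t and thus a smaller p-value.
def t_statistic(d: float, n: int) -> float:
    return d * math.sqrt(n)

print(t_statistic(0.2, 25))   # small effect, 25 topics: t = 1.0
print(t_statistic(0.2, 400))  # same effect, 400 topics: t = 4.0
```

The same effect size goes from "not significant" to "highly significant" purely by adding topics, which is why the effect size itself must be reported.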
(Figure: [Ziliak+08] on dichotomous thinking vs. practical significance.)
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Statisticians are going Bayesian
According to [Toyoda15], over one-half of Biometrika papers published
in 2014 utilised Bayesian statistics.
(Pictured: William S. Gosset)
Bayes’ rule (x: data; θ: parameter, e.g. population mean) [Bayes1763]:
f(θ|x) = f(x|θ) f(θ) / f(x)
where f(θ|x) is the posterior probability distribution of θ,
f(x|θ) is the likelihood of x given θ,
f(θ) is the prior probability distribution of θ, and
f(x) = ∫ f(x|θ) f(θ) dθ is the normalising constant that ensures ∫ f(θ|x) dθ = 1.
[Fisher1970]
Fisher hated the
Bayesian approach
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
Bayesian methods can discuss P(H|D) directly and can easily
handle various hypotheses. There really is no reason to reject them now.
But classical tests also rely on assumptions
[Kruschke13] Journal of Experimental Psychology
“Some people may wonder which approach, Bayesian
or NHST, is more often correct. This question has
limited applicability because in real research we never
know the ground truth: all we have is a sample of data.
[...] the relevant question is asking which method
provides the richest, most informative, and meaningful
results for any set of data. The answer is always
Bayesian estimation.”
NHST = Null Hypothesis Significance Testing
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
f(θ|x) is governed by the kernel
f(θ|x) = f(x|θ) f(θ) / f(x) ∝ f(x|θ) f(θ)
The property of the posterior f(θ|x) is governed by the kernel f(x|θ) f(θ);
the normalising constant f(x) merely ensures that ∫ f(θ|x) dθ = 1.
f(θ|x) is now governed by the likelihood
We don’t know the prior, so we use a uniform distribution for f(θ). Then
f(θ|x) ∝ f(x|θ)
i.e., the posterior is governed by the likelihood of x given θ.
Expected A Posteriori (EAP) estimate of θ
EAP(θ) = ∫ θ f(θ|x) dθ
With a uniform prior, this is the same as the
Maximum Likelihood Estimate (MLE).
Posterior variance and credible intervals
Posterior variance (how θ moves around):
var(θ|x) = ∫ (θ - EAP(θ))² f(θ|x) dθ
100(1-α)% credible interval for θ: the central region of f(θ|x) that
contains probability mass 1 - α, leaving α/2 in each tail.
Frequentist vs. Bayesian
Frequentist:
• θ is a constant!
• A 95% confidence interval (CI) means:
construct 100 CIs using 100 different samples;
95 of the 100 CIs will actually contain θ, the constant.
Bayesian:
• θ is a random variable!
• The probability that θ lies within
the 95% credible interval is 95%.
(Figure: 100 frequentist CIs around the constant θ, and the posterior
f(θ|x) with its central 100(1-α)% credible interval.)
Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
MCMC (Markov Chain Monte Carlo), in particular,
HMC (Hamiltonian Monte Carlo) and a variant implemented in stan
Yes, but we use uniform priors,
so our EAP estimates are also MLE estimates
MCMC (Markov Chain Monte Carlo)
• Methods for sampling θ repeatedly according to f(θ|x).
• Construct Markov Chains of θ values so that, after a burn-in period (B
values), all of the values obey f(θ|x).
• Collect T’ values sequentially, throw away the initial B values, to
obtain T = T’ – B realisations of θ.
Things that we can obtain from the T realisations
EAP for θ: Just take the average over T realisations of θ.
Credible interval for θ: Just take the middle 100(1-α)% of the T
realisations of θ.
The above methods apply to quantities other than θ itself, including
effect sizes.
Probability that hypothesis H is correct: Just count the number of
realisations that satisfy H and divide by T!
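These recipes can be sketched in a few lines of Python (illustration only: i.i.d. normal draws stand in for real MCMC realisations, and the distribution parameters are made up):

```python
import random
import statistics

# Pretend these are T MCMC realisations of theta; real draws would
# come from Stan, but i.i.d. normal draws suffice to show the recipes.
random.seed(0)
T = 100_000
theta = [random.gauss(0.05, 0.02) for _ in range(T)]

# EAP for theta: just take the average over the T realisations.
eap = statistics.fmean(theta)

# 95% credible interval: just take the middle 95% of the sorted realisations.
alpha = 0.05
s = sorted(theta)
ci = (s[int(T * alpha / 2)], s[int(T * (1 - alpha / 2)) - 1])

# P(H|D) for H: theta > 0 -- count realisations satisfying H, divide by T.
p_h = sum(t > 0 for t in theta) / T

print(eap, ci, p_h)
```

The same counting trick works for any derived quantity (e.g. an effect size): transform each realisation first, then average, trim, or count.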
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (1)
• Given a curved surface in an ideal physical world, any object would
move on the surface while keeping the Hamiltonian (= potential
energy + kinetic energy) constant.
• The path of the object is governed by Hamilton’s equations of
motion, solved by the leap-frog method with parameters ε (stepsize)
and L (leapfrog steps).
• To sample from f(θ|x) (our “curved surface”), we put an object on the
surface and give it a push; after L units of time, record its position and
give it another push...
In practice, ε can be set automatically;
a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
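The mechanics above can be sketched as a toy HMC transition for a standard-normal target (an illustrative sketch under assumptions, not the talk's implementation: the target is f(θ) ∝ exp(-θ²/2), so the potential energy is U(θ) = θ²/2 and grad U(θ) = θ):

```python
import math
import random

def leapfrog(theta, p, eps, L):
    """Leap-frog integration of Hamilton's equations (stepsize eps, L steps)."""
    p -= 0.5 * eps * theta            # half step for the momentum (grad U = theta)
    for _ in range(L - 1):
        theta += eps * p              # full step for the position
        p -= eps * theta              # full step for the momentum
    theta += eps * p
    p -= 0.5 * eps * theta            # final half step for the momentum
    return theta, p

def hmc_step(theta, eps=0.1, L=20, rng=random):
    p0 = rng.gauss(0.0, 1.0)                      # "give the object a push"
    theta_new, p_new = leapfrog(theta, p0, eps, L)
    h0 = 0.5 * theta ** 2 + 0.5 * p0 ** 2         # Hamiltonian before the move
    h1 = 0.5 * theta_new ** 2 + 0.5 * p_new ** 2  # ... and after
    r = math.exp(h0 - h1)                         # near 1 => few rejections
    return theta_new if rng.random() < r else theta
```

Because the leap-frog method nearly conserves the Hamiltonian, r stays close to 1 and almost every proposal is accepted, which is exactly why HMC samples efficiently.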
HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (2)
With HMC, the acceptance probability r is often close to 1
⇒ few rejections = efficient sampling
Stan’s criteria for checking the chains:
• R̂ (the potential scale reduction factor) checks convergence by comparing
the variance across multiple chains with that within the chains.
• The effective sample size n_eff quantifies sampling efficiency: “you have
obtained a sample of size T, but that’s worth a sample of size n_eff
when there is zero correlation within the chain.”
In practice, we let T=100,000, which is much more than enough, so
convergence is not a worry.
HMC/Stan vs. other MCMC methods
[Hoffman+14] HMC’s features “allow it to converge to high-dimensional
target distributions much more quickly than simpler methods such as
random walk Metropolis[-Hastings] or Gibbs sampling”
[Kruschke15] “HMC can be more effective than the various samplers in JAGS
and BUGS, especially for large complex models. [...] However, Stan is not
universally faster or better (at this stage in its development).”
In IR,
[Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs
Sampler), an open-source implementation of BUGS (Bayesian inference
Using Gibbs Sampling);
[Zhang+16] used Metropolis-Hastings.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Consider the problem of comparing two means
Paired test / Unpaired (= two-sample) test
(Figure: S1’s and S2’s per-topic scores under each setting.)
Classical one-sided test:
H0 : S1 = S2 H1 : S1 > S2
p-value = P(D+|S1 = S2):
probability of observing the observed data D
or something more extreme under H0
But what we really want is...
With the Bayesian approach, we can easily obtain
P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by
simply counting the number of realisations that satisfy S1>S2
and dividing by T!
Statistical models (classical and Bayesian)
Paired test:
• Classical paired t-test: score differences obey a normal distribution.
• Bayesian: score pairs obey a bivariate normal distribution.
Unpaired (= two-sample) test:
• S1’s scores obey a normal distribution; so do S2’s scores
(with their own mean and variance).
• Classical test: Welch’s t-test.
• Bayesian: the same two-normal model, expressed as Stan code [Toyoda15].
Effect size (Glass’s Δ) [Okubo+12]
• In the context of classical significance testing, [Sakai14SIGIRforum] stresses
the importance of reporting sample effect sizes and confidence intervals
(CIs).
• Given your system S1 and a well-known baseline S2 (e.g. BM25), let’s
consider the difference between the two population means standardised by an
“ordinary” standard deviation (i.e., that of the baseline): Δ = (μ1 – μ2) / σ2.
• With classical tests, Δ can simply be estimated from the sample.
• With Bayesian tests, an EAP and a credible interval for Δ can easily be
obtained from the T realisations.
• Unlike Cohen’s d / Hedges’ g, Glass’s Δ is free from the
homoscedasticity assumption.
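The sample-based estimate is a one-liner; a minimal sketch (Python; the per-topic scores below are hypothetical, not from the talk):

```python
import statistics

# Hypothetical per-topic scores for system S1 and baseline S2.
s1 = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58, 0.50, 0.46]
s2 = [0.35, 0.49, 0.33, 0.52, 0.41, 0.45, 0.39, 0.50, 0.43, 0.40]

# Sample Glass's Delta: the difference in means standardised by the
# *baseline's* standard deviation, so no homoscedasticity is assumed.
delta = (statistics.fmean(s1) - statistics.fmean(s2)) / statistics.stdev(s2)
print(round(delta, 3))
```

A Bayesian EAP for Δ would instead evaluate (μ1 – μ2)/σ2 on each posterior realisation and average over the T realisations.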
Proposal summary
• Classical paired and unpaired t-tests can easily be replaced by
Bayesian tests.
• Report the following in papers:
- EAP for the difference in population means (the raw difference);
- 95% credible interval for the above difference;
- P(S1>S2|D), i.e., the probability that H: S1>S2 is correct;
- EAP for Glass’s Δ (the standardised difference);
- 95% credible interval for the above Δ.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Purpose of the experiments
For both paired and unpaired tests:
• How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis)
differ from the p-value P(D+|S1=S2)?
• How does the Bayesian credible interval differ from the classical CI?
• How does the Bayesian EAP for Glass’s Δ differ from the classical
sample-based Δ?
Will the Bayesian approach turn the IR literature upside down?
Data
20 runs = 190 run pairs were compared
(w/o considering the familywise error rate)
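The pair count is just "20 choose 2":

```python
import math

# 20 runs compared pairwise yield C(20, 2) distinct run pairs.
print(math.comb(20, 2))  # 190
```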
P(S1<S2|D) vs p-value for paired tests
Bottom line:
the p-value can be regarded as a reasonable
approximation of P(S1<S2|D), which is what
we really want!
More results in paper
Credible vs Confidence intervals for paired tests
Bottom line: the CI can be
regarded as a reasonable
approximation of the
credible interval, which
is what we really want!
More results in paper
EAP Δ vs sample Δ for paired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
More results in paper
EAP Δ vs sample Δ for paired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
P(S1<S2|D) vs p-value for unpaired tests
Bottom line:
the p-value can be regarded as a reasonable
approximation of P(S1<S2|D), which is what
we really want!
More results in paper
Credible vs Confidence intervals for unpaired tests
Bottom line: the CI can be
regarded as a reasonable
approximation of the
credible interval, which
is what we really want!
More results in paper
EAP Δ vs sample Δ for unpaired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke13]
same sample Δ from the paired test experiments
EAP Δ vs sample Δ for unpaired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Install R, rstan and Rtools (for Windows)
1. Install R from https://www.r-project.org/
2. On R, install a package called rstan
3. Install Rtools from https://cran.r-project.org/bin/windows/Rtools/
(check “EDIT the system PATH”)
Install my sample scripts
1. Download http://waseda.box.com/SIGIR2017PACK
2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R .
3. Move sample score files run1_vs_run2.paired.R and
run1_vs_run2.unpaired.R also to C:/work/R .
Try running the scripts (paired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelPaired2.stan" #model written in Stan (see sample file)
par <- c( "mu", "Sigma", "rho", "delta", "glass" ) #mean, variances, correlation, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.paired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" #output files
data <- list( N=N, x=x ) #sample size and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling (HMC may be used instead of NUTS)
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
Try running the scripts (paired test) (2)
Try running the scripts (paired test) (3)
Switch to a UNIX-like environment and process the output csv files
using this shell script (its parameters: the start line number after
burn-in, T, and #chains = #csv files). The script reports:
- EAP/credible interval for the difference;
- EAP/credible interval for the Δ using S2’s standard deviation;
- P(S1>S2|D) and P(S1<S2|D);
- EAP/credible interval for the correlation coefficient.
The thresholds for the probabilities can be modified by editing the script.
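For readers without a UNIX shell, the post-processing can be sketched in Python (a hedged sketch, not the talk's script; the column name "delta" and the warmup-skipping convention are assumptions about the CSV layout):

```python
import csv
import glob

# Pool the post-warmup draws of one parameter column from all chain CSVs
# (rstan's sample_file CSVs contain '#'-prefixed comment lines), then
# report the EAP, a 95% credible interval, and P(S1>S2|D).
def summarise(pattern, column="delta", alpha=0.05, warmup=1000):
    draws = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            reader = csv.DictReader(row for row in f if not row.startswith("#"))
            vals = [float(r[column]) for r in reader]
        draws.extend(vals[warmup:])  # drop this chain's burn-in draws
    draws.sort()
    T = len(draws)
    eap = sum(draws) / T                              # EAP
    ci = (draws[int(T * alpha / 2)],                  # middle 100(1-alpha)%
          draws[int(T * (1 - alpha / 2)) - 1])
    p_gt = sum(d > 0 for d in draws) / T              # P(S1 > S2 | D)
    return eap, ci, p_gt
```

For example, `summarise("C:/work/R/run1_vs_run2.NUTS.paired_*.csv")` would pool all five chains.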
Try running the scripts (unpaired test) (1)
Try this (BayesUnpaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelUnpaired2.stan" #model written in Stan (see sample file)
par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" ) #means, s.d.'s, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.unpaired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" #output files
data <- list( N1=N1, N2=N2, x1=x1, x2=x2 ) #sample sizes and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the Stan file and execute sampling (HMC may be used instead of NUTS)
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
Try running the scripts (unpaired test) (2)
Try running the scripts (unpaired test) (3)
Switch to a UNIX-like environment and process the output csv files
using this shell script (its parameters: the start line number after
burn-in, T, and #chains = #csv files). The script reports:
- EAP/credible interval for the difference;
- EAP/credible interval for the Δ using S2’s standard deviation;
- P(S1>S2|D) and P(S1<S2|D).
The thresholds for the probabilities can be modified by editing the script.
Just as an unpaired t-test p-value is larger than the corresponding paired t-test p-value,
P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger than that with the
paired one (0.006). The unpaired Bayesian test also overestimates Δ (0.222 vs. 0.189).
TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature that relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
Acknowledgements: I thank
• Professor Hideki Toyoda (Waseda University)
For letting me tweak his R code and for his wonderful books on
Bayesian tests (in Japanese).
Professor Toyoda’s original code is available at
http://www.asakura.co.jp/G_27_2.php?id=200
(with comments in Japanese)
• Dr. Matthew Ekstrand-Abueg (Google)
For letting me play with the TREC Temporal Summarisation results!
References (1)
[Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016.
[Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances.
Philosophical Transactions of the Royal Society of London, 53, 1763.
[Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
[Carterette15] Bayesian Inference for Information Retrieval Evaluation, ACM ICTIR
2015.
[Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
[Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd,
1970.
[Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental
Psychology: General, 142(2), 2013.
[Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
References (2)
[Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
[Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain
Monte Carlo, Chapman & Hall/CRC, 2011.
[Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence
Interval, and Power. Keiso Shobo, 2012.
[Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
[Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by
Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
[Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose.
The American Statistician, 2016.
[Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
[Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs,
Justice, and Lives. The University of Michigan Press, 2008.

 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 

sigir2017bayesian

  • 1. The Probability That Your Hypothesis Is Correct, Credible Intervals, and Effect Sizes for IR Evaluation Tetsuya Sakai Waseda University tetsuyasakai@acm.org @tetsuyasakai August 8, 2017@SIGIR2017, Tokyo.
  • 2. Takeaways • Many statisticians now use Bayesian statistics to discuss P(H|D) instead of the classical p-values (P(D+|H)). Likewise, the IR community should not shy away from the Bayesian approach, since it enables us to easily discuss P(H|D) for virtually any hypothesis H. • Results in the IR literature which relied on classical significance tests are not necessarily wrong, since P(H|D) and credible intervals are actually quite similar to p-values and confidence intervals, respectively. • Starting today, report the right statistics, including the effect sizes, using Bayesian statistics (perhaps along with classical ones). Simple tools are available from: http://waseda.box.com/SIGIR2017PACK
  • 3. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 4. 1. P-values can indicate how incompatible the data are with a specified statistical model. 2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. [Wasserstein+16]
  • 5. 4. Proper inference requires full reporting and transparency. 5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. P-value = P(D+|H) Probability of observing the observed data D or something more extreme UNDER Hypothesis H. [Wasserstein+16]
  • 6. Problems with classical significance testing • Statistical significance ≠ practical significance • Dichotomous thinking: statistically significant (p<0.05) or not • P-values and CIs are often misunderstood (see the ASA statements) • Even if the p-value is reported, p-value = f(effect_size, sample_size) large effect_size (magnitude of difference) ⇒ small p-value large sample_size (e.g. #topics) ⇒ small p-value. So effect sizes should be reported [Sakai14SIGIRforum]. “I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p values” [Cohen1990]
  • 8. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 9. Statisticians are going Bayesian According to [Toyoda15], over one-half of the Biometrika papers published in 2014 utilised Bayesian statistics. [Photo: William S. Gosset]
  • 10. Bayes’ rule (x: data; θ: parameter, e.g. population mean): f(θ|x) = f(x|θ)f(θ) / f(x), where f(θ|x) is the posterior probability distribution of θ, f(x|θ) is the likelihood of x given θ, f(θ) is the prior probability distribution of θ, and f(x) is a normalising constant that ensures ∫ f(θ|x) dθ = 1. [Bayes1763]
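As a toy illustration of Bayes' rule (this is not part of the talk's R/stan toolkit), a grid approximation makes the roles of the prior, the likelihood, and the normalising constant concrete; the data, the known sigma, and the grid below are all made-up values:

```python
import numpy as np

# Estimate a population mean theta from data x, assuming a normal likelihood
# with a known standard deviation and a uniform (noninformative) prior.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=0.1, size=25)  # hypothetical per-topic scores

grid = np.linspace(0.0, 1.0, 2001)           # candidate theta values
dx = grid[1] - grid[0]
prior = np.ones_like(grid)                   # uniform prior over the grid
# log-likelihood of the whole sample under each candidate theta
loglik = np.array([np.sum(-0.5 * ((x - t) / 0.1) ** 2) for t in grid])
kernel = prior * np.exp(loglik - loglik.max())  # prior * likelihood (the kernel)
posterior = kernel / (kernel.sum() * dx)        # divide by the normalising constant

eap = np.sum(grid * posterior) * dx             # posterior mean (EAP)
```

With a uniform prior, the EAP computed this way coincides (up to grid error) with the maximum-likelihood estimate, i.e. the sample mean, as the later slides note.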
  • 12. Old criticisms on the Bayesian approach (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x).
  • 13. Old criticisms on the Bayesian approach (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x). STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform distribution) is a subjective choice. NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST! Bayesian methods can discuss P(H|D) directly and can easily handle various hypotheses. There really is no reason to reject them now. But classical tests also rely on assumptions
  • 14. [Kruschke13] Journal of Experimental Psychology “Some people may wonder which approach, Bayesian or NHST, is more often correct. This question has limited applicability because in real research we never know the ground truth: all we have is a sample of data. [...] the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” NHST = Null Hypothesis Significance Testing
  • 15. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 16. f(θ|x) is governed by the kernel The posterior probability distribution of θ is f(θ|x) ∝ f(x|θ)f(θ); the normalising constant only ensures ∫ f(θ|x) dθ = 1, so the property of f(θ|x) is governed by the kernel f(x|θ)f(θ).
  • 17. f(θ|x) is now governed by the likelihood Posterior probability distribution of θ Likelihood of x given θ We don’t know the prior, so we use a uniform distribution Expected A Posteriori (EAP) estimate of θ With a uniform prior, this is the same as the Maximum Likelihood Estimate (MLE)
  • 18. Posterior variance and credible intervals Posterior variance: how θ moves around. Credible interval: the central region of f(θ|x) containing probability 1 – α, with α/2 in each tail, is the 100(1-α)% credible interval for θ.
  • 19. Frequentist vs. Bayesian Frequentist: • θ is a constant! • A 95% confidence interval (CI) means: construct 100 CIs using 100 different samples; 95 of the 100 CIs will actually contain θ, the constant. Bayesian: • θ is a random variable! • The probability that θ lies within the 95% credible interval is 95%.
  • 20. Old criticisms on the Bayesian approach (1) Nobody knows the prior probability distribution f(θ) and your choice of f(θ) is highly subjective. (2) It is computationally not feasible to obtain the posterior probability distribution f(θ|x). STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform distribution) is a subjective choice. NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST! MCMC (Markov Chain Monte Carlo), in particular, HMC (Hamiltonian Monte Carlo) and a variant implemented in stan Yes, but we use uniform priors, so our EAP estimates are also MLE estimates
  • 21. MCMC (Markov Chain Monte Carlo) • Methods for sampling θ repeatedly according to f(θ|x). • Construct Markov Chains of θ values so that, after a burn-in period (B values), all of the values obey f(θ|x). • Collect T’ values sequentially, throw away the initial B values, to obtain T = T’ – B realisations of θ.
  • 22. Things that we can obtain from the T realisations EAP for θ: Just take the average over T realisations of θ. Credible interval for θ: Just take the middle 100(1-α)% of the T realisations of θ. The above methods apply to quantities other than θ itself, including effect sizes. Probability that hypothesis H is correct: Just count the number of realisations that satisfy H and divide by T!
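The counting recipes above can be sketched in a few lines. In this illustrative sketch the T realisations are faked with a normal random generator; in practice they would come from the sampler's output (e.g. stan's csv files):

```python
import numpy as np

# Fake T posterior draws of a quantity of interest, e.g. mu1 - mu2.
rng = np.random.default_rng(42)
T = 100_000
draws = rng.normal(loc=0.02, scale=0.01, size=T)

# EAP: just the average over the T realisations.
eap = draws.mean()

# 95% credible interval: the middle 95% of the T realisations.
lo, hi = np.percentile(draws, [2.5, 97.5])

# P(H|D) for H: "the quantity is positive" (e.g. S1 > S2):
# count the realisations satisfying H and divide by T.
p_h = (draws > 0.0).mean()
```

The same three summaries work unchanged for any derived quantity, such as an effect size computed from each draw.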
  • 23. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]: a state-of-the-art MCMC method (1) • Given a curved surface in an ideal physical world, any object would move on the surface while keeping the Hamiltonian (= potential energy + kinetic energy) constant. • The path of the object is governed by Hamilton’s equations of motion, solved by the leap-frog method with parameters ε (stepsize) and L (leapfrog steps). • To sample from f(θ|x) (our “curved surface”), we put an object on the surface and give it a push; after L units of time, record its position and give it another push... In practice, ε can be set automatically; a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
  • 24. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]: a state-of-the-art MCMC method (2) With HMC, the acceptance probability r is often close to 1 ⇒ few rejections = efficient sampling
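The mechanics on these two slides can be illustrated with a deliberately minimal leap-frog HMC sampler for a 1-D standard normal target. This toy sketch is nothing like stan's implementation; ε and L are the stepsize and leapfrog-step count from the previous slide:

```python
import numpy as np

def hmc(n_samples, eps=0.1, L=20, seed=0):
    """Toy HMC for a standard normal target: U(q) = q^2/2 (negative log density)."""
    rng = np.random.default_rng(seed)
    U = lambda q: 0.5 * q * q
    grad_U = lambda q: q
    q, out = 0.0, []
    for _ in range(n_samples):
        p0 = rng.normal()                        # fresh momentum; kinetic energy p^2/2
        q_new, p_new = q, p0
        p_new -= 0.5 * eps * grad_U(q_new)       # initial half step for momentum
        for _ in range(L):
            q_new += eps * p_new                 # full step for position
            p_new -= eps * grad_U(q_new)         # full step for momentum
        p_new += 0.5 * eps * grad_U(q_new)       # turn the last update into a half step
        h0 = U(q) + 0.5 * p0 * p0                # Hamiltonian before the trajectory
        h1 = U(q_new) + 0.5 * p_new * p_new      # ... and after (nearly conserved)
        if rng.random() < np.exp(h0 - h1):       # Metropolis acceptance; r ≈ 1 for good eps, L
            q = q_new
        out.append(q)
    return np.array(out)

draws = hmc(5000)
```

Because the leap-frog integrator nearly conserves the Hamiltonian, h0 - h1 stays close to 0 and the acceptance probability stays close to 1, which is exactly the efficiency argument made on the slide.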
  • 25. Stan’s criteria for checking the chains • The R-hat statistic checks convergence by comparing the variance across multiple chains with that within the chains. • The effective sample size n_eff quantifies sampling efficiency: “you have obtained a sample of size T, but that’s worth a sample of size n_eff when there is zero correlation within the chain.” In practice, we let T=100,000, which is much more than enough for convergence to be a concern.
  • 26. HMC/Stan vs. other MCMC methods [Hoffman+14] HMC’s features “allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis[-Hastings] or Gibbs sampling” [Kruschke15] “HMC can be more effective than the various samplers in JAGS and BUGS, especially for large complex models. [...] However, Stan is not universally faster or better (at this stage in its development).” In IR, [Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs Sampler), an open-source implementation of BUGS (Bayesian inference Using Gibbs Sampling); [Zhang+16] used Metropolis-Hastings.
  • 27. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 28. Consider the problem of comparing two means Paired test Unpaired (=Two-sample) test Classical one-sided test: H0 : S1 = S2 H1 : S1 > S2 p-value = P(D+|S1 = S2) : S1’s scores S2’s scores S1’s scores S2’s scores Probability of observing the observed data D or something more extreme under H0
  • 29. But what we really want is... Paired test Unpaired (=Two-sample) test Classical one-sided test: H0 : S1 = S2 H1 : S1 > S2 p-value = P(D+|S1 = S2) : S1’s scores S2’s scores S1’s scores S2’s scores With the Bayesian approach, we can easily obtain P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by simply counting the number of realisations that satisfy S1>S2 and dividing by T!
  • 30. Statistical models (classical and Bayesian) Paired test Classical paired t-test: score differences obey a normal distribution Bayesian: scores obey a bivariate normal distribution Unpaired (=Two-sample) test S1’s scores obey one normal distribution and S2’s scores obey another Classical test: Welch’s t-test Bayesian: Stan code [Toyoda15]
  • 31. Effect size (Glass’s Δ) [Okubo+12] • In the context of classical significance testing, [Sakai14SIGIRforum] stresses the importance of reporting sample effect sizes and confidence intervals (CIs). • Given your system S1 and a well-known baseline S2 (e.g. BM25), let’s consider the difference between S1 and S2 standardised by an “ordinary” standard deviation (i.e., that of the baseline): • With classical tests, this can simply be estimated from the sample. • With Bayesian tests, an EAP and a credible interval for Δ can easily be obtained from the T realisations. Unlike Cohen’s d/Hedges’ g, Δ is free from the homoscedasticity assumption
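A sketch of how the Glass's Δ summaries fall out of posterior draws: one Δ is computed per draw, and the EAP, credible interval, and P(S1>S2|D) are then the same counting summaries as before. The draws for mu1, mu2, and sigma2 below are fabricated stand-ins for real sampler output:

```python
import numpy as np

# Fabricated posterior draws (in practice these come from the MCMC output):
rng = np.random.default_rng(7)
T = 50_000
mu1 = rng.normal(0.32, 0.01, T)       # population mean of your system S1
mu2 = rng.normal(0.28, 0.01, T)       # population mean of the baseline S2
sigma2 = rng.normal(0.15, 0.005, T)   # standard deviation of the BASELINE only

# Glass's Delta standardises by the baseline's s.d., so no homoscedasticity
# assumption is needed (unlike Cohen's d / Hedges' g).
delta_draws = (mu1 - mu2) / sigma2

eap_delta = delta_draws.mean()                       # EAP for Glass's Delta
ci_lo, ci_hi = np.percentile(delta_draws, [2.5, 97.5])  # 95% credible interval
p_s1_gt_s2 = (mu1 > mu2).mean()                      # P(S1 > S2 | D)
```

These five numbers (difference EAP and interval, Δ EAP and interval, and P(S1>S2|D)) are exactly the quantities the next slide proposes reporting.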
  • 32. Proposal summary • Classical paired and unpaired t-tests can easily be replaced by Bayesian tests. • Report the following in papers: - EAP for the difference in population means; - 95% credible interval for the above difference; - P(S1>S2|D), i.e., probability that H: S1>S2 is correct; - EAP for Glass’s Δ; - 95% credible interval for the above Δ. Raw difference Standardised difference
  • 33. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 34. Purpose of the experiments For both paired and unpaired tests: • How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis) differ from the p-value P(D+|S1=S2)? • How does the Bayesian credible interval differ from the classical CI? • How does the Bayesian EAP for Glass’s Δ differ from the classical sample-based Δ? Will the Bayesian approach turn the IR literature upside down?
  • 35. Data 20 runs = 190 run pairs were compared (w/o considering the familywise error rate)
  • 36. P(S1<S2|D) vs p-value for paired tests Bottom line: the p-value can be regarded as a reasonable approximation of P(S1<S2|D), which is what we really want! More results in paper
  • 37. Credible vs Confidence intervals for paired tests Bottom line: the CI can be regarded as a reasonable approximation of the credible interval, which is what we really want! More results in paper
  • 38. EAP Δ vs sample Δ for paired tests The classical sample Δ probably underestimates small effect sizes. “the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” [Kruschke13] More results in paper
  • 39. EAP Δ vs sample Δ for paired tests – an anomalous result • This measure from the TREC Temporal Summarisation Track [Aslam+16] is like a nugget-based F-measure over a timeline. • The actual score range of this measure is [0, 0.4021] rather than [0,1], with very low standard deviations (and hence very high effect sizes). • The measure is not as well understood as the others and deserves an investigation.
  • 40. P(S1<S2|D) vs p-value for unpaired tests Bottom line: the p-value can be regarded as a reasonable approximation of P(S1<S2|D), which is what we really want! More results in paper
  • 41. Credible vs Confidence intervals for unpaired tests Bottom line: the CI can be regarded as a reasonable approximation of the credible interval, which is what we really want! More results in paper
  • 42. EAP Δ vs sample Δ for unpaired tests The classical sample Δ probably underestimates small effect sizes. “the relevant question is asking which method provides the richest, most informative, and meaningful results for any set of data. The answer is always Bayesian estimation.” [Kruschke13] Same sample Δ as in the paired test experiments
  • 43. EAP Δ vs sample Δ for unpaired tests – an anomalous result • This measure from the TREC Temporal Summarisation Track [Aslam+16] is like a nugget-based F-measure over a timeline. • The actual score range of this measure is [0, 0.4021] rather than [0,1], with very low standard deviations (and hence very high effect sizes). • The measure is not as well understood as the others and deserves an investigation.
  • 44. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 45. Install R, rstan and Rtools (for Windows) 1. Install R from https://www.r-project.org/ 2. On R, install a package called rstan 3. Install Rtools from https://cran.r-project.org/bin/windows/Rtools/ (check “EDIT the system PATH”)
  • 46. Install my sample scripts 1. Download http://waseda.box.com/SIGIR2017PACK 2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R . 3. Move sample score files run1_vs_run2.paired.R and run1_vs_run2.unpaired.R also to C:/work/R .
  • 47. Try running the scripts (paired test) (1) Try this (BayesPaired-sample.R) on the R interface:

library(rstan)
scr <- "C:/work/R/modelPaired2.stan" # model written in Stan (see sample file)
par <- c( "mu", "Sigma", "rho", "delta", "glass" ) # mean, variances, correlation, delta, Glass's delta
war <- 1000 # burn-in
ite <- 21000 # iterations including burn-in
see <- 1234 # seed
dig <- 3 # significant digits
cha <- 5 # chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.paired.R" ) # per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" # output files
data <- list( N=N, x=x ) # sample size and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see,
             algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
             # compile the Stan file and execute the sampling ("HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )

Five output csv files: C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
  • 48. Try running the scripts (paired test) (2)
  • 49. Try running the scripts (paired test) (3) Switch to a UNIX-like environment and process the output csv files using this shell script. [Shell-script screenshot; annotations: start line number after burn-in, T, #chains = #csv files, EAP/credible interval for the difference, EAP/credible interval for the Δ using S2’s standard deviation, P(S1>S2|D) and P(S1<S2|D), EAP/credible interval for the correlation coefficient.] The thresholds for the probabilities can be modified by editing the script
  • 50. Try running the scripts (unpaired test) (1) Try this (BayesPaired-sample.R) on the R interface:

library(rstan)
scr <- "C:/work/R/modelUnpaired2.stan" # model written in Stan (see sample file)
par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" ) # means, s.d.'s, delta, Glass's delta
war <- 1000 # burn-in
ite <- 21000 # iterations including burn-in
see <- 1234 # seed
dig <- 3 # significant digits
cha <- 5 # chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.unpaired.R" ) # per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" # output files
data <- list( N1=N1, N2=N2, x1=x1, x2=x2 ) # sample sizes and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see,
             algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
             # compile the Stan file and execute the sampling ("HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )

Five output csv files: C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
  • 51. Try running the scripts (unpaired test) (2)
  • 52. Try running the scripts (unpaired test) (3) Switch to a UNIX-like environment and process the output csv files using this shell script. [Shell-script screenshot; annotations: start line number after burn-in, T, #chains = #csv files, EAP/credible interval for the difference, EAP/credible interval for the Δ using S2’s standard deviation, P(S1>S2|D) and P(S1<S2|D).] The thresholds for the probabilities can be modified by editing the script. Just as an unpaired t-test p-value is larger than the corresponding paired t-test p-value, P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger than that with the paired one (0.006). The unpaired Bayesian test also overestimates Δ (0.222 vs. 0.189).
  • 53. TALK OUTLINE 1. Limitations of classical significance testing (yet again) 2. Why go Bayesian 3. Bayesian basics 4. Proposals 5. Experiments: Bayesian vs. Frequentist 6. How to go Bayesian, with R and stan 7. Takeaways again
  • 54. Takeaways • Many statisticians now use Bayesian statistics to discuss P(H|D) instead of the classical p-values (P(D+|H)). Likewise, the IR community should not shy away from the Bayesian approach, since it enables us to easily discuss P(H|D) for virtually any hypothesis H. • Results in the IR literature which relied on classical significance tests are not necessarily wrong, since P(H|D) and credible intervals are actually quite similar to p-values and confidence intervals, respectively. • Starting today, report the right statistics, including the effect sizes, using Bayesian statistics (perhaps along with classical ones). Simple tools are available from: http://waseda.box.com/SIGIR2017PACK
  • 55. Acknowledgements: I thank • Professor Hideki Toyoda (Waseda University) For letting me tweak his R code and for his wonderful books on Bayesian tests (in Japanese). Professor Toyoda’s original code is available at http://www.asakura.co.jp/G_27_2.php?id=200 (with comments in Japanese) • Dr. Matthew Ekstrand-Abueg (Google) For letting me play with the TREC Temporal Summarisation results!
  • 56. References (1) [Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016. [Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 1763. [Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931). [Carterette15] Bayesian Inference for Information Retrieval Evaluation, ACM ICTIR 2015. [Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990. [Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd, 1970. [Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental Psychology: General, 142(2), 2013. [Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
  • 57. References (2) [Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015. [Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain Monte Carlo, Chapman & Hall, 2011. [Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval, and Power. Keiso Shobo, 2012. [Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014. [Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015. [Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose. The American Statistician, 2016. [Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016. [Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. The University of Michigan Press, 2008.