sigir2017bayesian
1. The Probability That
Your Hypothesis Is Correct,
Credible Intervals, and
Effect Sizes for IR Evaluation
Tetsuya Sakai
Waseda University
tetsuyasakai@acm.org
@tetsuyasakai
August 8, 2017@SIGIR2017, Tokyo.
2. Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
3. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
4. 1. P-values can indicate how incompatible the data are with a
specified statistical model.
2. P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone.
3. Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.
[Wasserstein+16]
5. 4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.
P-value = P(D+|H)
Probability of observing the observed data D or
something more extreme UNDER Hypothesis H.
[Wasserstein+16]
6. Problems with classical significance testing
• Statistical significance ≠ practical significance
• Dichotomous thinking: statistically significant (p<0.05) or not
• P-values and CIs are often misunderstood (see the ASA statements)
• Even if the p-value is reported, p-value = f(effect_size, sample_size)
large effect_size (magnitude of difference) ⇒ small p-value
large sample_size (e.g. #topics) ⇒ small p-value.
So effect sizes should be reported [Sakai14SIGIRforum].
“I have learned and taught that the primary product of a research inquiry is
one or more measures of effect size, not p values” [Cohen1990]
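Since p-value = f(effect_size, sample_size), the same effect can yield almost any p-value depending on n. A minimal sketch (mine, not the speaker's), using a one-sided z-test with known variance, makes this concrete:

```python
# Demo: for a fixed standardised effect size, the one-sided z-test
# p-value shrinks as the sample size grows.
from math import sqrt
from statistics import NormalDist

def one_sided_z_pvalue(effect_size, n):
    """p = P(Z >= effect_size * sqrt(n)) under H0: no difference."""
    z = effect_size * sqrt(n)
    return 1.0 - NormalDist().cdf(z)

effect = 0.2  # a "small" effect in Cohen's terms
for n in (25, 100, 400):
    print(n, round(one_sided_z_pvalue(effect, n), 4))
```

The effect size is identical in all three cases; only the number of topics changes, yet the verdict flips from "not significant" to "highly significant".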
8. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
9. Statisticians are going Bayesian
According to [Toyoda15], over one-half of Biometrika papers published
in 2014 utilised Bayesian statistics.
(Slide image: a portrait of William S. Gosset.)
10. Bayes’ rule (x: data; θ: parameter, e.g. population mean) [Bayes1763]
f(θ|x) = f(x|θ) f(θ) / ∫ f(x|θ) f(θ) dθ
• f(θ|x): posterior probability distribution of θ
• f(x|θ): likelihood of x given θ
• f(θ): prior probability distribution of θ
• ∫ f(x|θ) f(θ) dθ: normalising constant that ensures ∫ f(θ|x) dθ = 1
12. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
13. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
Bayesian methods can discuss P(H|D) directly and can easily
handle various hypotheses. There really is no reason to reject them now.
But classical tests also rely on assumptions
14. [Kruschke13] Journal of Experimental Psychology
“Some people may wonder which approach, Bayesian
or NHST, is more often correct. This question has
limited applicability because in real research we never
know the ground truth: all we have is a sample of data.
[...] the relevant question is asking which method
provides the richest, most informative, and meaningful
results for any set of data. The answer is always
Bayesian estimation.”
NHST = Null Hypothesis Significance Testing
15. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
16. f(θ|x) is governed by the kernel
f(θ|x) = f(x|θ) f(θ) / ∫ f(x|θ) f(θ) dθ
• f(θ|x): posterior probability distribution of θ
• ∫ f(x|θ) f(θ) dθ: normalising constant that ensures ∫ f(θ|x) dθ = 1
• The shape of f(θ|x) is governed by the kernel f(x|θ) f(θ); the
denominator is just a constant.
17. f(θ|x) is now governed by the likelihood
We don’t know the prior, so use a uniform distribution; then
f(θ|x) ∝ f(x|θ), the likelihood of x given θ.
Expected A Posteriori (EAP) estimate of θ: the posterior mean, ∫ θ f(θ|x) dθ.
With a uniform prior, this is the same as the
Maximum Likelihood Estimate (MLE).
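A toy sketch (not from the talk) illustrating this point: with a flat prior on a normal population mean, the posterior is proportional to the likelihood, so the EAP estimate, here computed by grid approximation, coincides with the sample mean, i.e. the MLE. The data values are invented.

```python
# With a flat prior on a normal mean, f(theta|x) ∝ f(x|theta),
# so the EAP (posterior mean) equals the sample mean (the MLE).
from math import exp

x = [1.2, 0.8, 1.5, 1.1, 0.9]        # observed data (made up)
sigma = 1.0                           # known population s.d.
n = len(x)
xbar = sum(x) / n                     # the MLE of the mean

def loglik(mu):                       # log f(x | mu) up to a constant
    return -sum((xi - mu) ** 2 for xi in x) / (2 * sigma ** 2)

# Grid approximation of the posterior under a uniform prior on [-2, 4].
grid = [i / 1000.0 for i in range(-2000, 4001)]
weights = [exp(loglik(mu)) for mu in grid]
total = sum(weights)
eap = sum(mu * w for mu, w in zip(grid, weights)) / total

print(round(xbar, 3), round(eap, 3))   # both ≈ 1.1
```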
18. Posterior variance and credible intervals
Posterior variance: how θ moves around its posterior mean.
Credible interval: the central region of f(θ|x) containing probability 1 – α,
with α/2 left in each tail; this is the 100(1–α)% credible interval for θ.
19. Frequentist vs. Bayesian
Frequentist:
• θ is a constant!
• A 95% confidence interval (CI) means:
construct 100 CIs using 100 different samples;
95 of the 100 CIs will actually contain θ, the constant.
Bayesian:
• θ is a random variable!
• The probability that θ lies within the 95% credible interval is 95%.
(Slide figures: 100 CIs around the constant θ; the central 100(1–α)% of f(θ|x).)
20. Old criticisms on the Bayesian approach
(1) Nobody knows the prior probability distribution f(θ) and your
choice of f(θ) is highly subjective.
(2) It is computationally not feasible to obtain the posterior probability
distribution f(θ|x).
STILL A VALID CRITICISM. Even a noninformative prior (e.g. uniform
distribution) is a subjective choice.
NO LONGER VALID. RELIABLE SAMPLING-BASED SOLUTIONS EXIST!
MCMC (Markov Chain Monte Carlo), in particular,
HMC (Hamiltonian Monte Carlo) and a variant implemented in stan
Yes, but we use uniform priors,
so our EAP estimates are also MLE estimates
21. MCMC (Markov Chain Monte Carlo)
• Methods for sampling θ repeatedly according to f(θ|x).
• Construct Markov chains of θ values so that, after a burn-in period (B
values), all of the values obey f(θ|x).
• Collect T’ values sequentially and throw away the initial B values to
obtain T = T’ – B realisations of θ.
22. Things that we can obtain from the T realisations
EAP for θ: Just take the average over T realisations of θ.
Credible interval for θ: Just take the middle 100(1-α)% of the T
realisations of θ.
The above methods apply to quantities other than θ itself, including
effect sizes.
Probability that hypothesis H is correct: Just count the number of
realisations that satisfy H and divide by T!
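A sketch (mine, in Python; the talk uses R) of these three recipes, applied to simulated draws standing in for the T MCMC realisations:

```python
import random
random.seed(1)

# Stand-in for T MCMC realisations of theta after burn-in
# (here: independent draws from N(0.5, 1), for illustration only).
draws = [random.gauss(0.5, 1.0) for _ in range(100_000)]
T = len(draws)

eap = sum(draws) / T                             # EAP estimate of theta

s = sorted(draws)                                # 95% credible interval:
lo, hi = s[int(0.025 * T)], s[int(0.975 * T)]    # the middle 95% of the draws

p_h = sum(d > 0 for d in draws) / T              # P(H: theta > 0 | data):
                                                 # count and divide by T
print(round(eap, 2), round(lo, 2), round(hi, 2), round(p_h, 2))
```

The same counting trick works for any derived quantity: transform each realisation (e.g. into an effect size) and summarise the transformed draws.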
23. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (1)
• Given a curved surface in an ideal physical world, any object would
move on the surface while keeping the Hamiltonian (= potential
energy + kinetic energy) constant.
• The path of the object is governed by Hamilton’s equations of
motion, solved by the leapfrog method with parameters ε (step size)
and L (number of leapfrog steps).
• To sample from f(θ|x) (our “curved surface”), we put an object on the
surface and give it a push; after L units of time, record its position and
give it another push...
In practice, ε can be set automatically;
a variant of HMC called NUTS (No-U-Turn Sampler) sets L automatically.
24. HMC (Hamiltonian Monte Carlo) [Kruschke15, Neal11]:
a state-of-the-art MCMC method (2)
With HMC, the acceptance ratio r is often close to 1
⇒ few rejections = efficient sampling
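To make the mechanics concrete, here is a minimal HMC sampler for a one-dimensional standard-normal target (my illustration of the leapfrog and accept/reject steps only; the talk's actual models are fitted with stan):

```python
import math, random
random.seed(0)

def grad_U(q):           # potential U(q) = q^2/2 for a N(0,1) target
    return q

def leapfrog(q, p, eps, L):
    p -= 0.5 * eps * grad_U(q)            # half step for momentum
    for _ in range(L - 1):
        q += eps * p                      # full step for position
        p -= eps * grad_U(q)              # full step for momentum
    q += eps * p
    p -= 0.5 * eps * grad_U(q)            # final half step
    return q, p

def hmc(n_iter, eps=0.1, L=20):
    q, draws, accepted = 0.0, [], 0
    for _ in range(n_iter):
        p0 = random.gauss(0, 1)                        # the "push"
        q_new, p_new = leapfrog(q, p0, eps, L)
        h_old = 0.5 * q * q + 0.5 * p0 * p0            # H = U + K
        h_new = 0.5 * q_new * q_new + 0.5 * p_new * p_new
        if random.random() < math.exp(h_old - h_new):  # r close to 1
            q, accepted = q_new, accepted + 1
        draws.append(q)
    return draws, accepted / n_iter

draws, rate = hmc(5000)
burned = draws[500:]                                   # discard burn-in
mean = sum(burned) / len(burned)
var = sum((d - mean) ** 2 for d in burned) / len(burned)
print(round(mean, 2), round(var, 2), round(rate, 2))
```

Because the leapfrog integrator nearly conserves the Hamiltonian, h_old ≈ h_new and the acceptance ratio stays close to 1, exactly the efficiency claimed on this slide.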
25. Stan’s criteria for checking the chains
• R-hat checks convergence by comparing the variance across multiple
chains with that within the chains.
• The effective sample size quantifies sampling efficiency: “you have
obtained a sample of size T, but that’s worth a sample of size
T_eff = T / (1 + 2 Σ_k ρ_k) when there is correlation within the chain”
(ρ_k: lag-k autocorrelation; T_eff = T when there is zero correlation).
In practice, we let T = 100,000, which is much more than enough, so
convergence is not a worry.
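A sketch of the between/within-chain convergence check (the classic Gelman–Rubin R-hat; stan's actual implementation differs in details such as split chains):

```python
import random
random.seed(42)

def r_hat(chains):
    """Gelman-Rubin statistic: near 1 when chains have mixed."""
    m = len(chains)                         # number of chains
    t = len(chains[0])                      # draws per chain
    means = [sum(c) / t for c in chains]
    grand = sum(means) / m
    b = t / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between
    w = sum(sum((x - mu) ** 2 for x in c) / (t - 1)
            for c, mu in zip(chains, means)) / m               # within
    var_plus = (t - 1) / t * w + b / t      # pooled variance estimate
    return (var_plus / w) ** 0.5

# Five well-mixed chains sampling the same N(0,1) target:
chains = [[random.gauss(0, 1) for _ in range(2000)] for _ in range(5)]
print(round(r_hat(chains), 3))   # ≈ 1.0
```

If one chain were stuck in a different region, the between-chain variance b would inflate var_plus and R-hat would rise well above 1, flagging non-convergence.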
26. HMC/Stan vs. other MCMC methods
[Hoffman+14] HMC’s features “allow it to converge to high-dimensional
target distributions much more quickly than simpler methods such as
random walk Metropolis[-Hastings] or Gibbs sampling”
[Kruschke15] “HMC can be more effective than the various samplers in JAGS
and BUGS, especially for large complex models. [...] However, Stan is not
universally faster or better (at this stage in its development).”
In IR,
[Carterette11,15] used Gibbs sampling with JAGS (Just Another Gibbs
Sampler), an open-source implementation of BUGS (Bayesian inference
Using Gibbs Sampling);
[Zhang+16] used Metropolis-Hastings.
27. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
28. Consider the problem of comparing two means
Paired test vs. unpaired (= two-sample) test: in both cases we compare
S1’s scores with S2’s scores.
Classical one-sided test:
H0: S1 = S2 vs. H1: S1 > S2
p-value = P(D+|S1 = S2):
the probability of observing the observed data D
or something more extreme under H0.
29. But what we really want is...
(Same paired/unpaired setting as slide 28.)
With the Bayesian approach, we can easily obtain
P(S1>S2|D) (or P(S1<S2|D) = 1 – P(S1>S2|D)) by
simply counting the number of realisations that satisfy S1>S2
and dividing by T!
30. Statistical models (classical and Bayesian)
Paired test:
• Classical paired t-test: the score differences obey a normal distribution.
• Bayesian: the score pairs obey a bivariate normal distribution
(Stan code from [Toyoda15]).
Unpaired (= two-sample) test:
• S1’s scores obey one normal distribution; S2’s scores obey another.
• Classical test: Welch’s t-test.
• Bayesian: Stan code from [Toyoda15].
31. Effect size (Glass’s Δ) [Okubo+12]
• In the context of classical significance testing, [Sakai14SIGIRforum] stresses
the importance of reporting sample effect sizes and confidence intervals
(CIs).
• Given your system S1 and a well-known baseline S2 (e.g. BM25), consider
the difference between S1 and S2 standardised by an “ordinary”
standard deviation (i.e., that of the baseline): Δ = (μ1 – μ2) / σ2.
• With classical tests, this can simply be estimated from the sample.
• With Bayesian tests, an EAP and a credible interval for Δ can easily be
obtained from the T realisations.
• Unlike Cohen’s d and Hedges’ g, Glass’s Δ is free from the
homoscedasticity assumption.
32. Proposal summary
• Classical paired and unpaired t-tests can easily be replaced by
Bayesian tests.
• Report the following in papers:
- EAP for the difference in population means (the raw difference);
- 95% credible interval for the above difference;
- P(S1>S2|D), i.e., the probability that H: S1>S2 is correct;
- EAP for Glass’s Δ (the standardised difference);
- 95% credible interval for the above Δ.
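As an illustration of this report for a paired comparison, here is a simplified stand-in (mine, not the talk's stan models): with flat priors, the posterior of the mean difference μ is x̄ + (s/√n)·t_{n−1}, which can be sampled directly; the baseline standard deviation for Glass's Δ is plugged in from the sample, whereas the full model would sample it too. The score data below are invented.

```python
import math, random
random.seed(7)

s1 = [0.42, 0.55, 0.48, 0.61, 0.50, 0.58, 0.45, 0.52, 0.57, 0.49]  # system S1
s2 = [0.40, 0.50, 0.47, 0.55, 0.46, 0.52, 0.44, 0.50, 0.51, 0.45]  # baseline S2

n = len(s1)
d = [a - b for a, b in zip(s1, s2)]                  # per-topic differences
dbar = sum(d) / n
sd_d = math.sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))
mean2 = sum(s2) / n
sd_2 = math.sqrt(sum((x - mean2) ** 2 for x in s2) / (n - 1))  # baseline s.d.

# Sample the posterior of mu: dbar + (sd_d / sqrt(n)) * t_{n-1}.
T, nu = 50_000, n - 1
def t_draw():   # t_nu = Z / sqrt(chi2_nu / nu); chi2_nu ~ Gamma(nu/2, 2)
    return random.gauss(0, 1) / math.sqrt(random.gammavariate(nu / 2, 2) / nu)

mu_draws = sorted(dbar + sd_d / math.sqrt(n) * t_draw() for _ in range(T))

eap_mu = sum(mu_draws) / T                                 # EAP, raw difference
ci = (mu_draws[int(0.025 * T)], mu_draws[int(0.975 * T)])  # 95% credible interval
p_h = sum(m > 0 for m in mu_draws) / T                     # P(S1>S2|D)
delta_draws = [m / sd_2 for m in mu_draws]                 # Glass's Delta draws
eap_delta = sum(delta_draws) / T                           # EAP for Delta

print(round(eap_mu, 3), [round(c, 3) for c in ci], p_h, round(eap_delta, 2))
```

All five proposed quantities come out of the same set of posterior draws; no separate significance machinery is needed.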
33. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
34. Purpose of the experiments
For both paired and unpaired tests:
• How does the Bayesian P(S1<S2|D) (i.e., the less likely hypothesis)
differ from the p-value P(D+|S1=S2)?
• How does the Bayesian credible interval differ from the classical CI?
• How does the Bayesian EAP for Glass’s Δ differ from the classical
sample-based Δ?
Will the Bayesian approach turn the IR literature upside down?
35. Data
20 runs = 190 run pairs were compared
(w/o considering the familywise error rate)
36. P(S1<S2|D) vs p-value for paired tests
Bottom line:
p-value can be regarded as a
reasonable approximation
of P(S1<S2|D) which is what we
really want!
More results in paper
37. Credible vs Confidence intervals for paired tests
Bottom line: CI can be
regarded as a reasonable
approximation of
credible interval which
is what we really want!
More results in paper
38. EAP Δ vs sample Δ for paired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke15]
More results in paper
39. EAP Δ vs sample Δ for paired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
40. P(S1<S2|D) vs p-value for unpaired tests
Bottom line:
p-value can be regarded as a
reasonable approximation
of P(S1<S2|D) which is what we
really want!
More results in paper
41. Credible vs Confidence intervals for unpaired tests
Bottom line: CI can be
regarded as a reasonable
approximation of
credible interval which
is what we really want!
More results in paper
42. EAP Δ vs sample Δ for unpaired tests
The classical sample Δ
probably underestimates
small effect sizes.
“the relevant question is asking which
method provides the richest, most
informative, and meaningful results for
any set of data. The answer is always
Bayesian estimation.” [Kruschke15]
same sample Δ from the paired test experiments
43. EAP Δ vs sample Δ for unpaired tests – an anomalous result
• This measure from the TREC Temporal
Summarisation Track [Aslam+16] is like a
nugget-based F-measure over a timeline.
• The actual score distribution of this
measure is [0, 0.4021] rather than [0,1],
with very low standard deviations (and
hence very high effect sizes).
• The measure is not as well-understood as
the others and deserves an investigation.
44. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
45. Install R, rstan and Rtools (for Windows)
1. Install R from https://www.r-project.org/
2. On R, install a package called rstan
3. Install Rtools from https://cran.r-project.org/bin/windows/Rtools/
(check “EDIT the system PATH”)
46. Install my sample scripts
1. Download http://waseda.box.com/SIGIR2017PACK
2. Move modelPaired2.stan and modelUnpaired2.stan to C:/work/R .
3. Move sample score files run1_vs_run2.paired.R and
run1_vs_run2.unpaired.R also to C:/work/R .
47. Try running the scripts (paired test) (1)
Try this (BayesPaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelPaired2.stan" #model written in Stan (see sample file)
par <- c( "mu", "Sigma", "rho", "delta", "glass" ) #mean, variances, correlation, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.paired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.paired.csv" #output file prefix (rstan writes one csv per chain)
data <- list( N=N, x=x ) #sample size and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the stan file and execute sampling (algorithm="HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.paired_[1-5].csv
49. Try running the scripts (paired test) (3)
Switch to a UNIX-like environment and process the output csv files
using the shell script in the pack (its parameters: the start line number
after burn-in, T, and #chains = #csv files). The script reports:
• EAP/credible interval for the difference;
• EAP/credible interval for the Δ using S2’s standard deviation;
• P(S1>S2|D) and P(S1<S2|D);
• EAP/credible interval for the correlation coefficient.
The thresholds for the probabilities can be modified by editing the script.
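The pack's shell script itself is not reproduced on the slide; as a hedged illustration, here is a Python sketch of the same post-processing: skip stan's '#' comment lines, drop the burn-in rows, then count realisations. The column name "delta" follows the par vector in the R script, and the csv content below is synthetic rather than real stan output.

```python
import csv, io, random
random.seed(3)

# A synthetic stand-in for one chain's sample_file csv
# (real files begin with '#' comment lines, then a header row).
rows = ["# stan_version_major = 2", "lp__,delta"]
rows += [f"{-random.random():.3f},{random.gauss(0.02, 0.01):.5f}"
         for _ in range(1100)]
csv_text = "\n".join(rows)

def p_greater_than_zero(text, column, warmup):
    """P(column > 0 | D): drop comments and burn-in, count, divide by T."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    draws = [float(r[column]) for r in reader][warmup:]   # drop burn-in
    return sum(d > 0 for d in draws) / len(draws)

print(round(p_greater_than_zero(csv_text, "delta", warmup=100), 3))
```

In practice you would call the function on each of the five per-chain csv files and pool the draws before counting.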
50. Try running the scripts (unpaired test) (1)
Try this (BayesUnpaired-sample.R) on the R interface:
library(rstan)
scr <- "C:/work/R/modelUnpaired2.stan" #model written in Stan (see sample file)
par <- c( "mu1", "mu2", "sigma1", "sigma2", "delta", "glass" ) #means, s.d.'s, delta, glass's delta
war <- 1000 #burn-in
ite <- 21000 #iteration including burn-in
see <- 1234 #seed
dig <- 3 #significant digits
cha <- 5 #chains (number of trials = (ite-war)*cha)
source( "C:/work/R/run1_vs_run2.unpaired.R" ) #per-topic scores (see sample file)
outfile <- "C:/work/R/run1_vs_run2.NUTS.unpaired.csv" #output file prefix (rstan writes one csv per chain)
data <- list( N1=N1, N2=N2, x1=x1, x2=x2 ) #sample sizes and per-topic scores
fit <- stan( file=scr, model_name=scr, data=data, pars=par, verbose=F, seed=see, algorithm="NUTS", chains=cha, warmup=war, iter=ite, sample_file=outfile )
# compile the stan file and execute sampling (algorithm="HMC" may be used instead of "NUTS")
print( fit, pars=par, digits_summary=dig )
Five output csv files:
C:/work/R/run1_vs_run2.NUTS.unpaired_[1-5].csv
52. Try running the scripts (unpaired test) (3)
Switch to a UNIX-like environment and process the output csv files
using the shell script in the pack (its parameters: the start line number
after burn-in, T, and #chains = #csv files). The script reports:
• EAP/credible interval for the difference;
• EAP/credible interval for the Δ using S2’s standard deviation;
• P(S1>S2|D) and P(S1<S2|D).
The thresholds for the probabilities can be modified by editing the script.
Just as an unpaired t-test p-value is larger than the corresponding paired
t-test p-value, P(S1<S2|D) with the unpaired Bayesian test (0.165) is larger
than that with the paired one (0.006). The unpaired Bayesian test also
overestimates Δ (0.222 vs. 0.189).
53. TALK OUTLINE
1. Limitations of classical significance testing (yet again)
2. Why go Bayesian
3. Bayesian basics
4. Proposals
5. Experiments: Bayesian vs. Frequentist
6. How to go Bayesian, with R and stan
7. Takeaways again
54. Takeaways
• Many statisticians now use Bayesian statistics to discuss P(H|D)
instead of the classical p-values (P(D+|H)). Likewise, the IR community
should not shy away from the Bayesian approach, since it enables us to
easily discuss P(H|D) for virtually any hypothesis H.
• Results in the IR literature which relied on classical significance tests
are not necessarily wrong, since P(H|D) and credible intervals are
actually quite similar to p-values and confidence intervals.
• Starting today, report the right statistics, including the effect sizes,
using Bayesian statistics (perhaps along with classical ones). Simple
tools are available from:
http://waseda.box.com/SIGIR2017PACK
55. Acknowledgements: I thank
• Professor Hideki Toyoda (Waseda University)
For letting me tweak his R code and for his wonderful books on
Bayesian tests (in Japanese).
Professor Toyoda’s original code is available at
http://www.asakura.co.jp/G_27_2.php?id=200
(with comments in Japanese)
• Dr. Matthew Ekstrand-Abueg (Google)
For letting me play with the TREC Temporal Summarisation results!
56. References (1)
[Aslam+16] TREC 2015 Temporal Summarization Track, TREC 2015, 2016.
[Bayes1763] An Essay towards Solving a Problem in the Doctrine of Chances.
Philosophical Transactions of the Royal Society of London, 53, 1763.
[Carterette11] Model-based Inference about IR Systems. ICTIR 2011 (LNCS 6931).
[Carterette15] Bayesian Inference for Information Retrieval Evaluation, ACM ICTIR
2015.
[Cohen1990] Things I Have Learned (So Far). American Psychologist, 45(12), 1990.
[Fisher1970] Statistical Methods for Research Workers (14th Edition). Oliver & Boyd,
1970.
[Kruschke13] Bayesian Estimation Supersedes the t test. Journal of Experimental
Psychology: General, 142(2), 2013.
[Hoffman+14] The No-U-Turn Sampler: Adaptively Setting Path Lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.
57. References (2)
[Kruschke15] Doing Bayesian Data Analysis. Elsevier, 2015.
[Neal11] MCMC using Hamiltonian Dynamics. In: Handbook of Markov Chain
Monte Carlo, Chapman & Hall/CRC, 2011.
[Okubo+12] Psychological Statistics to Tell Your Story: Effect Size, Confidence
Interval, and Power. Keiso Shobo, 2012.
[Sakai14SIGIRforum] Statistical Reform in Information Retrieval? SIGIR Forum 48(1), 2014.
[Toyoda15] Fundamentals of Bayesian Statistics: Practical Getting Started by
Hamiltonian Monte Carlo Method (in Japanese). Asakura Shoten, 2015.
[Wasserstein+16] The ASA’s Statement on P-values: Context, Process, and Purpose.
The American Statistician, 2016.
[Zhang+16] Bayesian Performance Comparison of Text Classifiers. ACM SIGIR 2016.
[Ziliak+08] The Cult of Statistical Significance: How the Standard Error Costs Us Jobs,
Justice, and Lives. The University of Michigan Press, 2008.