Statistical Flukes, the Higgs Discovery, and 5 Sigma
1. 11/5
Deborah G. Mayo
Virginia Tech
(I) “5 sigma observed effect”.
One of the biggest science events of 2012-13 was the
announcement on July 4, 2012 of evidence for the discovery of
a Higgs particle based on a “5 sigma observed effect”.
With the March 2013 data analysis, the 5 sigma difference
grew to 7 sigmas.
2. 11/5
• Because the 5 sigma report refers to frequentist statistical
tests, the discovery was immediately imbued with
controversies from philosophy of statistics.
• I’m an outsider to high energy physics (HEP), but, aside from
finding it fascinating, I think any philosopher of statistics worth her
salt should be able to illuminate some of the more public
controversies, e.g., over P-values.
Not difficult to do, fortunately.
3. 11/5
(II) Bad Science? (O’Hagan, prompted by Lindley)
To the ISBA: “Dear Bayesians: We’ve heard a lot about the
Higgs boson. ...Specifically, the news referred to a
confidence interval with 5-sigma limits.… Five standard
deviations, assuming normality, means a p-value of around
0.0000005…
Why such an extreme evidence requirement? We know from
a Bayesian perspective that this only makes sense if (a) the
existence of the Higgs boson has extremely small prior
probability and/or (b) the consequences of erroneously
announcing its discovery are dire in the extreme. …
…. Are the particle physics community completely wedded
to frequentist analysis? If so, has anyone tried to explain
what bad science that is?”
4. 11/5
Not bad science at all!
• HEP physicists are sophisticated with their statistical
methodology: they’d seen too many bumps disappear.
• They want to ensure that before announcing the
hypothesis H*: “a new particle has been discovered”
that:
H* has been given a severe run for its money.
Significance tests and cognate methods (confidence
intervals) are methods of choice here for good reason
5. 11/5
(III) Simple statistical significance test: ingredients
(i) Null or test hypothesis: in terms of an unknown parameter
μ in a statistical model, an idealized representation of
underlying data generation: a model of the detector
μ is the “global signal strength” parameter
H0: μ = 0, i.e., zero signal (background-only hypothesis)
H0: μ = 0 vs. H1: μ > 0
μ = 1: Standard Model (SM) Higgs boson signal in addition to
the background
6. 11/5
Empirical data are modeled as observed values of a sample X
(random variable); here numbers of events of a type.
(ii). Test statistic or distance statistic: d(X)—the larger its
value the more inconsistent the data are with Η0 in the direction
of alternatives or discrepancies of interest.
d(X): how many excess events of a given type are observed
(from trillions of collisions) in comparison to what would be
expected from background alone (in the form of bumps).
d(X) has a known probability distribution under Η0 (and under
various alternatives).
7. 11/5
(iii). The P-value (or significance level) associated with d(x0)
is the probability of a difference as large as or larger than d(x0),
under the assumption that H0 is true:
P-value = Pr(d(X) > d(x0); H0)
If the P-value is sufficiently small (e.g., .05, .01, .001),
d(x0) is said to be statistically significant (or significant at the
level reached)
d(X) can be given in terms of standard deviation units, or
sigma units
8. 11/5
The distribution of statistic d(X) is the sampling distribution
Pr(d(X) > 1; H0) = .16
Pr(d(X) > 2; H0) = .02
Pr(d(X) > 3; H0) =.001
Pr(d(X) > 4; H0) = .00003
Pr(d(X) > 5; H0)= .0000003
The probability of observing results as extreme as or more extreme than 5
sigma, under H0, is approximately 1 in 3,500,000.
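The tail probabilities above can be checked with a short sketch. It assumes d(X) is standard normal under H0 and uses the one-sided upper tail, which is what the listed values correspond to; the function name is illustrative only.

```python
import math

def one_sided_p_value(n_sigma):
    """Upper-tail probability Pr(Z > n_sigma) for a standard normal Z."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

# Print the tail probability for 1 to 5 sigma and its "1 in N" form.
for n_sigma in range(1, 6):
    p = one_sided_p_value(n_sigma)
    print(f"{n_sigma} sigma: p = {p:.2e}, about 1 in {1.0 / p:,.0f}")
```

A two-sided convention would double each value, which is why some reports quote a larger number for 5 sigma.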
10. 11/5
The actual computations are based on simulating what it would
be like were H0: μ = 0 (signal strength = 0) true, fortified with much
cross-checking of results.
So the significance test has:
1) Data x0 and hypotheses H0: μ = 0 vs. H1: μ > 0
2) A (distance) test statistic d(X)
3) Probability distribution of d(X) under the null and various
alternatives
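A toy sketch of the "simulate under H0" idea, not the actual HEP analysis: it assumes a single counting experiment whose background-only event count is Poisson with a known mean. The numbers (4 expected background events, 10 observed) and all function names are hypothetical.

```python
import math
import random

def exact_poisson_tail(n_obs, background_mean):
    """Exact Pr(N >= n_obs; H0) for N ~ Poisson(background_mean)."""
    below = sum(
        math.exp(-background_mean) * background_mean**k / math.factorial(k)
        for k in range(n_obs)
    )
    return 1.0 - below

def sample_poisson(rng, mean):
    """One Poisson draw via Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulated_p_value(n_obs, background_mean, n_trials=100_000, seed=0):
    """Fraction of simulated background-only experiments with N >= n_obs."""
    rng = random.Random(seed)
    hits = sum(
        sample_poisson(rng, background_mean) >= n_obs for _ in range(n_trials)
    )
    return hits / n_trials

# Hypothetical toy numbers: 4 expected background events, 10 observed.
print(exact_poisson_tail(10, 4.0))
print(simulated_p_value(10, 4.0))
```

The simulated fraction approximates the exact tail; a 5 sigma tail is far too small to estimate by brute force like this, which is one reason real analyses rely on more elaborate methods.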
11. 11/5
There’s generally a rule of interpretation:
• if d(X) > 5 sigma, infer discovery
• if d(X) > 2 sigma, get more data
We want methods with high capability to detect discrepancies
while avoiding mistaking spurious bumps as real.
12. 11/5
• First stage: test for a real effect
(Cox’s taxonomy: searching for structure)
Not a point against point test!
Cousins: H0 is Standard Model (SM) missing a piece
• Second stage: determine its properties, test SM vs “Beyond
SM” (BSM)
(Cox: embedded)
13. 11/5
(IV) The P-Value Police
When the July 2012 report came out, a number of people set
out to grade the different interpretations of the P-value report:
Larry Wasserman (“Normal Deviate” on his blog) called them
the “P-Value Police”.
• Job: to examine if reports by journalists and scientists could
by any stretch of the imagination be seen to have
misinterpreted the sigma levels as posterior probability
assignments to the various models and claims.
David Spiegelhalter: a well-known (Bayesian) statistician who
specializes in risk communication.
14. 11/5
Thumbs up or down
Thumbs up, to the ATLAS group report:
“A statistical combination of these channels and
others puts the significance of the signal at 5
sigma, meaning that only one experiment in
three million would see an apparent signal this
strong in a universe without a Higgs.”
Thumbs down to reports such as:
“There is less than a one in 3.5 million chance that their
results are a statistical fluke.”
Critics (Spiegelhalter) allege they are misinterpreting the P-value
as a posterior probability on H0.
15. 11/5
Not so.
H0 does not say the observed results are due to background
alone, or are flukes. It says only:
H0: μ = 0
Although if H0 were true it follows that various results would
occur with specified probabilities.
(In particular, it entails that large bumps are improbable.)
16. 11/5
In fact it is an ordinary error probability.
Since it’s not just a single result, but a dynamic test procedure,
we can write it:
(1) Pr(Test T produces d(X) > 5; H0) ≤ .0000003
Note: (1) is not a conditional probability, which would involve a prior Pr(H0):
Pr(Test T produces d(X) > 5 and H0) / Pr(H0)
17. 11/5
(V) Detaching inference(s) from the evidence
True, the inference actually detached goes beyond a P-value
report. Infer:
(2) There is strong evidence for
(first) a genuine discrepancy from H0
(later) H*: a Higgs (or a Higgs-like) particle.
Gradations: indication, evidence, discovery (up to July 4, 2012)
Inferring (2) relies on an implicit principle of evidence.
18. 11/5
Test Principle #1: (statistical significance) Data provide
evidence for a genuine discrepancy from H0 (just) to the
extent that H0 would (very probably) have survived, were
H0 a reasonably adequate description of the process
generating the data.
(1)’ Pr(Test T produces d(X) < 5; H0) > .9999997
• With probability .9999997, the bumps would be smaller,
would behave like flukes, disappear with more data, not be
produced at both CMS and ATLAS, in a world given by H0.
• They didn’t disappear, they grew
(2) So, H*: a Higgs (or a Higgs-like) particle.
19. 11/5
Following the rule: Interpret 5 sigma bumps as a real effect (a
discrepancy from 0), you’d erroneously interpret data with
probability less than .0000003
An error probability
The warrant isn’t low long-run error (in a case like this) but
detaching an inference based on “strong argument from
coincidence”.
Qualifying claims by how well they have been probed
(precision, accuracy).
20. 11/5
Second Stage
Once the null is rejected, the job shifts to testing if various
parameters agree with the SM predictions.
Now the corresponding null hypothesis is the SM Higgs boson.
The null hypothesis at the second stage:
H0^[2]: SM Higgs boson: μ = 1
and discrepancies from it are probed, estimated with
confidence intervals
(Cousins)
21. 11/5
This takes us to the most important role served by statistical
significance tests (requiring a 5 sigma excess for discovery):
they afford a standard for:
• (a) denying sufficient evidence of a new particle, inferring
“not a genuine effect”, and
• (b) ruling out values of various parameters, e.g., mass
ranges.
22. 11/5
(VI) Positive and Negative test results of the analysis
Positive (very low P-value): infer genuine effects
Negative (moderate P-value): deny real effects (infer flukes);
deny that excesses indicate BSM.
• At both stages, they were engaged in exploration for BSM
physics (beyond the standard model)
• It combined testing, estimating, exploring.
23. 11/5
NYT: “Chasing the Higgs” [Dennis Overbye interviews
spokespeople Gianotti (ATLAS) and Tonelli (CMS).]
• Once a month they got bumps that were random flukes
“So ‘we crosscheck everything’ and ‘try to kill’ any
anomaly that might be merely random.”
They were convinced they had found evidence of extra
dimensions of space time “and then the signal faded like an
old tied balloon.”
24. 11/5
• “We’ve made many discoveries,” Dr. Tonelli said,
“most of them false.”
• “Ninety-nine percent of the time, that is just
what happens.”
What’s the difference between HEP physics and social
psychology (and other big data screening) where “most
results in most fields are false”, or so we keep hearing?
HEP physicists don’t publish on the basis of a single “nominal”
(or “local”) P-value.
25. 11/5
Look Elsewhere Effect (LEE)
A nominal (or local) P-value: the P-value at a particular, data-determined,
mass.
But the probability of so impressive a difference anywhere in a
mass range would be greater than the local one.
I take it that requiring a smaller P-value (i.e., a bigger
difference), at least 5 sigma, is akin to adjusting for multiple
trials, or the look elsewhere effect (LEE).
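A rough sketch of why the local P-value understates the chance of a bump somewhere in the mass range. It assumes the range splits into n_windows independent search windows, a simplification introduced here for illustration (real analyses estimate the effective number of trials differently); the function name is hypothetical.

```python
import math

def global_p_value(local_p, n_windows):
    """Pr(at least one of n_windows independent windows reaches the local level; H0)."""
    # Equals 1 - (1 - local_p) ** n_windows, computed stably for tiny local_p.
    return -math.expm1(n_windows * math.log1p(-local_p))

# A local 3 sigma bump (p about 0.00135) searched for in 100 windows.
print(global_p_value(0.00135, 100))
# A local 5 sigma bump (p about 2.87e-7) in the same 100 windows.
print(global_p_value(2.87e-7, 100))
```

Under these assumptions a local 3 sigma excess is unremarkable once the whole range is considered, while a local 5 sigma excess remains very improbable under H0.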
26. 11/5
“Game of Bump-Hunting” (Overbye)
“One bump on physicists’ charts…was disappearing. But
another was blooming like the shy girl at a dance. …. nobody
could remember exactly when she had come in. But she was
the one who would marry the prince.”
“It continued to grow over the fall until it had reached the 3-
sigma level — the chances of being a fluke [spurious
significance] were less than 1 in 740, enough for physicists to
admit it to the realm of “evidence” of something, but not yet a
discovery.”
27. 11/5
Background knowledge of how flukes behave:
• “If they were flukes, more data would make them fade into
the statistical background,
• If not, the bumps would grow in slow motion into a bona
fide discovery.”
• They give the bump a hard time, look at multiple decay
channels, and don’t tell the details of where they found her
to the other team.
• When two independent experiments find the same particle
signal at the same mass, it overcomes the multiple testing
and gives a strong argument.
28. 11/5
(VII) Possible Anomalies for SM
They also follow up bumps indicating discrepancies with
H0^[2]: SM Higgs boson: μ = 1
Hints of anomalies with the “plain vanilla” particle of the
Standard Model
(viewed as tests or corresponding interval estimates)
Even a year later they examined these anomalies with more
data.
29. 11/5
Curb your enthusiasm
Matt Strassler: “The excess (in favor of BSM properties)
became a bit smaller each time…. That’s an unfortunate sign,
if one is hoping the excess isn’t just a statistical fluke.”
Or they’d see the bump at ATLAS… and not CMS
“Taking all of the data, and not cherry picking…there’s
nothing here that you can call “evidence” for the much sought
BSM.” (Strassler)
Considering the frequent flukes, and the hot competition
between ATLAS and CMS to be first, a tool for when to
“curb their enthusiasm” seems exactly what was wanted.
30. 11/5
So, this “negative” portion involves:
(a) denying BSM anomalies are real
(b) setting upper bounds for these discrepancies with the SM Higgs
Each with its own test statistic and evidence g(x0)
H0^[2]: SM Higgs boson: μ = 1
Failing to reject the null isn’t evidence for it, but they could set
upper bounds.
31. 11/5
Test Principle #2 (for non-significance): Data provide
evidence to rule out a discrepancy δ∗ to the extent that a
larger g(x0) would very probably have resulted if δ were as
great as δ∗
Detach δ < δ∗
(could equivalently be viewed as inferring a confidence
interval estimate δ = g(x0) + ε)
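Test Principle #2 can be made concrete with a minimal sketch. It assumes, purely for illustration, that g(X) is normally distributed with mean δ and unit standard deviation (i.e., measured in sigma units); the slides do not specify the distribution, and the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def severity_for_upper_bound(g_obs, delta_star):
    """Pr(g(X) > g_obs; delta = delta_star), assuming g(X) ~ Normal(delta, 1)."""
    return 1.0 - normal_cdf(g_obs - delta_star)

# How well does an observed g(x0) = 1.0 rule out various discrepancies?
for delta_star in (1.0, 2.0, 3.0, 4.0):
    print(delta_star, severity_for_upper_bound(1.0, delta_star))
```

The larger δ∗ is relative to the observed g(x0), the more probable a larger result would have been, and so the better warranted the claim δ < δ∗.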
So these tools seem just the thing for this research
32. 11/5
(VIII) Conclusion
O’Hagan published a digest of responses a few days later:
• “They surely would be willing to announce SM Higgs
discovery if they were 99.99% certain of the existence of
the SM Higgs” (and avoid the ad hoc 5 sigma)
Pr(SM Higgs) = .9999
• It would require assigning a prior probability to the “SM Higgs”
claim, and a prior distribution on the numerous “nuisance”
parameters of the background and the signal.
• Multivariate priors, correlations between parameters, joint
priors, and the catchall: P(data|not H*)
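To see how much such a posterior leans on these ingredients, here is a minimal sketch of Bayes' theorem for H* versus its catchall. All input numbers are hypothetical placeholders, not values from any analysis, and nuisance parameters are ignored entirely.

```python
def posterior_probability(prior_h, likelihood_h, likelihood_catchall):
    """Pr(H* | data) from a prior on H* and the two likelihoods."""
    numerator = prior_h * likelihood_h
    return numerator / (numerator + (1.0 - prior_h) * likelihood_catchall)

# Same data likelihood under H*, different assumed priors and catchall likelihoods.
for prior_h in (0.5, 0.01):
    for likelihood_catchall in (1e-2, 1e-4):
        post = posterior_probability(prior_h, 1.0, likelihood_catchall)
        print(prior_h, likelihood_catchall, post)
```

The output moves substantially with the assumed prior and with P(data|not H*), the quantity that is hardest to specify.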
33. 11/5
• Even if all that were done and agreed upon, it would not
have given the kind of tools needed to find things out
Worse: spiked priors Pr(No SM Higgs)= Pr(SM Higgs)=.5
(not uninformative)
• Physicists believed in the SM Higgs before building the big
collider, given the perfect predictive success of the SM and its
simplicity; that is very different from having evidence for a
discovery.
• Others may believe (and fervently wish) that it will break
down somewhere.
34. 11/5
P-value police: Those who think we want a posterior
probability in H* might be sliding from what may be inferred
from this legitimate high probability:
Pr(test T would not reach 5 sigma; H0) > .9999997
With probability .9999997, our methods would show
that the bumps disappear, under the assumption data
are due to background H0.
They don’t disappear but grow.
Infer H*
Qualified by the test properties
35. 11/5
What’s passed with high severity?
H*: a Higgs boson consistent with the SM (at the levels
of precision and accuracy of these experiments)
An adequate account should also always report alternatives
that have not been well ruled out
• measurements not precise enough to rule out discrepancies
from a SM Higgs as large as 10%, 20%, 50%.
• There are rivals to the SM that would not have been
distinguishable with the given data (which went through a
lot of filtering, and triggering rules).
They will get more data in 2015, and there’s talk of a more
precise detector being built.
36. 11/5
REFERENCES (Online links)
• ATLAS report: http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf
• ATLAS Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/AtlasPublic/HiggsPublicResults
• CMS Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/CMSPublic/PhysicsResultsHIG
• Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
37. 11/5
• Cousins, R. (2014). “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” http://arxiv.org/abs/1310.3791
• O’Hagan letter:
  ◦ Original letter with responses: http://bayesian.org/forums/news/3648
  ◦ 1st link in a group of discussions of the letter: http://errorstatistics.com/2012/07/11/is-particle-physics-bad-science/
• Overbye, D. (March 15, 2013) “Chasing the Higgs,” New York Times: http://www.nytimes.com/2013/03/05/science/chasing-the-higgs-boson-how-2-teams-of-rivals-at-CERN-searched-for-physics-most-elusive-particle.html?pagewanted=all&_r=0
38. 11/5
• Spiegelhalter, D. (August 7, 2012) blog, Understanding Uncertainty, “Explaining 5 sigma for the Higgs: how well did they do?” http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do
• Strassler, M. (July 2, 2013) blog, Of Particular Significance, “A Second Higgs Particle”: http://profmattstrassler.com/2013/07/02/a-second-higgs-particle/
• Wasserman, L. (July 11, 2012) blog, Normal Deviate, “The Higgs Boson and the P-Value Police”: http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/