A popular view of probability forecasting is that its aim is to maximize the sharpness of predictive distributions subject to their calibration (Gneiting et al., 2003+). Informally, calibration is the agreement between the predictive distributions and the observations, and its most popular formalization is calibration in probability. Sharpness refers to the concentration of the predictive distributions and does not depend on the observations. In this talk I will focus on conformal prediction, which is a method for producing provably calibrated predictive distributions, in the sense of calibration in probability and under the assumption, standard in machine learning, that the observations are produced independently from the same distribution (the IID assumption). While calibration is automatic under the IID assumption, achieving sharpness requires careful design of conformal predictors. My plan is to state asymptotic and small-sample results about their sharpness, with and without the IID assumption.
MUMS: Bayesian, Fiducial, and Frequentist Conference - Calibration of Probability Forecasts, Vladimir Vovk, April 29, 2019
1. Parametric fiducial prediction
Dempster–Hill procedure
Conformal predictive distributions
Nonparametric fiducial prediction
Vladimir Vovk
(based on joint work with many people)
Royal Holloway, University of London
BFF 2019, Duke University, 29 April 2019
Vladimir Vovk Nonparametric fiducial prediction 1
This talk
This talk is about one of the Fs in BFF (fiducial). (At least this
is my interpretation.)
It is mostly about fiducial prediction.
Moreover: it is mostly about nonparametric fiducial
prediction.
I will start from history.
My plan
1 Parametric fiducial prediction
Fisher’s fiducial prediction
Terminology
Validity of fiducial prediction
2 Dempster–Hill procedure
3 Conformal predictive distributions
Fisher’s publications
It appears that Fisher has only two publications that discuss
fiducial prediction:
R. A. Fisher.
The fiducial argument in statistical inference.
Annals of Eugenics, 1935.
R. A. Fisher.
Statistical Methods and Scientific Inference.
1956 (3rd edition: 1973).
Fisher’s 1935 example
His example from the 1935 paper: the Gaussian IID model.
After observing y1, . . . , yn (past data), compute

ȳ := (1/n) ∑_{i=1}^n yi,   s² := (1/(n − 1)) ∑_{i=1}^n (yi − ȳ)².

Then

t := √(n/(n + 1)) · (y − ȳ)/s,

where y is a future observation, has Student's t-distribution
with n − 1 degrees of freedom (is a pivot).
General scheme (1)
I will only talk about predicting one future scalar
observation (uncontroversial part).
General scheme (at least in Fisher’s work): we combine
the past observations Ypast and future observation Y to
obtain a pivot U:
U := Q(Ypast, Y).
The distribution of U is independent of the parameter θ.
Without loss of generality we can assume that U is
uniformly distributed in [0, 1] (at least when it is
continuous): if not, replace U by FU(U), where FU is U’s
distribution function.
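The replacement of U by FU(U) is the probability integral transform. A quick numerical sanity check (my own illustration, not from the slides), with an exponential pivot standing in for U:

```python
import numpy as np

# Probability integral transform: if a pivot U is continuous with
# distribution function F_U, then F_U(U) is uniform on [0, 1].
# Sanity check with an exponential pivot, where F_U(u) = 1 - exp(-u).
rng = np.random.default_rng(0)
u = rng.exponential(size=100_000)
pit = 1.0 - np.exp(-u)

# The transformed sample should look uniform on [0, 1]:
# mean close to 1/2 and variance close to 1/12.
print(pit.mean(), pit.var())
```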
General scheme (2)
If y ∈ R and Q(ypast, y) is increasing in y, then y ↦ Q(ypast, y)
is a distribution function.
It is the fiducial (predictive) distribution.
In our example: the pivot

t = √(n/(n + 1)) · (y − ȳ)/s

becomes the fiducial (predictive) distribution (function)

Q(ypast, y) := F_{t_{n−1}}( √(n/(n + 1)) · (y − ȳ)/s ),

F_{t_{n−1}} being the distribution function of Student's
t-distribution with n − 1 degrees of freedom.
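As an illustration (my own code, not on the slides; `fisher_fiducial_cdf` is a made-up name), this distribution function is easy to compute with scipy:

```python
import numpy as np
from scipy import stats

def fisher_fiducial_cdf(past, y):
    """Fiducial predictive distribution function for a future
    observation under the Gaussian IID model (Fisher, 1935)."""
    past = np.asarray(past, dtype=float)
    n = len(past)
    ybar = past.mean()
    s = past.std(ddof=1)                       # the 1/(n-1) variance
    t = np.sqrt(n / (n + 1)) * (y - ybar) / s  # Student's t pivot
    return stats.t.cdf(t, df=n - 1)

# The fiducial CDF evaluated at the sample mean is 1/2.
past = [1.2, 0.7, 1.9, 1.1, 0.8]
print(fisher_fiducial_cdf(past, np.mean(past)))  # 0.5
```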
General scheme (3)
Summary:
fiducial predictive distribution = uniform pivot
(provided the pivot is a distribution function).
Notice: no explicit “fiducial inversion” is needed in this
exposition (unless we want prediction regions).
Fisher emphasized both continuity (in many publications)
and monotonicity (in 1962 in letters to Barnard and Sprott,
for parametric fiducial inference).
In Fisher’s own words (in a non-predictive context)
His letter to Barnard in March (?) 1962:
A pivotal quantity is a function of parameters and statistics, the
distribution of which is independent of all parameters. To be of
any use in deducing probability statements about parameters,
let me add
(a) it involves only one parameter,
(b) the statistics involved are jointly exhaustive for that
parameter,
(c) it varies monotonically with that parameter.
In his publications Fisher also mentions continuity (“the
observations should not be discontinuous” in the 1956 book).
Unpleasant possibility
There is no guarantee that Q(ypast, −∞) = 0 and
Q(ypast, ∞) = 1 for all ypast.
For example, Q(ypast, −∞) > 0 means that there is a
positive mass at −∞.
We might want to disallow this.
Fiducial prediction: existing terminology
fiducial distribution for a future observation (Fisher, 1935)
fiducial distribution (Hora, Buehler, McCullagh)
fiducial predictive distribution (Dawid, Wang)
predictive fiducial distribution (Hannig, Iyer, Wang)
marginal association (Martin and Liu)
predictive confidence distribution (Schweder, Hjort)
predictive distribution (Xie, Liu, Shen)
People often avoid “fiducial” to stay away from controversy.
The terminology of this talk
Now I prefer to talk about predictive distributions without
imposing any conditions of validity a priori.
Fisher’s probabilistic calibration Q(Ypast, Y) ∼ U may be
the most fundamental notion of validity, but we may also
want to have Q(Ypast, Y) ∼ U given a σ-algebra F.
If F is generated by Ypast, it’s the ideal situation, but only
Bayesians can have it.
There are lots of alternative definitions (“marginal
calibration”, “F-ideal calibration”,. . . ).
Validity and efficiency of fiducial prediction
Since Q(Ypast, Y) is a pivot, fiducial prediction is calibrated
in probability by definition.
The main problem is its efficiency (or sharpness). Fisher
insisted that fiducial inference should be based on
exhaustive statistics, leading to uniqueness. This part of
his programme failed.
Gneiting et al.’s paradigm:
Probabilistic forecasting has the general goal of
maximizing the sharpness of predictive distributions,
subject to calibration.
An example of conditional probabilistic calibration (1)
Peter McCullagh: in many cases (such as linear regression),
fiducial prediction is probabilistically calibrated conditionally on
a non-trivial σ-algebra.
Peter McCullagh.
Fiducial prediction.
2004,
http://www.stat.uchicago.edu/~pmcc/reports/fiducial.pdf
McCullagh, Vovk, Nouretdinov, Devetyarov, Gammerman.
Conditional prediction intervals for linear regression.
ICMLA 2009.
An example of conditional probabilistic calibration (2)
Our model is yi = βᵀxi + σξi, where the xi are fixed vectors and
the ξi are IID with a known distribution P (which does not have
to be Gaussian).
Let F be the σ-algebra of events invariant under the
transformations (y1, y2, . . . ) ↦ (aᵀx1 + by1, aᵀx2 + by2, . . . ).
Then the fiducial predictive distribution is probabilistically
calibrated given F.
Nonparametric fiducial prediction in Fisher’s work
It might not even exist (Teddy Seidenfeld, personal
communication at BFF 2017).
But lots of authors believe that it does (Dempster 1963,
Lane and Sudderth 1984, Hill 1992, Coolen 1998).
In his 1992 paper “Bayesian nonparametric prediction and
statistical inference”, Bruce M. Hill writes:
Note that for all three of these authors [Student, Fisher,
Dempster] the justification for An seems to be purely
intuitive. Thus none give anything vaguely representing a
“proof” for An. . . .
In my talk at BFF4 I referred to Fisher–Dempster–Hill
(which became “Dempster–Hill” in the published paper).
Nonparametric fiducial inference in Fisher’s work
But Fisher definitely introduced nonparametric fiducial
inference for parameters.
Fisher traced the idea back to Student (in “Student”, 1939
paper).
In the case of two observations y1 and y2 from N(µ, 1), the
probability that µ < y(1) is 1/4, the probability that
µ ∈ (y(1), y(2)) is 1/2, and the probability that µ > y(2) is 1/4.
Fisher extended this to an arbitrary sample size n and to
the pth quantile µp (dropping the Gaussian assumption).
Predicting the third observation
Nonparametric fiducial prediction was started by Jeffreys
(1932), predicting a third observation after seeing two.
Fiducial derivation: Seidenfeld, 1995.
Fisher did not accept Jeffreys’s argument as fiducial;
probably because it’s blatantly discontinuous.
Full Dempster–Hill procedure
Dempster (via the fiducial argument) and then Hill (followed by
Coolen's NPI) extended it to n observations.
Hill’s definition of An: given the data y1, . . . , yn, the
probability that the next observation y falls in (y(i), y(i+1)) is
1/(n + 1), for each i = 0, . . . , n. By definition, y(0) = −∞,
and y(n+1) = ∞.
To make it into a fiducial predictive distribution, we need a
pivot.
To get a continuous pivot, let’s randomize: for τ ∼ U,
Q(ypast, y) := ( |{i : yi < y}| + τ + τ·|{i : yi = y}| ) / (n + 1)

(the last addend takes care of possible ties).
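A minimal sketch of the randomized Dempster–Hill pivot (my own code; `dempster_hill_pivot` is an illustrative name):

```python
import numpy as np

def dempster_hill_pivot(y_past, y, tau):
    """Randomized Dempster-Hill pivot: with tau ~ U[0, 1], this
    quantity is uniformly distributed on [0, 1] under the IID model."""
    y_past = np.asarray(y_past, dtype=float)
    n = len(y_past)
    below = np.sum(y_past < y)   # observations strictly below y
    ties = np.sum(y_past == y)   # possible ties with y
    return (below + tau + tau * ties) / (n + 1)

# With tau = 0.5 and no ties, a y between the 2nd and 3rd order
# statistics of 4 observations gets pivot (2 + 0.5) / 5 = 0.5.
print(dempster_hill_pivot([3.0, 1.0, 4.0, 2.0], 2.5, 0.5))  # 0.5
```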
What the predictive distribution may look like
[Figure: a plot of the predictive distribution function Q(y) against y.]
Our setting
Limitation of the Dempster–Hill procedure (constantly
criticized on this account): it does not cover regression or
classification; we need observations (x, y), not just y.
Our statistical model (for now): the observations are IID;
standard in machine learning. Each observation: pair
(x, y) (an object and its label).
There is a natural conformal pivot (uniform in [0, 1]).
Conformity measure
A conformity measure is a function A mapping
observations (z1, . . . , zl) to conformity scores that is
equivariant: for any l and any permutation π of {1, . . . , l},
A(z1, . . . , zl) = (α1, . . . , αl) =⇒ A(zπ(1), . . . , zπ(l)) = (απ(1), . . . , απ(l)).
Intuitively, αi measures how well zi conforms to the other
observations among z1, . . . , zl.
Usually A is built on top of some “base” algorithm. A
simple example:
αi := yi − ŷi.
Conformal pivot
The conformal pivot determined by a conformity measure A is
Q(y) := ( |{i : α_i^y < α^y}| + τ + τ·|{i : α_i^y = α^y}| ) / (n + 1),

where i ranges over 1, . . . , n and

(α_1^y, . . . , α_n^y, α^y) := A(z1, . . . , zn, (x, y)).
The implementation is not as easy as it looks!
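A brute-force sketch of the conformal pivot (my own code, not from the slides; the conformity measure, residuals from the overall label mean, is a toy assumption chosen so that Q is increasing in y):

```python
import numpy as np

def mean_residual_conformity(z):
    """Toy conformity measure (illustrative, not from the slides):
    alpha_i = y_i - ybar, the residual from the overall label mean.
    It is equivariant under permutations of the observations."""
    ys = np.array([y for (_, y) in z])
    return ys - ys.mean()

def conformal_pivot(z_train, x, y, tau, conformity=mean_residual_conformity):
    """Full conformal pivot Q(y): augment the training data with the
    test pair (x, y), recompute all conformity scores, and rank the
    test score among them (randomized tie-breaking via tau)."""
    scores = conformity(z_train + [(x, y)])
    alpha_y, others = scores[-1], scores[:-1]
    less = np.sum(others < alpha_y)
    ties = np.sum(others == alpha_y)
    return (less + tau + tau * ties) / (len(others) + 1)

# Q(y) is increasing in y for this conformity measure:
z_train = [(0, 1.0), (0, 2.0), (0, 3.0)]
print(conformal_pivot(z_train, 0, 0.0, 0.5))  # 0.125
print(conformal_pivot(z_train, 0, 4.0, 0.5))  # 0.875
```

Note that evaluating the full distribution function requires re-running A for every candidate value of y, which is one reason efficient implementations are non-trivial.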
Some literature
Vladimir Vovk, Alex Gammerman, and Glenn Shafer.
Algorithmic Learning in a Random World.
Springer, New York, 2005.
Glenn Shafer and Vladimir Vovk.
A tutorial on conformal prediction.
Journal of Machine Learning Research 2008.
Most of the papers mentioned in this talk where I am a
co-author are available at http://alrw.net.
Least Squares Predictive Machine (LSPM)
The main struggle is for the monotonicity in y. Even for
αi := yi − ŷi, it's not so obvious!
If ŷi is the LS estimate based on all of z1, . . . , zl (“full
residual”), monotonicity can be violated.
If ŷi is the LS estimate based on z1, . . . , zl with zi removed
(“deleted residual”), monotonicity can be violated.
But in an intermediate situation (“studentized residual”),
monotonicity always holds.
We can “kernelize” LSPM to cover non-linear situations.
Efficiency results for LSPM
Fisher’s ideas (using exhaustive statistics leading to
uniqueness) do not seem to work.
A more promising (albeit more restrictive) idea (Burnaev,
Wasserman):
assume a statistical model under which the base algorithm
works perfectly
show that the corresponding conformal predictive
distributions also work well (so that the guaranteed validity
does not cost much).
For LSPM, the difference between the true distribution
function and the predicted one is O(n^(−1/2)), with precise
weak convergence results.
Asymptotic efficiency
Conformal predictive distributions are universally
consistent, for a suitable conformity measure.
They can be built on top of many classical universally
consistent algorithms, such as nearest neighbours.
Exact statements and proofs
Vladimir Vovk, Jieli Shen, Valery Manokhin, and Min-ge
Xie.
Nonparametric predictive distributions based on conformal
prediction.
Machine Learning, 2019.
Vladimir Vovk, Ilia Nouretdinov, Valery Manokhin, and Alex
Gammerman.
Conformal predictive distributions with kernels.
In: Braverman’s Readings in Machine Learning, Lecture
Notes in Artificial Intelligence, 2018.
http://alrw.net, Working Paper 18.
Split-conformal pivot (1)
Remember that full conformal predictive distributions may be
difficult to compute (this depends on the conformity measure).
Let us divide the training set z1, . . . , zn into two parts:
the training set proper, z1, . . . , zm, of size m,
and the calibration set, zm+1, . . . , zn, of size n − m.
Split-conformal pivot (2)
The split-conformal pivot for a test object x is
Q(y) := ( |{i : αi < α}| + τ + τ·|{i : αi = α}| ) / (n − m + 1),

where i ranges over m + 1, . . . , n,

αi := A(zi; z1, . . . , zm),   α := A((x, y); z1, . . . , zm),

and there are no restrictions on the split-conformity measure A.
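A sketch of the split-conformal predictive distribution (my own code; the split-conformity measure, a residual from the mean label of the training set proper, is an illustrative assumption):

```python
import numpy as np

def split_conformal_cdf(train_proper, calibration, x, y_grid, tau, A):
    """Split-conformal predictive distribution: conformity scores of
    the calibration set are computed once, using only the training
    set proper, then the test score is ranked among them."""
    cal_scores = np.array([A(z, train_proper) for z in calibration])
    k = len(cal_scores)
    out = []
    for y in y_grid:
        a = A((x, y), train_proper)
        less = np.sum(cal_scores < a)
        ties = np.sum(cal_scores == a)
        out.append((less + tau + tau * ties) / (k + 1))
    return np.array(out)

def residual_score(z, train_proper):
    """Illustrative split-conformity measure: the residual of the
    label from the mean label of the training set proper."""
    (_, y) = z
    ybar = np.mean([yy for (_, yy) in train_proper])
    return y - ybar

train_proper = [(0, 1.0), (0, 3.0)]
calibration = [(0, 1.5), (0, 2.5)]
# Increasing from 1/6 through 1/2 to 5/6:
print(split_conformal_cdf(train_proper, calibration, 0,
                          [0.0, 2.0, 4.0], 0.5, residual_score))
```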
Discussion
Split-conformal predictive distributions are computationally
efficient but may lose predictive efficiency as compared
with full conformal predictive distributions, which use the
full training set as both training set proper and calibration
set.
Way out: divide the training set into a number of folds (as
in cross-validation) and use each fold in turn as calibration
set.
The resulting cross-conformal predictive distribution loses
guaranteed validity but is well-calibrated in practice (unless
the base algorithm is wildly randomized).
Added flexibility
Problem with the conformity measure αi := yi − ˆyi: it
implicitly assumes homoscedasticity.
We can use a more flexible base algorithm, but then face
difficult calculations for full conformal predictive
distributions.
Or we can use the split-conformal method, which is trivial;
just make sure to make A((x, y); z1, . . . , zm) an increasing
function of y.
Conformalizing predictive distributions
In particular, we can take
A((x, y); z1, . . . , zm) := F(y),
where F is a standard predictive distribution function for
the label of x computed from z1, . . . , zm as training set,
such as Nadaraya–Watson.
The resulting predictive distributions are probabilistically
calibrated under the IID assumption.
Therefore, “conformalizing” is a way of calibrating
predictive distributions.
A natural version of Dempster–Hill.
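A sketch of conformalizing a base predictive distribution (my own code; here F is a plain Gaussian fit to the labels, a stand-in for a real base predictive system such as Nadaraya–Watson, and all names are illustrative):

```python
from statistics import NormalDist
import numpy as np

def conformalize(train_proper, calibration, y, tau):
    """Split-conformal calibration with A((x, y); ...) := F(y),
    where F is a base predictive distribution function fitted on
    the training set proper (a Gaussian fit to the labels here)."""
    ys = [yy for (_, yy) in train_proper]
    F = NormalDist(mu=float(np.mean(ys)),
                   sigma=float(np.std(ys, ddof=1))).cdf
    cal = np.array([F(yy) for (_, yy) in calibration])
    a = F(y)
    less = np.sum(cal < a)
    ties = np.sum(cal == a)
    return (less + tau + tau * ties) / (len(cal) + 1)

# Since F is continuous and increasing, ranking F(y) among the
# calibration values reduces to ranking y among the calibration
# labels, i.e. the Dempster-Hill procedure on those labels.
print(conformalize([(0, 0.0), (0, 2.0)], [(0, 1.0), (0, 3.0)], 2.0, 0.5))  # 0.5
```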
Efficiency
A primitive efficiency result (http://alrw.net, Working
Paper 23):
If A((x, y); . . . ) as a function of y is the true distribution
function conditional on x, the difference between the true
distribution function and the predicted one is
O((n − m)^(−1/2)), with precise weak convergence results.
Can be stated for samples of any size (non-asymptotically).
This result is also true without the IID assumption.
But without the IID assumption we lose the validity
guarantee.
More conditional calibration?
If we have a large training set, it is natural to aim for
conditional probabilistic calibration.
If we identify object clusters of a reasonable size from the
training set proper, we can do calibration for each cluster
separately. (For example, 1000 calibration observations
will give accuracy ≈ 1/√1000 ≈ 3% in the estimates of
probability.)
We will have calibration inside each cluster, and the hope
is to approach Dawid’s full calibration.
A. Philip Dawid.
Calibration-based empirical probability (with discussion).
Annals of Statistics, 1985.
Repetitive structures
The IID assumption (equivalent to exchangeability for an
infinite sequence of observations) is a serious limitation of
conformal prediction.
Conformal prediction works in general repetitive structures
(Per Martin-Löf, Lauritzen), and there are lots of them
apart from the IID model.
For example, you can have partial exchangeability or
hypergraphical models.
Another recent extension (1)
There is one extension that is extremely useful for practical
applications.
Rina Foygel Barber, Emmanuel J. Candes, Aaditya
Ramdas, Ryan J. Tibshirani.
Conformal Prediction Under Covariate Shift.
arXiv, April 2019.
They generalize the exchangeability assumption to weighted
exchangeability.
Another recent extension (2)
What to do if the xi in the test set are generated from a
different distribution?
Example: a drug company looking for new drugs decides
to explore closely a specific region of the vast chemical
space of all compounds.
We still have the same distribution Y | X (the underlying
chemistry/biology) but the distribution of X changes.
Conformal prediction can be adapted to the situation
where dP′/dP is known or can be estimated (where P and
P′ are the old and new distributions of X, respectively).
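A sketch of the likelihood-ratio-weighted split-conformal pivot (my reading of the covariate-shift idea; all names are illustrative, and the code is a sketch rather than the paper's definitive algorithm):

```python
import numpy as np

def weighted_split_pivot(cal_scores, cal_x, test_x, test_score, w, tau):
    """Weighted split-conformal pivot under covariate shift:
    calibration scores are reweighted by w(x) = dP'/dP evaluated
    at their objects, and the test point carries its own weight."""
    wc = np.array([w(x) for x in cal_x])
    wt = w(test_x)
    total = wc.sum() + wt
    p = wc / total        # weights of the calibration points
    p_test = wt / total   # weight of the test point itself
    cal_scores = np.asarray(cal_scores, dtype=float)
    return ((p * (cal_scores < test_score)).sum()
            + tau * p_test
            + tau * (p * (cal_scores == test_score)).sum())

# Sanity check: with w identically 1 (no covariate shift) this
# reduces to the unweighted split-conformal pivot (2 + 0.5) / 4.
print(weighted_split_pivot([1.0, 2.0, 3.0], [0, 0, 0], 0, 2.5,
                           lambda x: 1.0, 0.5))  # 0.625
```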
Conclusion
Key messages of this talk:
There are notions of validity different from traditional
probabilistic calibration (some of them are stronger), and
they deserve to be studied for the modern versions of
fiducial prediction (Hannig, Martin, . . . ).
There are ways to extend fiducial prediction to
nonparametric settings, including those useful in
regression problems.
Thank you for your attention!