A popular view of probability forecasting is that its aim is to maximize the sharpness of predictive distributions subject to their calibration (Gneiting et al., 2003+). Informally, calibration is the agreement between the predictive distributions and the observations, and its most popular formalization is calibration in probability. Sharpness refers to the concentration of the predictive distributions and does not depend on the observations. In this talk I will focus on conformal prediction, which is a method for producing provably calibrated predictive distributions, in the sense of calibration in probability and under the assumption, standard in machine learning, that the observations are produced independently from the same distribution (the IID assumption). While calibration is automatic under the IID assumption, achieving sharpness requires careful design of conformal predictors. My plan is to state asymptotic and small-sample results about their sharpness, with and without the IID assumption.
MUMS: Bayesian, Fiducial, and Frequentist Conference - Calibration of Probability Forecasts, Vladimir Vovk, April 29, 2019
1. Parametric fiducial prediction
Dempster–Hill procedure
Conformal predictive distributions
Nonparametric fiducial prediction
Vladimir Vovk
(based on joint work with many people)
Royal Holloway, University of London
BFF 2019, Duke University, 29 April 2019
Vladimir Vovk Nonparametric fiducial prediction 1
This talk
This talk is about one of the Fs in BFF (fiducial). (At least this
is my interpretation.)
It is mostly about fiducial prediction.
Moreover: it is mostly about nonparametric fiducial
prediction.
I will start from history.
My plan
1 Parametric fiducial prediction
Fisher’s fiducial prediction
Terminology
Validity of fiducial prediction
2 Dempster–Hill procedure
3 Conformal predictive distributions
Fisher’s publications
It appears that Fisher has only two publications that discuss
fiducial prediction:
R. A. Fisher.
The fiducial argument in statistical inference.
Annals of Eugenics, 1935.
R. A. Fisher.
Statistical Methods and Scientific Inference.
1956 (3rd edition: 1973).
Fisher’s 1935 example
His example from the 1935 paper: the Gaussian IID model.
After observing y1, . . . , yn (past data), compute

ȳ := (1/n) ∑_{i=1}^n yi,   s² := (1/(n − 1)) ∑_{i=1}^n (yi − ȳ)².

Then

t := √(n/(n + 1)) · (y − ȳ)/s,

where y is a future observation, has Student's t-distribution
with n − 1 degrees of freedom (is a pivot).
General scheme (1)
I will only talk about predicting one future scalar
observation (uncontroversial part).
General scheme (at least in Fisher’s work): we combine
the past observations Ypast and future observation Y to
obtain a pivot U:
U := Q(Ypast, Y).
The distribution of U is independent of the parameter θ.
Without loss of generality we can assume that U is
uniformly distributed in [0, 1] (at least when it is
continuous): if not, replace U by FU(U), where FU is U’s
distribution function.
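The replacement of U by FU(U) is the probability integral transform. A quick numerical sanity check (my own illustration, not from the slides), with an exponential pivot standing in for U:

```python
import numpy as np

# Probability integral transform: if a pivot U is continuous with
# distribution function F_U, then F_U(U) is uniform on [0, 1].
# Sanity check with an exponential pivot, where F_U(u) = 1 - exp(-u).
rng = np.random.default_rng(0)
u = rng.exponential(size=100_000)
pit = 1.0 - np.exp(-u)

# The transformed sample should look uniform on [0, 1]:
# mean close to 1/2 and variance close to 1/12.
print(pit.mean(), pit.var())
```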
General scheme (2)
If y ∈ R and Q(ypast, y) is increasing in y, then y ↦ Q(ypast, y)
is a distribution function.
It is the fiducial (predictive) distribution.
In our example: the pivot

t = √(n/(n + 1)) · (y − ȳ)/s

becomes the fiducial (predictive) distribution (function)

Q(ypast, y) := F_{t_{n−1}}( √(n/(n + 1)) · (y − ȳ)/s ),

F_{t_{n−1}} being the distribution function of Student's
t-distribution with n − 1 degrees of freedom.
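As an illustration (my own code, not on the slides; `fisher_fiducial_cdf` is a made-up name), this distribution function is easy to compute with scipy:

```python
import numpy as np
from scipy import stats

def fisher_fiducial_cdf(past, y):
    """Fiducial predictive distribution function for a future
    observation under the Gaussian IID model (Fisher, 1935)."""
    past = np.asarray(past, dtype=float)
    n = len(past)
    ybar = past.mean()
    s = past.std(ddof=1)                       # the 1/(n-1) variance
    t = np.sqrt(n / (n + 1)) * (y - ybar) / s  # Student's t pivot
    return stats.t.cdf(t, df=n - 1)

# The fiducial CDF evaluated at the sample mean is 1/2.
past = [1.2, 0.7, 1.9, 1.1, 0.8]
print(fisher_fiducial_cdf(past, np.mean(past)))  # 0.5
```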
General scheme (3)
Summary:
fiducial predictive distribution = uniform pivot
(provided the pivot is a distribution function).
Notice: no explicit “fiducial inversion” is needed in this
exposition (unless we want prediction regions).
Fisher emphasized both continuity (in many publications)
and monotonicity (in 1962 in letters to Barnard and Sprott,
for parametric fiducial inference).
In Fisher’s own words (in a non-predictive context)
His letter to Barnard in March (?) 1962:
A pivotal quantity is a function of parameters and statistics, the
distribution of which is independent of all parameters. To be of
any use in deducing probability statements about parameters,
let me add
(a) it involves only one parameter,
(b) the statistics involved are jointly exhaustive for that
parameter,
(c) it varies monotonically with that parameter.
In his publications Fisher also mentions continuity (“the
observations should not be discontinuous” in the 1956 book).
Unpleasant possibility
There is no guarantee that Q(ypast, −∞) = 0 and
Q(ypast, ∞) = 1 for all ypast.
For example, Q(ypast, −∞) > 0 means that there is a
positive mass at −∞.
We might want to disallow this.
Fiducial prediction: existing terminology
fiducial distribution for a future observation (Fisher, 1935)
fiducial distribution (Hora, Buehler, McCullagh)
fiducial predictive distribution (Dawid, Wang)
predictive fiducial distribution (Hannig, Iyer, Wang)
marginal association (Martin and Liu)
predictive confidence distribution (Schweder, Hjort)
predictive distribution (Xie, Liu, Shen)
People often avoid “fiducial” to stay away from controversy.
The terminology of this talk
Now I prefer to talk about predictive distributions without
imposing any conditions of validity a priori.
Fisher’s probabilistic calibration Q(Ypast, Y) ∼ U may be
the most fundamental notion of validity, but we may also
want to have Q(Ypast, Y) ∼ U given a σ-algebra F.
If F is generated by Ypast, it’s the ideal situation, but only
Bayesians can have it.
There are lots of alternative definitions (“marginal
calibration”, “F-ideal calibration”,. . . ).
Validity and efficiency of fiducial prediction
Since Q(Ypast, Y) is a pivot, fiducial prediction is calibrated
in probability by definition.
The main problem is its efficiency (or sharpness). Fisher
insisted that fiducial inference should be based on
exhaustive statistics, leading to uniqueness. This part of
his programme failed.
Gneiting et al.’s paradigm:
Probabilistic forecasting has the general goal of
maximizing the sharpness of predictive distributions,
subject to calibration.
An example of conditional probabilistic calibration (1)
Peter McCullagh: in many cases (such as linear regression),
fiducial prediction is probabilistically calibrated conditionally on
a non-trivial σ-algebra.
Peter McCullagh.
Fiducial prediction.
2004,
http://www.stat.uchicago.edu/~pmcc/reports/fiducial.pdf
McCullagh, Vovk, Nouretdinov, Devetyarov, Gammerman.
Conditional prediction intervals for linear regression.
ICMLA 2009.
An example of conditional probabilistic calibration (2)
Our model is yi = βᵀxi + σξi, where the xi are fixed vectors and
the ξi are IID with a known distribution P (which does not have
to be Gaussian).
Let F be the σ-algebra of events invariant under the
transformations (y1, y2, . . . ) ↦ (aᵀx1 + by1, aᵀx2 + by2, . . . ).
Then the fiducial predictive distribution is probabilistically
calibrated given F.
Nonparametric fiducial prediction in Fisher’s work
It might not even exist (Teddy Seidenfeld, personal
communication at BFF 2017).
But lots of authors believe that it does (Dempster 1963,
Lane and Sudderth 1984, Hill 1992, Coolen 1998).
In his 1992 paper “Bayesian nonparametric prediction and
statistical inference”, Bruce M. Hill writes:
Note that for all three of these authors [Student, Fisher,
Dempster] the justification for An seems to be purely
intuitive. Thus none give anything vaguely representing a
“proof” for An. . . .
In my talk at BFF4 I referred to Fisher–Dempster–Hill
(which became “Dempster–Hill” in the published paper).
Nonparametric fiducial inference in Fisher’s work
But Fisher definitely introduced nonparametric fiducial
inference for parameters.
Fisher traced the idea back to Student (in “Student”, 1939
paper).
In the case of two observations y1 and y2 from N(µ, 1), the
probability that µ < y(1) is 1/4, the probability that
µ ∈ (y(1), y(2)) is 1/2, and the probability that µ > y(2) is 1/4.
Fisher extended this to an arbitrary sample size n and to
the pth quantile µp (dropping the Gaussian assumption).
Predicting the third observation
Nonparametric fiducial prediction was started by Jeffreys
(1932), predicting a third observation after seeing two.
Fiducial derivation: Seidenfeld, 1995.
Fisher did not accept Jeffreys’s argument as fiducial;
probably because it’s blatantly discontinuous.
Full Dempster–Hill procedure
Dempster (via the fiducial argument) and then Hill (followed by
Coolen's NPI) extended it to n observations.
Hill’s definition of An: given the data y1, . . . , yn, the
probability that the next observation y falls in (y(i), y(i+1)) is
1/(n + 1), for each i = 0, . . . , n. By definition, y(0) = −∞,
and y(n+1) = ∞.
To make it into a fiducial predictive distribution, we need a
pivot.
To get a continuous pivot, let’s randomize: for τ ∼ U,
Q(ypast, y) := ( |{i : yi < y}| + τ + τ·|{i : yi = y}| ) / (n + 1)

(the last addend takes care of possible ties).
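A minimal sketch of the randomized Dempster–Hill pivot (my own code; `dempster_hill_pivot` is an illustrative name):

```python
import numpy as np

def dempster_hill_pivot(y_past, y, tau):
    """Randomized Dempster-Hill pivot: with tau ~ U[0, 1], this
    quantity is uniformly distributed on [0, 1] under the IID model."""
    y_past = np.asarray(y_past, dtype=float)
    n = len(y_past)
    below = np.sum(y_past < y)   # observations strictly below y
    ties = np.sum(y_past == y)   # possible ties with y
    return (below + tau + tau * ties) / (n + 1)

# With tau = 0.5 and no ties, a y between the 2nd and 3rd order
# statistics of 4 observations gets pivot (2 + 0.5) / 5 = 0.5.
print(dempster_hill_pivot([3.0, 1.0, 4.0, 2.0], 2.5, 0.5))  # 0.5
```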
What the predictive distribution may look like
[Figure: a plot of the predictive distribution function Q(y) against y.]
Our setting
Limitation of the Dempster–Hill procedure (constantly
criticized on this account): it does not cover regression or
classification; we need observations (x, y), not just y.
Our statistical model (for now): the observations are IID;
standard in machine learning. Each observation: pair
(x, y) (an object and its label).
There is a natural conformal pivot (uniform in [0, 1]).
Conformity measure
A conformity measure is a function A mapping
observations (z1, . . . , zl) to conformity scores that is
equivariant: for any l and any permutation π of {1, . . . , l},
A(z1, . . . , zl) = (α1, . . . , αl) =⇒ A(zπ(1), . . . , zπ(l)) = (απ(1), . . . , απ(l)).
Intuitively, αi measures how well zi conforms to the other
observations among z1, . . . , zl.
Usually A is built on top of some “base” algorithm. A
simple example:
αi := yi − ŷi.
Conformal pivot
The conformal pivot determined by a conformity measure A is
Q(y) := ( |{i : α_i^y < α^y}| + τ + τ·|{i : α_i^y = α^y}| ) / (n + 1),

where i ranges over 1, . . . , n and

(α_1^y, . . . , α_n^y, α^y) := A(z1, . . . , zn, (x, y)).
The implementation is not as easy as it looks!
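A brute-force sketch of the conformal pivot (my own code, not from the slides; the conformity measure, residuals from the overall label mean, is a toy assumption chosen so that Q is increasing in y):

```python
import numpy as np

def mean_residual_conformity(z):
    """Toy conformity measure (illustrative, not from the slides):
    alpha_i = y_i - ybar, the residual from the overall label mean.
    It is equivariant under permutations of the observations."""
    ys = np.array([y for (_, y) in z])
    return ys - ys.mean()

def conformal_pivot(z_train, x, y, tau, conformity=mean_residual_conformity):
    """Full conformal pivot Q(y): augment the training data with the
    test pair (x, y), recompute all conformity scores, and rank the
    test score among them (randomized tie-breaking via tau)."""
    scores = conformity(z_train + [(x, y)])
    alpha_y, others = scores[-1], scores[:-1]
    less = np.sum(others < alpha_y)
    ties = np.sum(others == alpha_y)
    return (less + tau + tau * ties) / (len(others) + 1)

# Q(y) is increasing in y for this conformity measure:
z_train = [(0, 1.0), (0, 2.0), (0, 3.0)]
print(conformal_pivot(z_train, 0, 0.0, 0.5))  # 0.125
print(conformal_pivot(z_train, 0, 4.0, 0.5))  # 0.875
```

Note that evaluating the full distribution function requires re-running A for every candidate value of y, which is one reason efficient implementations are non-trivial.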
Some literature
Vladimir Vovk, Alex Gammerman, and Glenn Shafer.
Algorithmic Learning in a Random World.
Springer, New York, 2005.
Glenn Shafer and Vladimir Vovk.
A tutorial on conformal prediction.
Journal of Machine Learning Research 2008.
Most of the papers mentioned in this talk where I am a
co-author are available at http://alrw.net.
Least Squares Predictive Machine (LSPM)
The main struggle is for the monotonicity in y. Even for
αi := yi − ŷi, it's not so obvious!
If ŷi is the LS estimate based on all of z1, . . . , zl (“full
residual”), monotonicity can be violated.
If ŷi is the LS estimate based on z1, . . . , zl with zi removed
(“deleted residual”), monotonicity can be violated.
But in an intermediate situation (“studentized residual”),
monotonicity always holds.
We can “kernelize” LSPM to cover non-linear situations.
Efficiency results for LSPM
Fisher’s ideas (using exhaustive statistics leading to
uniqueness) do not seem to work.
A more promising (albeit more restrictive) idea (Burnaev,
Wasserman):
assume a statistical model under which the base algorithm
works perfectly
show that the corresponding conformal predictive
distributions also work well (so that the guaranteed validity
does not cost much).
For LSPM, the difference between the true distribution
function and the predicted one is O(n^(−1/2)), with precise
weak convergence results.
Asymptotic efficiency
Conformal predictive distributions are universally
consistent, for a suitable conformity measure.
They can be built on top of many classical universally
consistent algorithms, such as nearest neighbours.
Exact statements and proofs
Vladimir Vovk, Jieli Shen, Valery Manokhin, and Min-ge
Xie.
Nonparametric predictive distributions based on conformal
prediction.
Machine Learning, 2019.
Vladimir Vovk, Ilia Nouretdinov, Valery Manokhin, and Alex
Gammerman.
Conformal predictive distributions with kernels.
In: Braverman’s Readings in Machine Learning, Lecture
Notes in Artificial Intelligence, 2018.
http://alrw.net, Working Paper 18.
Split-conformal pivot (1)
Remember that full conformal predictive distributions may be
difficult to compute (this depends on the conformity measure).
Let us divide the training set z1, . . . , zn into two parts:
the training set proper, z1, . . . , zm, of size m,
and the calibration set, zm+1, . . . , zn, of size n − m.
Split-conformal pivot (2)
The split-conformal pivot for a test object x is
Q(y) := ( |{i : αi < α}| + τ + τ·|{i : αi = α}| ) / (n − m + 1),

where i ranges over m + 1, . . . , n,

αi := A(zi; z1, . . . , zm),   α := A((x, y); z1, . . . , zm),

and there are no restrictions on the split-conformity measure A.
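A sketch of the split-conformal predictive distribution (my own code; the split-conformity measure, a residual from the mean label of the training set proper, is an illustrative assumption):

```python
import numpy as np

def split_conformal_cdf(train_proper, calibration, x, y_grid, tau, A):
    """Split-conformal predictive distribution: conformity scores of
    the calibration set are computed once, using only the training
    set proper, then the test score is ranked among them."""
    cal_scores = np.array([A(z, train_proper) for z in calibration])
    k = len(cal_scores)
    out = []
    for y in y_grid:
        a = A((x, y), train_proper)
        less = np.sum(cal_scores < a)
        ties = np.sum(cal_scores == a)
        out.append((less + tau + tau * ties) / (k + 1))
    return np.array(out)

def residual_score(z, train_proper):
    """Illustrative split-conformity measure: the residual of the
    label from the mean label of the training set proper."""
    (_, y) = z
    ybar = np.mean([yy for (_, yy) in train_proper])
    return y - ybar

train_proper = [(0, 1.0), (0, 3.0)]
calibration = [(0, 1.5), (0, 2.5)]
# Increasing from 1/6 through 1/2 to 5/6:
print(split_conformal_cdf(train_proper, calibration, 0,
                          [0.0, 2.0, 4.0], 0.5, residual_score))
```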
Discussion
Split-conformal predictive distributions are computationally
efficient but may lose predictive efficiency as compared
with full conformal predictive distributions, which use the
full training set as both training set proper and calibration
set.
Way out: divide the training set into a number of folds (as
in cross-validation) and use each fold in turn as calibration
set.
The resulting cross-conformal predictive distribution loses
guaranteed validity but is well-calibrated in practice (unless
the base algorithm is wildly randomized).
Added flexibility
Problem with the conformity measure αi := yi − ˆyi: it
implicitly assumes homoscedasticity.
We can use a more flexible base algorithm, but then face
difficult calculations for full conformal predictive
distributions.
Or we can use the split-conformal method, which is trivial;
just make sure to make A((x, y); z1, . . . , zm) an increasing
function of y.
Conformalizing predictive distributions
In particular, we can take
A((x, y); z1, . . . , zm) := F(y),
where F is a standard predictive distribution function for
the label of x computed from z1, . . . , zm as training set,
such as Nadaraya–Watson.
The resulting predictive distributions are probabilistically
calibrated under the IID assumption.
Therefore, “conformalizing” is a way of calibrating
predictive distributions.
A natural version of Dempster–Hill.
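A sketch of conformalizing a base predictive distribution (my own code; here F is a plain Gaussian fit to the labels, a stand-in for a real base predictive system such as Nadaraya–Watson, and all names are illustrative):

```python
from statistics import NormalDist
import numpy as np

def conformalize(train_proper, calibration, y, tau):
    """Split-conformal calibration with A((x, y); ...) := F(y),
    where F is a base predictive distribution function fitted on
    the training set proper (a Gaussian fit to the labels here)."""
    ys = [yy for (_, yy) in train_proper]
    F = NormalDist(mu=float(np.mean(ys)),
                   sigma=float(np.std(ys, ddof=1))).cdf
    cal = np.array([F(yy) for (_, yy) in calibration])
    a = F(y)
    less = np.sum(cal < a)
    ties = np.sum(cal == a)
    return (less + tau + tau * ties) / (len(cal) + 1)

# Since F is continuous and increasing, ranking F(y) among the
# calibration values reduces to ranking y among the calibration
# labels, i.e. the Dempster-Hill procedure on those labels.
print(conformalize([(0, 0.0), (0, 2.0)], [(0, 1.0), (0, 3.0)], 2.0, 0.5))  # 0.5
```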
Efficiency
A primitive efficiency result (http://alrw.net, Working
Paper 23):
If A((x, y); . . . ) as a function of y is the true distribution
function conditional on x, the difference between the true
distribution function and the predicted one is
O((n − m)^(−1/2)), with precise weak convergence results.
Can be stated for samples of any size (non-asymptotically).
This result is also true without the IID assumption.
But without the IID assumption we lose the validity
guarantee.
More conditional calibration?
If we have a large training set, it is natural to aim for
conditional probabilistic calibration.
If we identify object clusters of a reasonable size from the
training set proper, we can do calibration for each cluster
separately. (For example, 1000 calibration observations
will give accuracy ≈ 1/√1000 ≈ 3% in the estimates of
probability.)
We will have calibration inside each cluster, and the hope
is to approach Dawid’s full calibration.
A. Philip Dawid.
Calibration-based empirical probability (with discussion).
Annals of Statistics, 1985.
Repetitive structures
The IID assumption (equivalent to exchangeability for an
infinite sequence of observations) is a serious limitation of
conformal prediction.
Conformal prediction works in general repetitive structures
(Per Martin-Löf, Lauritzen), and there are lots of them
apart from the IID model.
For example, you can have partial exchangeability or
hypergraphical models.
Another recent extension (1)
There is one extension that is extremely useful for practical
applications.
Rina Foygel Barber, Emmanuel J. Candes, Aaditya
Ramdas, Ryan J. Tibshirani.
Conformal Prediction Under Covariate Shift.
arXiv, April 2019.
They generalize the exchangeability assumption to weighted
exchangeability.
Another recent extension (2)
What to do if the xi in the test set are generated from a
different distribution?
Example: a drug company looking for new drugs decides
to explore closely a specific region of the vast chemical
space of all compounds.
We still have the same distribution Y | X (the underlying
chemistry/biology) but the distribution of X changes.
Conformal prediction can be adapted to the situation
where dP′/dP is known or can be estimated (where P and
P′ are the old and new distributions of X, respectively).
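A sketch of the likelihood-ratio-weighted split-conformal pivot (my reading of the covariate-shift idea; all names are illustrative, and the code is a sketch rather than the paper's definitive algorithm):

```python
import numpy as np

def weighted_split_pivot(cal_scores, cal_x, test_x, test_score, w, tau):
    """Weighted split-conformal pivot under covariate shift:
    calibration scores are reweighted by w(x) = dP'/dP evaluated
    at their objects, and the test point carries its own weight."""
    wc = np.array([w(x) for x in cal_x])
    wt = w(test_x)
    total = wc.sum() + wt
    p = wc / total        # weights of the calibration points
    p_test = wt / total   # weight of the test point itself
    cal_scores = np.asarray(cal_scores, dtype=float)
    return ((p * (cal_scores < test_score)).sum()
            + tau * p_test
            + tau * (p * (cal_scores == test_score)).sum())

# Sanity check: with w identically 1 (no covariate shift) this
# reduces to the unweighted split-conformal pivot (2 + 0.5) / 4.
print(weighted_split_pivot([1.0, 2.0, 3.0], [0, 0, 0], 0, 2.5,
                           lambda x: 1.0, 0.5))  # 0.625
```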
Conclusion
Key messages of this talk:
There are notions of validity different from traditional
probabilistic calibration (some of them are stronger), and
they deserve to be studied for the modern versions of
fiducial prediction (Hannig, Martin, . . . ).
There are ways to extend fiducial prediction to
nonparametric settings, including those useful in
regression problems.
Thank you for your attention!