We elaborate on hierarchical credal sets, which are sets of probability mass functions paired with second-order distributions. A new criterion to make decisions based on these models is proposed. This is achieved by sampling from the set of mass functions and considering the Kullback-Leibler divergence from the weighted center of mass of the set. We evaluate this criterion in a simple classification scenario: the results show performance improvements when compared to a credal classifier where the second-order distribution is not taken into account.
Decision Making with Hierarchical Credal Sets (IPMU 2014)
1. Decision Making with Hierarchical Credal Sets
Alessandro Antonucci (1), Alexander Karlsson (2), David Sundgren (3)
(1) IDSIA (Switzerland)
(2) University of Skövde (Sweden)
(3) Stockholm University (Sweden)
IPMU 2014, Montpellier, July 18th, 2014
2. Outline
Background on credal sets and hierarchical models
Credal sets are not hierarchical models
Hierarchical credal sets
Decision making with hierarchical credal sets
Application to credal classification
Conclusions and outlooks
3. Background on credal sets and hierarchical models
Model of uncertainty about a variable X taking values in $\Omega_X$; goal: estimating the expected value of $f : \Omega_X \to \mathbb{R}$.
Probability mass function $P(X)$:
$E_P[f] := \sum_{x \in \Omega_X} P(x) \cdot f(x)$
Credal set $K(X)$ (convex set of mass functions):
$\underline{E}_K[f] := \min_{P(X) \in K(X)} \sum_{x \in \Omega_X} P(x) \cdot f(x)$
Hierarchical model $[K(X), \pi(\Theta)]$:
$E_{K,\pi}[f] := \int_{\Omega_\Theta} E_{P_\theta}[f] \cdot \pi(\theta) \, d\theta = E_{P_{K,\pi}}[f]$
where $\{P_\theta(X)\}_{\theta \in \Omega_\Theta} = K(X)$ and $P_{K,\pi}(X) := \int_{\Omega_\Theta} P_\theta(X) \cdot \pi(\theta) \, d\theta$ (the weighted center of mass).
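The three expectations above can be contrasted numerically. A minimal sketch, with a hypothetical three-state credal set given by its vertices and the hierarchical model approximated by Monte Carlo over uniform Dirichlet weights (all numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical credal set K(X) over Omega_X = {x1, x2, x3}, given by its
# extreme points, and a payoff function f on Omega_X.
vertices = np.array([[0.5, 0.3, 0.2],
                     [0.2, 0.5, 0.3],
                     [0.3, 0.2, 0.5]])
f = np.array([1.0, 0.0, -1.0])

# Precise model: E_P[f] = sum_x P(x) f(x) for a single mass function.
E_P = vertices[0] @ f

# Credal set: the lower expectation is attained at a vertex of the polytope.
E_lower = min(v @ f for v in vertices)

# Hierarchical model: E_{K,pi}[f] = E_{P_{K,pi}}[f], approximated by Monte
# Carlo over theta (here a uniform Dirichlet prior on the vertex weights).
weights = rng.dirichlet(np.ones(len(vertices)), size=10_000)
center_of_mass = (weights @ vertices).mean(axis=0)  # weighted CoM of K(X)
E_hier = center_of_mass @ f

# The lower expectation bounds the hierarchical one from below.
assert E_lower <= E_hier + 1e-9
```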
4. (Of course) Credal sets are not hierarchical models
Parametrization with $\Theta$ even with a pure credal set $K(X)$:
$\underline{E}_K[f] = E_{P^*}[f]$ for at least one $P^*(X) \in K(X)$, with $P^*(X) = P_{\theta^*}(X)$.
The (improper) prior $\pi(\theta) = \delta_{\theta,\theta^*}$ reproduces $\underline{E}_K$, but only for this particular $f$!
Different priors for different $f$ $\Rightarrow$ a set of priors.
A credal set over $\Theta$: it should be the vacuous one, $K_0(\Theta)$.
Credal sets are (sort of) hierarchical models, but a vacuous credal set should be placed on the second level:
$K(X) \equiv [P_\Theta(X), K_0(\Theta)]$
For credal networks, this is the Cano-Cano-Moral transformation!
5. Hierarchical credal sets
Hierarchical model: $[P_\Theta(X), \pi(\Theta)]$
(Hierarchical view of) credal sets: $[P_\Theta(X), K_0(\Theta)]$
"Hierarchical credal set" $[P_\Theta(X), K'(\Theta)]$, equivalent to
$K'(X) = \left\{ \int_{\Omega_\Theta} P_\theta(X) \cdot \pi(\theta) \, d\theta : \pi(\Theta) \in K'(\Theta) \right\} \subseteq K(X)$
Trade-off between realism/cautiousness and informativeness:
$\underline{E}_K[f] \le \underline{E}_{K'}[f] \le E_{K,\pi}[f] \le \overline{E}_{K'}[f] \le \overline{E}_K[f]$
assuming $\pi(\Theta) \in K'(\Theta)$.
How to choose $K'(\Theta)$?
6. Shrinking (but not too much!)
Likelihood-based learning of credal sets [Cattaneo]:
$\pi(\Theta) \propto P_\Theta(\mathcal{D})$
Model revision:
$\pi(\Theta) \to K_\alpha(\Theta) = \{ \pi'(\Theta) : \pi'(\theta) = 0 \text{ if } \pi(\theta) < \alpha \cdot \pi(\theta_{ML}) \}$
Cope with $[P_\Theta(X), K_\alpha(\Theta)]$
Shifted Dirichlet prior [Karlsson & Sundgren]: a prior over credal sets induced by probability intervals
$\pi_{s,t}(\Theta) \propto \prod_{i=1}^n [\theta_i - \underline{P}(x_i)]^{s t_i - 1}$
$P_{K,\pi}(x_i) = \underline{P}(x_i) + t_i \left[ 1 - \sum_{j=1}^n \underline{P}(x_j) \right]$
Back to an imprecise model?
Sampling from $K(X)$ based on $\pi_{s,t}$
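The center-of-mass formula above is easy to evaluate in closed form. A minimal sketch with hypothetical lower bounds and shift parameters (not values from the paper):

```python
import numpy as np

# Hypothetical probability intervals via lower bounds P_low (the credal set
# of mass functions dominating them) and shifted-Dirichlet parameters t
# (t_i >= 0, sum(t) = 1) controlling where the prior mass concentrates.
P_low = np.array([0.2, 0.1, 0.3])
t = np.array([0.5, 0.3, 0.2])

# Center of mass of the shifted Dirichlet: each state gets its lower bound
# plus a t_i-share of the leftover mass 1 - sum_j P_low(x_j).
leftover = 1.0 - P_low.sum()
P_com = P_low + t * leftover

# P_com is a valid mass function dominating the lower bounds.
assert np.isclose(P_com.sum(), 1.0)
assert np.all(P_com >= P_low)
```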
7. Sampling from a credal set
A swarm of "particles": $K(X) \supset \{P_k(X)\} \sim \pi_{s,t}(\Theta)$
Weighted sampling from polytopes as a two-step process:
(i) Uniform sampling by convex combination of the vertices
(convex combination weights sampled uniformly from the simplex)
(ii) "Sampling from the sample"
(discrete resampling weighted by the prior)
For large swarms, the empirical and theoretical CoMs coincide
Heuristics to remove particles: Kullback-Leibler divergence from the CoM
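The two-step sampling scheme can be sketched as follows. The credal set, the prior, and all parameter values are hypothetical; the Dirichlet(1,...,1) convex combination is uniform over the polytope when the vertices are affinely independent, as in this triangle example:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_credal_set(vertices, prior, n_uniform=5000, n_out=500):
    """Two-step weighted sampling from a polytope of mass functions.

    (i)  Convex combinations of the vertices with Dirichlet(1,...,1)
         weights drawn uniformly from the simplex.
    (ii) 'Sampling from the sample': discrete resampling of the particles
         with probabilities proportional to the prior density.
    """
    k = len(vertices)
    weights = rng.dirichlet(np.ones(k), size=n_uniform)
    particles = weights @ vertices                 # step (i): points in K(X)
    p = np.array([prior(P) for P in particles])
    p = p / p.sum()                                # normalize prior weights
    idx = rng.choice(n_uniform, size=n_out, p=p)   # step (ii): resample
    return particles[idx]

# Hypothetical triangular credal set and a prior peaked near its CoM.
vertices = np.array([[0.6, 0.2, 0.2],
                     [0.2, 0.6, 0.2],
                     [0.2, 0.2, 0.6]])
com = vertices.mean(axis=0)
prior = lambda P: np.exp(-50 * np.sum((P - com) ** 2))

swarm = sample_credal_set(vertices, prior)
print(swarm.mean(axis=0))  # empirical CoM, close to the theoretical one
```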
8. Application to decision making
Simplest DM task: most probable state $x^* := \arg\max_x P(x)$
With $K(X)$: $\Omega^*_X = \{ x^* \in \Omega_X \mid \exists P(X) \in K(X) : x^* = \arg\max_x P(x) \}$
With $[K(X), \pi_{s,t}(\Theta)]$: $x^* := \arg\max_x P_{K,\pi_{s,t}}(x)$
Alternatively:
$[K(X), \pi_{s,t}(\Theta)] \to \{P_j(X)\}_{j=1}^m$
Shrink it to $K'(X)$ (heuristics)
Take the decision with $K'(X)$
[Figure: particle swarm on the probability simplex over $P(x_1)$, $P(x_2)$, $P(x_3)$]
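The two decision rules can be contrasted on a particle swarm. A minimal sketch with a hypothetical credal set (numbers are illustrative only): the credal rule returns every state that is most probable under some particle, while the hierarchical rule returns the single most probable state under the center of mass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical particle swarm {P_j(X)} drawn from a credal set by convex
# combination of its vertices with uniform Dirichlet weights.
vertices = np.array([[0.5, 0.3, 0.2],
                     [0.3, 0.45, 0.25],
                     [0.35, 0.3, 0.35]])
weights = rng.dirichlet(np.ones(3), size=1000)
particles = weights @ vertices

# Credal decision: all states most probable under SOME particle.
credal_decision = set(np.argmax(particles, axis=1))

# Hierarchical decision: single most probable state under the CoM.
com_decision = int(np.argmax(particles.mean(axis=0)))

print(credal_decision, com_decision)
# The single hierarchical decision always lies in the credal output set.
assert com_decision in credal_decision
```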
15. Testing the approach on a (credal) classification setup
Classification setup: class $C$ and features $F$.
Given an instance of the features $F = \tilde{f}$, which $c \in \Omega_C$?
(B) Naive Bayes: $P(c, f) = P(c) \prod_i P(f_i \mid c)$; decision based on $P(C \mid \tilde{f})$.
(C) Naive credal: $K(C)$ and $P(F_i \mid c)$ learned by the (local) IDM; decision based on an outer approximation of $K(C \mid \tilde{f})$.
(H) Hierarchical/credal approach on $K(C \mid \tilde{f})$: priors can be easily propagated (multiplied), provided that someone assessed them.
(C) and (H) are credal classifiers (possibly more than a single class in output).
Accuracy of (B) compared with a utility-based performance descriptor for (C) and (H) [Zaffalon et al., 2014].
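Set-valued outputs make plain accuracy incomparable across classifiers, hence the utility-based descriptor. A minimal sketch of one common instance, u65, assuming the quadratic discounted-accuracy form from the utility-based evaluation literature (the class labels are hypothetical):

```python
def discounted_accuracy(pred_set, true_class):
    # d = 1/|Z| if the true class is in the output set Z, else 0.
    return 1.0 / len(pred_set) if true_class in pred_set else 0.0

def u65(d):
    # Quadratic utility rewarding informative (small) output sets:
    # u65(0) = 0, u65(1/2) = 0.65, u65(1) = 1.
    return -0.6 * d ** 2 + 1.6 * d

# A correct single-class output scores 1; a correct but vacuous
# two-class output scores 0.65; a wrong output scores 0.
print(u65(discounted_accuracy({"a"}, "a")))       # -> 1.0
print(u65(discounted_accuracy({"a", "b"}, "a")))  # -> about 0.65
print(u65(discounted_accuracy({"b"}, "a")))       # -> 0.0
```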
17. Conclusions and outlooks
A (better?) formalization of the relation between hierarchical and imprecise-probabilistic models
Heuristics to take more informative decisions in credal networks (provided that a prior can be assessed)
To do:
Better heuristics: finding the smallest credal set covering a given number of particles can be done with MILP
More ambitiously: a sound approach to learn $K'(\Theta)$
Release an R package for this