Here the interest is mainly in computing characteristics such as the entropy, the Kullback-Leibler divergence, more general $f$-divergences, and similar quantities based on the probability density. The density is often not available directly, and it is a computational challenge just to represent it in a numerically feasible fashion if the dimension is even moderately large; actually computing said characteristics in the high-dimensional case is an even stronger numerical challenge. The task considered here is the numerical computation of characterising statistics of high-dimensional pdfs, as well as their divergences and distances, where the pdf in the numerical implementation is assumed to be discretised on some regular grid. We demonstrate that high-dimensional pdfs, pcfs, and some functions of them can be approximated and represented in a low-rank tensor data format. Utilisation of low-rank tensor techniques reduces the computational complexity and the storage cost from exponential $\mathcal{O}(n^d)$ to linear in the dimension $d$, e.g.\ $\mathcal{O}(d n r^2)$ for the tensor train (TT) format. Here $n$ is the number of discretisation points in one direction, $r \ll n$ is the maximal tensor rank, and $d$ is the problem dimension.
1. Computing f-Divergences and Distances of High-Dimensional Probability Density Functions

Alexander Litvinenko (RWTH Aachen, Germany),
joint work with Youssef Marzouk, Hermann G. Matthies, Marco Scavino, Alessio Spantini
2. Plan
1. Motivating examples, working diagram
2. Basics: pdf, pcf, FFT
3. Theoretical background
4. Computation of moments and divergences
5. Tensor formats
6. Algorithms
7. Numerics
3. Motivation: How to compute in high dimensions?
Let $\xi = (\xi_1, \ldots, \xi_d) \in \mathbb{R}^d$ be a random vector with pdf $p_\xi$.
Entropy is the expectation of the negative logarithm of the pdf:
$$ h(p_\xi) := \mathbb{E}\left(-\ln p_\xi(y)\right) := \int_{\mathbb{R}^d} -\ln(p_\xi(y))\, p_\xi(y)\, \mathrm{d}y. \qquad (1) $$
How to compute the KLD and other divergences? The classical formulas are given only in terms of the pdfs, which are usually unknown.

Divergence: $D_\bullet(p\,\|\,q)$
KLD: $\int \log(p(x)/q(x))\, p(x)\, \mathrm{d}x$
Hellinger dist.: $\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x$
Bhattacharyya: $-\log \int \sqrt{p(x)\, q(x)}\, \mathrm{d}x$
4. Motivation: stochastic PDEs
$$ -\nabla \cdot (\kappa(x,\omega)\,\nabla u(x,\omega)) = f(x,\omega), \quad x \in \mathcal{G} \subset \mathbb{R}^d, \ \omega \in \Omega. $$
Write first the Karhunen-Loève expansion and then, for the uncorrelated random variables, the polynomial chaos expansion:
$$ u(x,\omega) = \sum_{i=1}^{K} \sqrt{\lambda_i}\, \phi_i(x)\, \xi_i(\omega) = \sum_{i=1}^{K} \sqrt{\lambda_i}\, \phi_i(x) \sum_{\alpha \in \mathcal{J}} \xi_i^{(\alpha)} H_\alpha(\theta(\omega)) \qquad (2) $$
$$ = \sum_{i=1}^{K} \sqrt{\lambda_i}\, \phi_i(x) \sum_{\alpha_1=1}^{p_1} \cdots \sum_{\alpha_M=1}^{p_M} \xi_i^{(\alpha_1,\ldots,\alpha_M)} \prod_{j=1}^{M} h_{\alpha_j}(\theta_j), \qquad (3) $$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M, \ldots)$.
How to compute the pdf of $u(x,\omega)$ from Eq. (2)?
5. How to compute the KLD and divergences if the pdf is not available?
There are two ways to compute the f-divergence, KLD, and entropy.
6. Connection of pcf and pdf
The probability characteristic function (pcf) $\varphi_\xi$ is defined as
$$ \varphi_\xi(t) := \mathbb{E}\left(\exp(\mathrm{i}\langle \xi|t\rangle)\right) := \int_{\mathbb{R}^d} p_\xi(y) \exp(\mathrm{i}\langle y|t\rangle)\, \mathrm{d}y =: \mathcal{F}^{[d]}(p_\xi)(t), $$
where $t = (t_1, t_2, \ldots, t_d) \in \mathbb{R}^d$, $\langle y|t\rangle = \sum_{j=1}^{d} y_j t_j$, and $\mathcal{F}^{[d]}$ is the probabilist's $d$-dimensional Fourier transform. Conversely,
$$ p_\xi(y) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \exp(-\mathrm{i}\langle t|y\rangle)\, \varphi_\xi(t)\, \mathrm{d}t = \mathcal{F}^{[-d]}(\varphi_\xi)(y). \qquad (4) $$
7. Discrete low-rank representation of pcf
We try to find an approximation
$$ \varphi_\xi(t) \approx \widetilde{\varphi}_\xi(t) = \sum_{\ell=1}^{R} \bigotimes_{\nu=1}^{d} \varphi_{\ell,\nu}(t_\nu), \qquad (5) $$
where the $\varphi_{\ell,\nu}(t_\nu)$ are one-dimensional functions.
Then we can get
$$ p_\xi(y) \approx \widetilde{p}_\xi(y) = \mathcal{F}^{[-d]}(\widetilde{\varphi}_\xi)(y) = \sum_{\ell=1}^{R} \bigotimes_{\nu=1}^{d} \mathcal{F}_1^{-1}(\varphi_{\ell,\nu})(y_\nu), $$
where $\mathcal{F}_1^{-1}$ is the one-dimensional inverse Fourier transform.
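As a minimal sketch of this factor-wise inverse transform (illustrative, not the talk's implementation): if the pcf is held in a rank-$R$ CP format sampled on the dual grid, the $d$-dimensional inverse FFT reduces to independent one-dimensional inverse FFTs applied to the factors. The factor matrices `phi_factors` and all sizes below are placeholder assumptions.

```python
import numpy as np

# Placeholder rank-R CP representation of the pcf on the dual grid:
# phi_factors[nu] holds the R one-dimensional factors for dimension nu as rows.
d, R, M = 3, 4, 64
rng = np.random.default_rng(0)
phi_factors = [rng.standard_normal((R, M)) + 1j * rng.standard_normal((R, M))
               for _ in range(d)]

# Factor-wise 1D inverse FFT: the d-dimensional inverse transform of a sum of
# tensor products equals the sum of tensor products of the 1D inverse transforms.
pdf_factors = [np.fft.ifft(f, axis=1) for f in phi_factors]

# Evaluate the resulting rank-R approximation of the pdf at one grid point:
def eval_cp(factors, idx):
    terms = np.ones(R, dtype=complex)
    for nu, i in enumerate(idx):
        terms *= factors[nu][:, i]
    return terms.sum()

print(eval_cp(pdf_factors, (0, 1, 2)))
```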
8. Discrete representation of the pdf
The discrete representation of the pdf and the pcf is based on equidistant grid vectors
$\hat{x}_{i_\nu,\nu} = \hat{x}_{1,\nu} + (i_\nu - 1)\Delta x_\nu$ (with increment $\Delta x_\nu$) of size $M_\nu$ in each dimension $1 \le \nu \le d$ of $\mathbb{R}^d$.
With $V = \prod_{\nu=1}^{d} M_\nu \Delta x_\nu$, the trapezoidal integration rule has weights $V/N$.
The whole grid is
$$ \hat{X} = \bigtimes_{\nu=1}^{d} \hat{x}_\nu. $$
$P := p_\xi(\hat{X})$ denotes the tensor $P \in \bigotimes_{\nu=1}^{d} \mathbb{R}^{M_\nu} =: \mathcal{T}$, $\dim \mathcal{T} = \prod_{\nu=1}^{d} M_\nu =: N$, the components of which are the evaluations of the pdf $p_\xi$ on the grid $\hat{X}$.
9. Discrete representation of the pcf
The dual grid is
$$ \hat{T} = \bigtimes_{\nu=1}^{d} \hat{t}_\nu, $$
with $\hat{t}_\nu := (\hat{t}_{1,\nu}, \ldots, \hat{t}_{M_\nu,\nu})$ and $\hat{t}_{M_\nu,\nu} = \pi/\Delta x_\nu$; the equidistant spacing of the dual grid in dimension $\nu$ is $2\pi/L_\nu$.
The origin lies on the dual grid, $0 \in \hat{T}$: there is a multi-index $j^0 = (j^0_1, \ldots, j^0_d)$ such that $(\hat{t}_{j^0_1,1}, \ldots, \hat{t}_{j^0_d,d}) = 0 = (0, \ldots, 0)$.
The pcf on the dual grid is represented through the tensor $\Phi := \varphi_\xi(\hat{T}) \in \mathcal{T}$.
Thus, we deal with $P := p_\xi(\hat{X})$ for the pdf and $\Phi := \varphi_\xi(\hat{T})$ for the pcf.
10. Notation, Moments and Covariance
For $x \in \mathbb{R}^d$, $x^{\otimes k} = \bigotimes_{j=1}^{k} x$. Let $\xi : \Omega \to \mathcal{V} = \mathbb{R}^d$ be a random vector.
The expectation operator is denoted by $\mathbb{E}(\cdot)$; the mean is $\bar{\xi} := \mathbb{E}(\xi) = \int_\Omega \xi(\omega)\, \mathbb{P}(\mathrm{d}\omega) \in \mathbb{R}^d$, and $\tilde{\xi} := \xi - \bar{\xi}$.
The moments $X_k$ and the central moments $\Xi_k$ of $\xi$ of order $k$ are
$$ X_k = \mathbb{E}\left(\xi^{\otimes k}\right) \in (\mathbb{R}^d)^{\otimes k}, \qquad \Xi_k = \mathbb{E}\left(\tilde{\xi}^{\otimes k}\right) \in (\mathbb{R}^d)^{\otimes k}. $$
The covariance matrix is $\Sigma_\xi = \operatorname{cov}\xi = \Xi_2 = X_2 - \bar{\xi} \otimes \bar{\xi} \in (\mathbb{R}^d)^{\otimes 2}$.
The mixed and mixed central moments are denoted by
$$ Y_{k,\ell} = \mathbb{E}\left(\xi^{\otimes k} \otimes \eta^{\otimes \ell}\right) \quad \text{and} \quad \Upsilon_{k,\ell} = \mathbb{E}\left(\tilde{\xi}^{\otimes k} \otimes \tilde{\eta}^{\otimes \ell}\right) \in (\mathbb{R}^d)^{\otimes k} \otimes (\mathbb{R}^n)^{\otimes \ell}. $$
The covariance is also denoted as $\operatorname{cov}(\xi,\eta) = \Upsilon_{1,1} = Y_{1,1} - \bar{\xi} \otimes \bar{\eta}$.
11. Higher order moments
$$ (-\mathrm{i}\,\partial_{t_k})\, \varphi_\xi(t) = \int_{\mathbb{R}^d} x_k \exp(\mathrm{i}\langle t|x\rangle)\, p_\xi(x)\, \mathrm{d}x = \mathcal{F}^{[d]}(x_k\, p_\xi(x))(t). $$
Further, denoting the tensor of $k$-th derivatives by $\mathrm{D}^k \varphi_\xi(t) = \left(\frac{\partial^k}{\partial t_{i_1} \cdots \partial t_{i_k}} \varphi_\xi(t)\right)$, one has
$$ (-\mathrm{i})^k\, \mathrm{D}^k \varphi_\xi(0) = \int_{\mathbb{R}^d} x^{\otimes k}\, p_\xi(x)\, \mathrm{d}x = \mathcal{F}^{[d]}\left(x^{\otimes k} p_\xi(x)\right)(0) = X_k. \qquad (6) $$
The second characteristic function (cumulant generating function), whose derivative tensors of order $k$ are essentially the cumulants $K_k$ of $\xi$, is defined as the point-wise logarithm of the pcf:
$$ \chi_\xi(t) := \log(\varphi_\xi(t)) = \log\left(\mathbb{E}\left(\exp(\mathrm{i}\langle t|\xi\rangle)\right)\right), \qquad (7) $$
with $(-\mathrm{i})^k\, \mathrm{D}^k \chi_\xi(0) =: K_k$.
12. Moment generating function
It is defined as
$$ M_\xi(t) := \mathbb{E}\left(\exp(\langle t|\xi\rangle)\right) = \int_{\mathbb{R}^d} \exp(\langle t|x\rangle)\, p_\xi(x)\, \mathrm{d}x = \mathcal{L}_d(p_\xi)(-t) = \varphi_\xi(-\mathrm{i}\, t), $$
where $\mathcal{L}_d(p_\xi)(t) = \int \exp(\langle -t|x\rangle)\, p_\xi(x)\, \mathrm{d}x$ is the two-sided $d$-dimensional Laplace transform of $p_\xi$. Then
$$ \mathrm{D}^k M_\xi(0) = \int_{\mathbb{R}^d} x^{\otimes k}\, p_\xi(x)\, \mathrm{d}x = X_k, \quad k \in \mathbb{N}_0. \qquad (8) $$
The cumulant generating function is the point-wise logarithm of the moment generating function $M_\xi$:
$$ K_\xi(t) := \log(M_\xi(t)) = \log\left(\mathbb{E}\left(\exp(\langle t|\xi\rangle)\right)\right), \qquad (9) $$
with $\mathrm{D}^k K_\xi(0) = K_k$.
13. Representation of a 3D tensor in the CP tensor format
A full tensor $w \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is represented as a sum of tensor products,
$$ w = \sum_{i=1}^{r} w_{i,1} \otimes w_{i,2} \otimes w_{i,3}. $$
(In the accompanying figure, the lines on the right denote the vectors $w_{i,k} \in \mathbb{R}^{n_k}$, $i = 1, \ldots, r$, $k = 1, 2, 3$.)
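A small numpy sketch of this representation (the factor vectors are random placeholders): `einsum` assembles the full tensor from the CP factors, which is only feasible at such toy sizes.

```python
import numpy as np

n1, n2, n3, r = 10, 12, 14, 5
rng = np.random.default_rng(1)
# CP factors: W[k] holds the r vectors w_{i,k} as rows.
W = [rng.standard_normal((r, n)) for n in (n1, n2, n3)]

# Full tensor w = sum_i w_{i,1} (x) w_{i,2} (x) w_{i,3}:
w_full = np.einsum('ia,ib,ic->abc', W[0], W[1], W[2])

# Storage: full tensor vs. CP factors.
print(w_full.size, sum(f.size for f in W))  # 1680 vs. 180
```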
14. Computation of QoIs
For tensors $P, F$ representing the pdf $p(x)$ and a function $f(\cdot)$ evaluated on the grid, one obtains
$$ \int p(x)\, \mathrm{d}x \approx S(P) := \frac{V}{N} \langle P | \mathbb{1} \rangle_{\mathcal{T}}, \qquad (10) $$
where $\mathbb{1} = (1_{i_1,\ldots,i_d})$ — the tensor with all entries equal to one — satisfies $\mathbb{1} \odot r = r$ for any $r \in \mathcal{T}$.
$$ \mathbb{E}(f(\xi)) = \int_{\mathbb{R}^d} f(x)\, p_\xi(x)\, \mathrm{d}x \approx S(F \odot P) = \frac{V}{N} \langle F | P \rangle_{\mathcal{T}}. $$
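The reason such quantities stay cheap in low-rank form: for CP tensors, $\langle F | P \rangle_{\mathcal{T}}$ factorizes into one-dimensional inner products per dimension, so the full tensor is never formed. A sketch with placeholder factors, verified against the full tensors (possible only at this toy size):

```python
import numpy as np

d, rF, rP, M = 4, 3, 5, 50
rng = np.random.default_rng(2)
F = [rng.standard_normal((rF, M)) for _ in range(d)]  # CP factors of F
P = [rng.standard_normal((rP, M)) for _ in range(d)]  # CP factors of P

def cp_inner(A, B):
    # <A|B> = sum_{i,j} prod_nu <a_{i,nu}, b_{j,nu}>
    G = np.ones((A[0].shape[0], B[0].shape[0]))
    for Anu, Bnu in zip(A, B):
        G *= Anu @ Bnu.T          # pairwise 1D inner products, per dimension
    return G.sum()

# Verification against the full tensors (d = 4 hard-coded in the einsum string).
full = lambda C: np.einsum('ia,ib,ic,id->abcd', *C)
assert np.allclose(cp_inner(F, P), np.sum(full(F) * full(P)))
print(cp_inner(F, P))
```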
15. Computation of QoIs
The differential entropy requires the point-wise logarithm of $P$:
$$ h(p) := \mathbb{E}_p(-\log p) \approx \mathbb{E}_P(-\log P) = -\frac{V}{N} \langle \log(P) | P \rangle. $$
The $f$-divergence of $p$ from $q$ and its discrete approximation are defined as
$$ D_f(p\,\|\,q) := \mathbb{E}_q\left(f\left(\frac{p}{q}\right)\right) \approx \mathbb{E}_Q\left(f(P \odot Q^{\odot -1})\right) = \frac{V}{N} \left\langle f(P \odot Q^{\odot -1}) \,\middle|\, Q \right\rangle. $$
16. List of some typical divergences and distances.
Divergence: $D_\bullet(p\,\|\,q)$
KLD, $D_{KL}$: $\int \log(p(x)/q(x))\, p(x)\, \mathrm{d}x = \mathbb{E}_p(\log(p/q))$
Hellinger, $(D_H)^2$: $\frac{1}{2}\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}x$
Bregman, $D_\phi$: $\int \left[(\phi(p(x)) - \phi(q(x))) - (p(x) - q(x))\, \phi'(q(x))\right] \mathrm{d}x$
Bhattacharyya, $D_{Bh}$: $-\log \int \sqrt{p(x)\, q(x)}\, \mathrm{d}x$
17. Discrete approximations for divergences above
Divergence: approximation of $D_\bullet(p\,\|\,q)$
KLD: $\frac{V}{N} \left(\langle \log(P) | P \rangle - \langle \log(Q) | P \rangle\right)$
$(D_H)^2$: $\frac{V}{2N} \left\langle P^{\odot 1/2} - Q^{\odot 1/2} \,\middle|\, P^{\odot 1/2} - Q^{\odot 1/2} \right\rangle$
$D_\phi$: $S\left((\phi(P) - \phi(Q)) - (P - Q) \odot \phi'(Q)\right)$
$D_{Bh}$: $-\log\left(\frac{V}{N} \langle P^{\odot 1/2} | Q^{\odot 1/2} \rangle\right)$
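A minimal one-dimensional sanity check of these discrete formulas with numpy (in high dimensions $P$ and $Q$ would instead be kept in low-rank format and the inner products evaluated factor-wise); the grid bounds are an arbitrary choice:

```python
import numpy as np

# 1D grid and two Gaussian pdfs evaluated on it.
n = 2048
x, dx = np.linspace(-20, 20, n, retstep=True)
gauss = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
P, Q = gauss(0.0, 1.0), gauss(1.0, 2.0)

w = dx  # integration weight V/N on this equidistant grid

kld = w * (np.sum(np.log(P) * P) - np.sum(np.log(Q) * P))
hell2 = 0.5 * w * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)

# Analytic KLD between N(0,1) and N(1,4): log(s2/s1) + (s1^2+(m1-m2)^2)/(2 s2^2) - 1/2
kld_exact = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kld, kld_exact, hell2)
```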
18. Algorithms
Below we list algorithms which approximate $f(p_\xi(y))$ by $f(P)$, where the $f$'s considered are
$$ f(\cdot) \in \{\operatorname{sign}(\cdot),\ (\cdot)^{-1},\ \sqrt{\cdot},\ \sqrt[m]{\cdot},\ (\cdot)^k,\ \log(\cdot),\ \exp(\cdot),\ (\cdot)^2,\ |\cdot|\}, \qquad (11) $$
$k \neq 0$, and $P = p_\xi(\hat{X}) = \sum_{j=1}^{r_p} \bigotimes_{\nu=1}^{d} p_{j,\nu}$.
Available methods:
1. TT-cross
2. iterative methods (e.g., Newton's algorithm)
3. power series
4. quadrature rules to compute the Dunford-Cauchy contour integral
5. others (like sinc quadrature)
19. Iterative methods
We want to compute $f(w)$ for some function $f : \mathcal{T} \to \mathcal{T}$.
We have an iteration function $\Psi_f$, which uses only operations from the Hadamard algebra on $\mathcal{T}$, and which is iterated,
$$ v_{i+1} = \Psi_f(v_i), $$
converging to a fixed point $\Psi_f(v_*) = v_*$.
When started with a $v_0$ depending on $w$, the fixed point is
$$ \lim_{i\to\infty} v_i = v_* = \Psi_f(v_*) = f(w). $$
20. Computing the pointwise inverse $w^{\odot -1}$

Let $F(x) := w - x^{\odot -1}$.
Applying Newton's method to $F(x)$ for approximating the inverse of a given tensor $w$, one obtains the iteration function
$$ \Psi_{-1}(v) = v \odot (2 \cdot \mathbb{1} - w \odot v), $$
with the initial condition $v_0 = \alpha \cdot w$ chosen to bring $w \odot v_0$ close to $\mathbb{1}$.
The iteration converges if the initial iterate $v_0$ satisfies $\|\mathbb{1} - w \odot v_0\|_\infty < 1$.
A possible candidate for the starting value is $v_0 = \alpha w$ with $\alpha \le (1/\|w\|_\infty)^2$.
For such a $v_0$, the convergence condition $\|\mathbb{1} - \alpha w^{\odot 2}\|_\infty < 1$ is always satisfied.
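A runnable sketch of this iteration, using a plain numpy array as a stand-in for a low-rank tensor (in the TT setting every step would additionally be followed by a rank truncation):

```python
import numpy as np

def hadamard_inverse(w, tol=1e-12, max_iter=100):
    """Newton iteration v <- v*(2*1 - w*v) for the pointwise inverse of w."""
    alpha = 1.0 / np.max(np.abs(w)) ** 2    # scaling so that ||1 - w*v0||_inf < 1
    v = alpha * w                           # initial iterate v0 = alpha * w
    for _ in range(max_iter):
        v = v * (2.0 - w * v)               # only Hadamard products and sums
        if np.max(np.abs(1.0 - w * v)) < tol:
            break
    return v

w = np.random.default_rng(3).uniform(0.5, 2.0, size=(4, 4, 4))
v = hadamard_inverse(w)
print(np.max(np.abs(v - 1.0 / w)))  # ~1e-13
```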
21. Computing the pointwise $\sqrt{w}$ via Newton iteration

Let $F(x) := x^{\odot 2} - w = 0$.
The Newton iteration is
$$ \Psi_{\sqrt{}}(v) = \frac{1}{2} \cdot \left(v + v^{\odot -1} \odot w\right), \qquad (12) $$
with initial condition $v_0 = (w + \mathbb{1})/2$.
22. Computing the pointwise $\sqrt{w}$ via Newton-Schulz iteration

Let $F(x) := x^{\odot 2} - w = 0$.
An alternative is the Newton-Schulz iteration, which computes both $v_*^+ = \sqrt{w} = w^{\odot 1/2}$ and $v_*^- = (\sqrt{w})^{\odot -1} = w^{\odot -1/2}$.
We set $V_0 = [y_0, z_0] = [\alpha \cdot w, \mathbb{1}] \in \mathcal{T}^2$ and use the auxiliary function $A(y, z) = 3 \cdot \mathbb{1} - z \odot y$:
$$ \Psi_{\sqrt{}} \begin{bmatrix} y \\ z \end{bmatrix} = \frac{1}{2} \begin{bmatrix} y \odot A(y, z) \\ A(y, z) \odot z \end{bmatrix}. \qquad (13) $$
The iteration converges to $V_* = [v_*^+, v_*^-] = [\sqrt{y_0}, (\sqrt{y_0})^{\odot -1}]$ if $\|\mathbb{1} - y_0\|_\infty < 1$, which can be achieved with a scaling factor $\alpha < 1/\|w\|_\infty$.
As the initial iterate was scaled, the fixed point of the iteration is $v_*^+ = \sqrt{\alpha} \cdot \sqrt{w}$ and $v_*^- = (1/\sqrt{\alpha}) \cdot (\sqrt{w})^{\odot -1}$.
Thus the final result is $\sqrt{w} = (1/\sqrt{\alpha}) \cdot v_*^+$ and $(\sqrt{w})^{\odot -1} = \sqrt{\alpha} \cdot v_*^-$.
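The corresponding sketch of the coupled Newton-Schulz iteration, again with numpy arrays standing in for low-rank tensors; the scaling factor $0.99/\|w\|_\infty$ is one admissible choice:

```python
import numpy as np

def hadamard_sqrt(w, tol=1e-12, max_iter=100):
    """Coupled Newton-Schulz iteration: returns (sqrt(w), 1/sqrt(w)) pointwise."""
    alpha = 0.99 / np.max(np.abs(w))   # scaling so that ||1 - alpha*w||_inf < 1
    y, z = alpha * w, np.ones_like(w)  # V0 = [alpha*w, 1]
    for _ in range(max_iter):
        a = 3.0 - z * y                # A(y, z) = 3*1 - z.y
        y, z = 0.5 * y * a, 0.5 * a * z
        if np.max(np.abs(z * y - 1.0)) < tol:
            break
    # Undo the scaling: y -> sqrt(alpha)*sqrt(w), z -> (1/sqrt(alpha))*w^(-1/2).
    return y / np.sqrt(alpha), z * np.sqrt(alpha)

w = np.random.default_rng(4).uniform(0.1, 3.0, size=(5, 5))
s, s_inv = hadamard_sqrt(w)
print(np.max(np.abs(s - np.sqrt(w))), np.max(np.abs(s_inv - 1.0 / np.sqrt(w))))
```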
23. Computing log(w)
Assume $w > 0$.
See the review of methods in Higham'01 and Higham'12.
For the algorithms to work well, $w$ has to be close to the identity $\mathbb{1}$, which can be achieved by taking roots: for $\lambda > 0$ one has $\log w^{\odot \lambda} = \lambda \log w$.
Truncated Taylor series (radius of convergence $\|x\|_\infty < 1$):
$$ \log(\mathbb{1} - x) = -\sum_{n=1}^{\infty} \frac{1}{n} \cdot x^{\odot n}, $$
where $x := \mathbb{1} - w$. If $w$ is not near the identity, one may use the relation $\log(w) = 2^k \log(w^{\odot 1/2^k})$, where $w^{\odot 1/2^k} \to \mathbb{1}$ as $k$ increases.
24. Computing the power function $w \mapsto w^{\odot m}$

Denoting the power function by $w \mapsto w^{\odot m} = \Psi_{pow}(m, w)$, one also wants to use it for negative powers; for $m < 0$ this is simply $\Psi_{pow}(m, w) = \Psi_{pow}(-m, w^{\odot -1})$.
The recursive formula is
$$ \Psi_{pow}(m, w) = \begin{cases} w \odot \Psi_{pow}(m-1, w), & m > 1 \text{ and odd}; \\ \Psi_{pow}\left(\frac{m}{2}, w\right) \odot \Psi_{pow}\left(\frac{m}{2}, w\right), & m \text{ even}; \\ w, & m = 1. \end{cases} \qquad (14) $$
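A direct transcription of recursion (14) into Python (the even branch computes the half power once and squares it, which is the point of the recursion):

```python
import numpy as np

def hadamard_pow(m, w):
    """Binary exponentiation with Hadamard (pointwise) products."""
    if m < 0:
        return hadamard_pow(-m, 1.0 / w)   # negative power via pointwise inverse
    if m == 1:
        return w
    if m % 2 == 0:
        half = hadamard_pow(m // 2, w)     # compute the half power only once
        return half * half
    return w * hadamard_pow(m - 1, w)      # m > 1 and odd

w = np.array([1.5, 2.0, 0.5])
print(hadamard_pow(5, w), w ** 5)          # identical results
```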
25. Computing $w^{\odot \frac{1}{m}}$

See Section 7 in N. Higham's book. Assume $w \ge 0$.
Newton's method for $F(x) = x^{\odot m} - w = 0$: with $v_0 = w$, the iteration function reads
$$ \Psi_{m\text{-root}}(v) = \frac{1}{m} \left( (m-1) \cdot v + \Psi_{pow}(1-m, v) \odot v_0 \right). \qquad (15) $$
If $m \ge 2$, this involves a negative power $v^{\odot (1-m)} = \Psi_{pow}(1-m, v)$.
The algorithm converges for all $w \ge 0$.
26. Computing $w^{\odot \frac{1}{m}}$ (coupled iteration)

With the auxiliary function $A(y, z) = (1/m) \cdot ((m+1) \cdot \mathbb{1} - z)$:
$$ \Psi_{m\text{-root}} \begin{bmatrix} y \\ z \end{bmatrix} = \begin{bmatrix} y \odot A(y, z) \\ \Psi_{pow}(m, A(y, z)) \odot z \end{bmatrix}, \qquad (16) $$
where $y_i \to w^{\odot -\frac{1}{m}}$ and $z_i \to w^{\odot \frac{1}{m}}$.
The starting values are $V_0 = [y_0, z_0] = [\alpha \cdot \mathbb{1}, \alpha^m \cdot w] \in \mathcal{T}^2$, with $\alpha \le (\|w\|_\infty / \sqrt{2})^{-\frac{1}{m}}$.
For scaling purposes it is best used with $m = 2^k$.
27. Computing $w^{\odot \frac{1}{m}}$ via Tsai's algorithm

Another way of computing the $m$-th root is Tsai's algorithm (Tsai'88, Lorin'21), which uses the auxiliary function $B(y) = (2 \cdot \mathbb{1} + (m-2) \cdot y) \odot (\mathbb{1} + (m-1) \cdot y)^{\odot -1}$:
$$ \Psi_{Tsai} \begin{bmatrix} y \\ z \end{bmatrix} = \begin{bmatrix} y \odot \Psi_{pow}(m, B(y)) \\ z \odot B(y) \end{bmatrix}, \qquad (17) $$
with starting value $V_0 = [w, \mathbb{1}]$. Then $z_i \to w^{\odot \frac{1}{m}}$.
28. Computing log(w) via Gregory’s series
The series converges for all $w > 0$. Setting $z = (\mathbb{1} - w) \odot (\mathbb{1} + w)^{\odot -1}$, one has
$$ \log w = -2 \sum_{k=0}^{\infty} \frac{1}{2k+1} \cdot z^{\odot (2k+1)}. \qquad (18) $$
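A sketch of the truncated Gregory series with numpy (the number of terms is an arbitrary choice; convergence slows as entries of $w$ move away from 1):

```python
import numpy as np

def hadamard_log(w, terms=60):
    """Pointwise log via Gregory's series; valid for w > 0 (fast for w near 1)."""
    z = (1.0 - w) / (1.0 + w)              # z = (1 - w) . (1 + w)^(-1)
    acc, zk = np.zeros_like(w), z.copy()   # zk holds z^(2k+1)
    z2 = z * z
    for k in range(terms):
        acc += zk / (2 * k + 1)
        zk *= z2
    return -2.0 * acc

w = np.random.default_rng(5).uniform(0.2, 5.0, size=(4, 4))
print(np.max(np.abs(hadamard_log(w) - np.log(w))))
```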
29. Computing exp(w)

See the book of N. Higham, Chapter 10 (scaling and squaring):
$$ u_{r,s} = \left( \sum_{k=0}^{r} \frac{1}{k!\, s^k}\, w^{\odot k} \right)^{\odot s}. \qquad (19) $$
Here $\lim_{r\to\infty} u_{r,s} = \lim_{s\to\infty} u_{r,s} = \exp w$.
It is advantageous to choose $s$ from the powers of two, $s = 1, 2, 4, \ldots, 2^k$; then the $s$-th power can be computed by repeated squaring.
For the scaling, the best choice is $\alpha \ge \|w\|_\infty$.
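A sketch of formula (19) with $s = 2^k$ and repeated squaring (numpy; the truncation order $r = 12$ and the choice of $k$ are illustrative):

```python
import numpy as np
from math import factorial

def hadamard_exp(w, r=12):
    """exp(w) pointwise via truncated series plus scaling and squaring."""
    k = max(0, int(np.ceil(np.log2(max(np.max(np.abs(w)), 1.0)))))
    s = 2 ** k                              # scaling s = 2^k >= ||w||_inf
    u = sum((w / s) ** j / factorial(j) for j in range(r + 1))
    for _ in range(k):                      # s-th power by repeated squaring
        u = u * u
    return u

w = np.random.default_rng(6).uniform(-3.0, 3.0, size=(4, 4))
print(np.max(np.abs(hadamard_exp(w) - np.exp(w))))
```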
30. Series of numerical examples
1. The KLD is computed with the analytical formula and with the AMEn cross algorithm from the TT-toolbox.
2. The Hellinger distance is computed with the well-known analytical formulas and with the AMEn cross algorithm.
3. Where the pdf is not known analytically, the $d$-variate elliptically contoured α-stable distributions are chosen and accessed via their pcfs.
4. The KLD and Hellinger distances are computed for different values of $d$, $n$, and the parameter α.
31. Example 1: KLD for two Gaussian distributions
$\mathcal{N}_1 := \mathcal{N}(\mu_1, C_1)$ and $\mathcal{N}_2 := \mathcal{N}(\mu_2, C_2)$, where $C_1 := \sigma_1^2 I$, $C_2 := \sigma_2^2 I$, $\mu_1 = (1.1, \ldots, 1.1)$ and $\mu_2 = (1.4, \ldots, 1.4) \in \mathbb{R}^d$, $d = \{16, 32, 64\}$, $\sigma_1 = 1.5$, and $\sigma_2 = 22.1$.
The well-known analytical formula is
$$ 2 D_{KL}(\mathcal{N}_1 \| \mathcal{N}_2) = \operatorname{tr}(C_2^{-1} C_1) + (\mu_2 - \mu_1)^T C_2^{-1} (\mu_2 - \mu_1) - d + \log \frac{|C_2|}{|C_1|}. $$
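For reference, a direct evaluation of this formula in the isotropic case $C_i = \sigma_i^2 I$ (for $d = 16$ it reproduces the value 35.08 reported in the next slide):

```python
import numpy as np

def kld_gauss_iso(mu1, mu2, s1, s2, d):
    """KLD between N(mu1, s1^2 I) and N(mu2, s2^2 I) in R^d (isotropic case)."""
    diff2 = np.sum((np.asarray(mu2) - np.asarray(mu1)) ** 2)
    # tr(C2^-1 C1) = d*s1^2/s2^2, log(|C2|/|C1|) = 2*d*log(s2/s1)
    return 0.5 * (d * s1**2 / s2**2 + diff2 / s2**2 - d + 2 * d * np.log(s2 / s1))

d = 16
print(kld_gauss_iso(np.full(d, 1.1), np.full(d, 1.4), 1.5, 22.1, d))  # ~35.08
```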
32. Comparison of KLDs computed via two methods
$D_{KL}$ computed via TT tensors (AMEn algorithm) and via the analytical formula for various values of $d$. The TT tolerance (the stopping difference between consecutive iterations) is $10^{-6}$.

d                  16       32       64
n                  2048     2048     2048
D_KL (exact)       35.08    70.16    140.32
D_KL (approx.)     35.08    70.16    140.32
abs. error         4.0e-7   2.43e-5  1.4e-5
rel. error         1.1e-8   3.46e-8  8.1e-8
comp. time, sec.   1.0      5.0      18.7
34. Hellinger distance $D_H$

$D_H$ is computed via TT tensors (AMEn) and via the analytical formula. TT tolerance $= 10^{-6}$.

d                  16        32        64
n                  2048      2048      2048
D_H (exact)        0.99999   0.99999   0.99999
D_H (approx.)      0.99992   0.99999   0.99999
abs. error         3.5e-5    7.1e-5    1.4e-4
rel. error         2.5e-5    5.0e-5    1.0e-4
comp. time, sec.   1.7       7.5       30.5

The AMEn algorithm is able to compute the Hellinger distance $D_H$ between two multivariate Gaussian distributions for $d = \{16, 32, 64\}$ and $n = 2048$. The exact and approximate values agree closely, and the error is small.
35. Example 3: α-stable distribution
The pcf of a $d$-variate elliptically contoured α-stable distribution is given by
$$ \varphi_\xi(t) = \exp\left( \mathrm{i}\langle t|\mu\rangle - \langle t|Ct\rangle^{\alpha/2} \right). $$
AMEn tolerance $= 10^{-9}$.
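For intuition, in low dimensions the pdf can be recovered from this pcf by an inverse FFT on the grid, which is exactly the operation the low-rank approach factorizes in high dimensions. A 2D sketch with illustrative grid parameters ($\mu = 0$, $C = I$):

```python
import numpy as np

d, n, alpha = 2, 256, 1.5
L = 20.0                                    # spatial grid [-L/2, L/2) per direction
t = 2 * np.pi * np.fft.fftfreq(n, d=L / n)  # dual (frequency) grid
T = np.meshgrid(*([t] * d), indexing='ij')

# pcf of an elliptically contoured alpha-stable law with mu = 0, C = I:
# phi(t) = exp(-(<t|t>)^(alpha/2))
phi = np.exp(-sum(Ti**2 for Ti in T) ** (alpha / 2))

# pdf on the spatial grid via the d-dimensional inverse Fourier transform.
pdf = np.fft.fftshift(np.fft.ifftn(phi).real) * (n / L) ** d

print(pdf.sum() * (L / n) ** d)  # should be close to 1
```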
37. Full storage vs low-rank
$d = 32$ and $n = 256$ mean that the amount of data in full storage mode would be $N = n^d = 256^{32} \approx 1.16 \cdot 10^{77}$ entries, i.e. about $10^{78}$ bytes.
In TT low-rank approximation it is ca. 626 MB and fits on a laptop.
Assuming a 1 GHz notebook, the KLD computation in full mode would require ca. $1.2 \cdot 10^{68}$ sec, or more than $3 \cdot 10^{60}$ years. Even with a perfect speed-up on a parallel supercomputer with, say, $1{,}000{,}000$ processors, this would still require more than $3 \cdot 10^{54}$ years; compare this with the estimated age of the universe of ca. $1.4 \cdot 10^{10}$ years.
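A back-of-the-envelope check of these figures (the TT rank $r = 98$ below is a hypothetical value chosen to reproduce the quoted ~626 MB, not a number from the talk):

```python
d, n = 32, 256
full_entries = n ** d                        # n^d, exact integer in Python
print(f"full: {float(full_entries):.2e} entries, {float(full_entries) * 8:.1e} bytes")

r = 98  # hypothetical maximal TT rank, chosen to match the quoted ~626 MB
tt_bytes = d * n * r**2 * 8                  # O(d*n*r^2) doubles
print(f"TT:   {tt_bytes / 1e6:.0f} MB")

# Time in full mode at 10^9 operations/sec, one operation per entry:
print(f"time: {float(full_entries) / 1e9:.1e} sec")
```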
38. Example 4: $D_{KL}(\alpha_1, \alpha_2)$ between two α-stable distributions

for pairs $(\alpha_1, \alpha_2)$ and fixed $d = 8$, $n = 64$:

(α1, α2)            (2.0, 0.5)  (2.0, 1.0)  (2.0, 1.5)  (2.0, 1.9)  (1.5, 1.4)  (1.0, 0.4)
D_KL(α1, α2)        2.27        0.66        0.3         0.03        0.031       0.6
comp. time, sec.    8.4         7.8         7.5         8.5         11          8.7
max. TT rank        78          74          76          76          80          79
memory, MB          28.5        28.5        27.1        28.5        35          29.5

$\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$. AMEn tolerance $= 10^{-12}$.
39. Example 5: Hellinger distance $D_H(\alpha_1, \alpha_2)$

for the $d$-variate elliptically contoured α-stable distributions with $\alpha_1 = 1.5$ and $\alpha_2 = 0.9$ for different $d$ and $n$; $\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$.

d                  16     16     16     16     16     16     32     32     32     32
n                  8      16     32     64     128    256    16     32     64     128
D_H(1.5, 0.9)      0.218  0.223  0.223  0.223  0.219  0.223  0.180  0.176  0.175  0.176
comp. time, sec.   2.8    3.7    7.5    19     53     156    11     21     62     117
max. TT rank       79     76     76     76     79     76     75     71     75     74
memory, MB         7.7    17     34     71     145    283    34     66     144    285

AMEn tolerance is $10^{-9}$.
40. Example 6: DH vs. TT (AMEn) tolerances
Computation of $D_H(\alpha_1, \alpha_2)$ between two α-stable distributions ($\alpha_1 = 1.5$, $\alpha_2 = 0.9$) for different AMEn tolerances; $n = 128$, $d = 32$, $\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$.

TT (AMEn) tolerance   1e-7    1e-8    1e-9    1e-10   1e-14
D_H(1.5, 0.9)         0.1645  0.1817  0.176   0.1761  0.1802
comp. time, sec.      43      86      103     118     241
max. TT rank          64      75      75      78      77
memory, MB            126     255     270     307     322
41. Conclusion
We provided numerical ways to compute
1. the entropy, KLD, and f-divergences in a low-rank tensor format;
2. the functions
$$ f(\cdot) \in \{\operatorname{sign}(\cdot),\ (\cdot)^{-1},\ \sqrt{\cdot},\ \sqrt[m]{\cdot},\ (\cdot)^k,\ \log(\cdot),\ \exp(\cdot),\ (\cdot)^2,\ |\cdot|\} $$
of the pcf and pdf;
3. low-rank approximations, which help to reduce the complexity and storage from exponential $\mathcal{O}(n^d)$ to linear, e.g. $\mathcal{O}(d n r^2)$ for the TT format.
42. Literature
1. A. Litvinenko, Y. Marzouk, H.G. Matthies, M. Scavino, A. Spantini, Computing f-divergences and distances of high-dimensional probability density functions – low-rank tensor approximations, Numer. Linear Algebra Appl., 2022, e2467, https://doi.org/10.1002/nla.2467
2. M. Espig, W. Hackbusch, A. Litvinenko, H.G. Matthies, E. Zander, Iterative algorithms for the post-processing of high-dimensional data, J. Comput. Phys., 410, 109396, 2020, https://doi.org/10.1016/j.jcp.2020.109396
3. S. Dolgov, A. Litvinenko, D. Liu, Kriging in tensor train data format, Proc. 3rd Int. Conf. on Uncertainty Quantification in CSE (UNCECOMP 2019), pp. 309-329, 2019, https://files.eccomasproceedia.org/papers/e-books/uncecomp_2019.pdf
43. Literature
4. A. Litvinenko, D. Keyes, V. Khoromskaia, B.N. Khoromskij, H.G. Matthies, Tucker tensor analysis of Matérn functions in spatial statistics, Computational Methods in Applied Mathematics, 19(1), pp. 101-122, 2019, https://doi.org/10.1515/cmam-2018-0022
5. A. Litvinenko, R. Kriemann, M.G. Genton, Y. Sun, D.E. Keyes, HLIBCov: Parallel hierarchical matrix approximation of large covariance matrices and likelihoods with applications in parameter identification, MethodsX, 7, 100600, 2020, https://doi.org/10.1016/j.mex.2019.07.001
6. A. Litvinenko, Y. Sun, M.G. Genton, D.E. Keyes, Likelihood approximation with hierarchical matrices for large spatial datasets, Computational Statistics & Data Analysis, 137, pp. 115-132, 2019, https://doi.org/10.1016/j.csda.2019.02.002