Computing f-Divergences and Distances of High-Dimensional Probability Density Functions

A. Litvinenko¹, H.G. Matthies², Y. Marzouk³, M. Scavino⁴, A. Spantini³
¹ RWTH Aachen, ² TU Braunschweig, ³ MIT, ⁴ Universidad de la República (IESTA), Montevideo
litvinenko@uq.rwth-aachen.de
1. Motivation
How can one compute the entropy, the Kullback–Leibler divergence (KLD), and other divergences when the probability density function (pdf) is not available? We present two ways to compute f-divergences, the KLD, and the entropy.
2. Connection of pcf and pdf
The probability characteristic function (pcf) $\varphi_\xi$ is defined by
$$\varphi_\xi(t) := \mathbb{E}\big(\exp(\mathrm{i}\langle\xi \mid t\rangle)\big) = \int_{\mathbb{R}^d} p_\xi(y)\,\exp(\mathrm{i}\langle y \mid t\rangle)\,\mathrm{d}y =: \mathcal{F}^{[d]}(p_\xi)(t),$$
where $t = (t_1, t_2, \dots, t_d) \in \mathbb{R}^d$ is the dual variable to $y \in \mathbb{R}^d$, $\langle y \mid t\rangle = \sum_{j=1}^{d} y_j t_j$ is the canonical inner product on $\mathbb{R}^d$, and $\mathcal{F}^{[d]}$ is the probabilist's $d$-dimensional Fourier transform. The pdf is recovered by the inverse transform
$$p_\xi(y) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \exp(-\mathrm{i}\langle t \mid y\rangle)\,\varphi_\xi(t)\,\mathrm{d}t = \mathcal{F}^{[-d]}(\varphi_\xi)(y). \quad (1)$$
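In one dimension, relation (1) can be checked numerically. A minimal sketch, assuming the standard normal (whose pcf is $\exp(-t^2/2)$) and a simple Riemann-sum quadrature of the inverse Fourier integral:

```python
import numpy as np

# Sketch: recover a pdf from its pcf via Eq. (1) in 1-D.
# Example pcf: standard normal, phi(t) = exp(-t^2/2).
t = np.linspace(-10, 10, 2001)      # quadrature grid for the dual variable t
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2)

def pdf_from_pcf(y):
    # p(y) = (1/2pi) * integral of exp(-i t y) phi(t) dt  (Riemann sum)
    return (np.exp(-1j * t * y) * phi).sum().real * dt / (2 * np.pi)

y = 0.7
exact = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1) density at y
print(abs(pdf_from_pcf(y) - exact))              # tiny quadrature error
```

Because the Gaussian pcf decays fast, the truncated grid $[-10, 10]$ already gives near machine-precision accuracy here.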
3. Discrete low-rank representation of pcf
[Figure: a full tensor $w \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ (left) represented as a sum of tensor products of vectors $w_{i,k} \in \mathbb{R}^{n_k}$, $i = 1,\dots,r$, $k = 1,2,3$ (right).]
We seek an approximation
$$\varphi_\xi(t) \approx \tilde{\varphi}_\xi(t) = \sum_{\ell=1}^{R} \bigotimes_{\nu=1}^{d} \varphi_{\ell,\nu}(t_\nu), \quad (2)$$
where the $\varphi_{\ell,\nu}(t_\nu)$ are one-dimensional functions, and then
$$p_\xi(y) \approx \tilde{p}_\xi(y) = \mathcal{F}^{[-d]}(\tilde{\varphi}_\xi)(y) = \sum_{\ell=1}^{R} \bigotimes_{\nu=1}^{d} \mathcal{F}_1^{-1}(\varphi_{\ell,\nu})(y_\nu),$$
where $\mathcal{F}_1^{-1}$ is the one-dimensional inverse Fourier transform.
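The key point is that the inverse transform acts factor-by-factor on each separable term, so only one-dimensional transforms are ever needed. A small discrete illustration for $d = 2$ and a single rank-1 term (the factor functions below are arbitrary choices for the sketch):

```python
import numpy as np

# The d-dim inverse FFT of a separable (rank-1) term factors into 1-D
# inverse FFTs, so a rank-R sum needs only R*d 1-D transforms instead
# of one full d-dimensional transform.
n = 64
a = np.exp(-np.linspace(-3, 3, n)**2)     # factor along axis 1
b = np.cos(np.linspace(0, np.pi, n))      # factor along axis 2
full = np.outer(a, b)                     # rank-1 2-D grid function

direct = np.fft.ifft2(full)                          # full 2-D inverse FFT
factored = np.outer(np.fft.ifft(a), np.fft.ifft(b))  # two 1-D inverse FFTs

print(np.max(np.abs(direct - factored)))  # agreement to machine precision
```

The identity is exact: the discrete inverse Fourier kernel itself is separable, so the double sum splits into a product of two single sums.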
The differential entropy requires the point-wise logarithm of $P$:
$$h(p) := \mathbb{E}_p(-\log p) \approx \mathbb{E}_P(-\log P) = -\frac{V}{N}\,\langle \log(P) \mid P\rangle,$$
where $P$ denotes the grid of pdf values, $N$ the number of grid points, and $V$ the volume of the computational domain. The f-divergences of $p$ from $q$ are then approximated discretely; for the KLD and the squared Hellinger distance,
$$D_{\mathrm{KL}} = \frac{V}{N}\big(\langle \log(P) \mid P\rangle - \langle \log(Q) \mid P\rangle\big), \qquad (D_H)^2 = \frac{V}{2N}\,\big\langle P^{1/2} - Q^{1/2} \;\big|\; P^{1/2} - Q^{1/2}\big\rangle.$$
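A minimal one-dimensional sketch of these discrete formulas, using full grids instead of low-rank tensor formats; $P$ and $Q$ are two unit-variance Gaussians, so the exact values are known in closed form:

```python
import numpy as np

# P, Q: pdf values on a regular grid of N points covering a volume V;
# the grid spacing plays the role of V/N in the formulas above.
y, step = np.linspace(-8, 8, 4001, retstep=True)
P = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)        # N(0, 1)
Q = np.exp(-(y - 1)**2 / 2) / np.sqrt(2 * np.pi)  # N(1, 1)

entropy = -step * np.dot(np.log(P), P)                           # h(p)
d_kl    =  step * (np.dot(np.log(P), P) - np.dot(np.log(Q), P))  # D_KL
d_h2    =  step / 2 * np.sum((np.sqrt(P) - np.sqrt(Q))**2)       # (D_H)^2

# Exact values: 0.5*log(2*pi*e) ~ 1.4189,  1/2,  1 - exp(-1/8) ~ 0.1175
print(entropy, d_kl, d_h2)
```

On a grid this fine the Riemann sums match the closed-form values to several digits; the low-rank machinery above replaces these dense grid vectors by tensor-format objects when $d$ is large.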
4. Numerics

Example 1: KLD for two Gaussian distributions $\mathcal{N}_1 := \mathcal{N}(\mu_1, C_1)$ and $\mathcal{N}_2 := \mathcal{N}(\mu_2, C_2)$, where $C_1 := \sigma_1^2 I$, $C_2 := \sigma_2^2 I$, $\mu_1 = (1.1, \dots, 1.1)$ and $\mu_2 = (1.4, \dots, 1.4) \in \mathbb{R}^d$, $d \in \{16, 32, 64\}$, $\sigma_1 = 1.5$, $\sigma_2 = 22.1$; AMEn tolerance $10^{-6}$. The well-known analytical formula is
$$2\,D_{\mathrm{KL}}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \operatorname{tr}(C_2^{-1} C_1) + (\mu_2 - \mu_1)^T C_2^{-1} (\mu_2 - \mu_1) - d + \log\frac{|C_2|}{|C_1|}.$$
The Hellinger distance for Gaussian distributions satisfies $D_H(\mathcal{N}_1, \mathcal{N}_2)^2 = 1 - K_{1/2}(\mathcal{N}_1, \mathcal{N}_2)$, where
$$K_{1/2}(\mathcal{N}_1, \mathcal{N}_2) = \frac{\det(C_1)^{1/4}\,\det(C_2)^{1/4}}{\det\big(\frac{C_1 + C_2}{2}\big)^{1/2}} \cdot \exp\left(-\frac{1}{8}\,(\mu_1 - \mu_2)^T \Big(\frac{C_1 + C_2}{2}\Big)^{-1} (\mu_1 - \mu_2)\right).$$

KLD results:
d            16       32       64
n            2048     2048     2048
D_KL         35.08    70.16    140.32
~D_KL        35.08    70.16    140.32
abs. error   4.0e-7   2.43e-5  1.4e-5
rel. error   1.1e-8   3.46e-8  8.1e-8
time, sec.   1.0      5.0      18.7

Hellinger results:
d            16       32       64
n            2048     2048     2048
D_H          0.9999   0.9999   0.9999
~D_H         0.9999   0.9999   0.9999
abs. error   3.5e-5   7.1e-5   1.4e-4
rel. error   2.5e-5   5.0e-5   1.0e-4
time, sec.   1.7      7.5      30.5

Example 2: $D_{\mathrm{KL}}(\alpha_1, \alpha_2)$ between two $\alpha$-stable distributions. The pcf of a $d$-variate elliptically contoured $\alpha$-stable distribution is given by
$$\varphi_\xi(t) = \exp\big(\mathrm{i}\langle t \mid \mu\rangle - \langle t \mid C t\rangle^{\alpha/2}\big).$$
We approximate $\varphi_\xi(t)$ in the TT format for pairs $(\alpha_1, \alpha_2)$ with fixed $d = 8$ and $n = 64$, $\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$, AMEn tolerance $10^{-12}$.

(α1, α2)           (2.0,0.5)  (2.0,1.0)  (2.0,1.5)  (2.0,1.9)  (1.5,1.4)  (1.0,0.4)
D_KL(α1, α2)       2.27       0.66       0.3        0.03       0.031      0.6
comp. time, sec.   8.4        7.8        7.5        8.5        11         8.7
max. TT rank       78         74         76         76         80         79
memory, MB         28.5       28.5       27.1       28.5       35         29.5

Example 3: Hellinger distance $D_H(\alpha_1, \alpha_2)$ for the $d$-variate elliptically contoured $\alpha$-stable distribution with $\alpha_1 = 1.5$ and $\alpha_2 = 0.9$ for different $d$ and $n$; $\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$, AMEn tolerance $10^{-9}$.

d                  16    16    16    16    16    16    32    32    32    32
n                  8     16    32    64    128   256   16    32    64    128
D_H(1.5, 0.9)      0.218 0.223 0.223 0.223 0.219 0.223 0.180 0.176 0.175 0.176
comp. time, sec.   2.8   3.7   7.5   19    53    156   11    21    62    117
max. TT rank       79    76    76    76    79    76    75    71    75    74
memory, MB         7.7   17    34    71    145   283   34    66    144   285

Example 4: Computation of $D_H(\alpha_1, \alpha_2)$ between two $\alpha$-stable distributions ($\alpha_1 = 1.5$ and $\alpha_2 = 0.9$) for different AMEn tolerances; $n = 128$, $d = 32$, $\mu_1 = \mu_2 = 0$, $C_1 = C_2 = I$.

TT (AMEn) tolerance   1e-7    1e-8    1e-9    1e-10   1e-14
D_H(1.5, 0.9)         0.1645  0.1817  0.176   0.1761  0.1802
comp. time, sec.      43      86      103     118     241
max. TT rank          64      75      75      78      77
memory, MB            126     255     270     307     322

Conclusion

We demonstrated:
1. numerical computation of high-dimensional pdfs, as well as their divergences and distances, with the pdf assumed to be discretised on a regular grid;
2. representation of pdfs and pcfs, and of some functions of them, in a low-rank tensor data format;
3. reduction of complexity and storage costs from exponential, $O(n^d)$, to linear, e.g. $O(d n r^2)$ for the TT format.

Acknowledgements: The research reported in this publication was partly supported by funding from the Alexander von Humboldt Foundation (AvH).

References
1. A. Litvinenko, Y. Marzouk, H.G. Matthies, M. Scavino, A. Spantini, Computing f-Divergences and Distances of High-Dimensional Probability Density Functions – Low-Rank Tensor Approximations, arXiv:2111.07164, 2021, https://arxiv.org/abs/2111.07164
2. M. Espig, W. Hackbusch, A. Litvinenko, H.G. Matthies, E. Zander, Iterative algorithms for the post-processing of high-dimensional data, Journal of Computational Physics, 410, 109396, 2020, ISSN 0021-9991
3. S. Dolgov, A. Litvinenko, D. Liu, Kriging in tensor train data format, UNCECOMP 2019, 3rd ECCOMAS Thematic Conference on Uncertainty Quantification in CSE, M. Papadrakakis, V. Papadopoulos, G. Stefanou (eds.), Crete, Greece, 24-26 June 2019
4. A. Litvinenko, D. Keyes, V. Khoromskaia, B.N. Khoromskij, H.G. Matthies, Tucker tensor analysis of Matérn functions in spatial statistics, Computational Methods in Applied Mathematics, 19(1), 101-122, 2019
5. A. Litvinenko, R. Kriemann, M.G. Genton, Y. Sun, D.E. Keyes, HLIBCov: Parallel hierarchical matrix approximation of large covariance matrices and likelihoods with applications in parameter identification, MethodsX, 7, 100600, 2020
6. A. Litvinenko, Y. Sun, M.G. Genton, D.E. Keyes, Likelihood approximation with hierarchical matrices for large spatial datasets, Computational Statistics & Data Analysis, 137, pp. 115-132, 2019
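As a sanity check of Example 1, the analytical Gaussian KLD formula reduces, for the isotropic covariances used there, to a few scalar operations; this sketch reproduces the $D_{\mathrm{KL}}$ row of the first table:

```python
import numpy as np

# Closed-form KL divergence between N(mu1*1, s1^2 I) and N(mu2*1, s2^2 I),
# specialising the analytical formula of Example 1 to isotropic covariances
# and constant mean vectors.
def kl_isotropic(mu1, mu2, s1, s2, d):
    tr = d * s1**2 / s2**2                # tr(C2^{-1} C1)
    quad = d * (mu2 - mu1)**2 / s2**2     # (mu2-mu1)^T C2^{-1} (mu2-mu1)
    logdet = d * np.log(s2**2 / s1**2)    # log(|C2| / |C1|)
    return 0.5 * (tr + quad - d + logdet)

for d in (16, 32, 64):
    print(d, round(kl_isotropic(1.1, 1.4, 1.5, 22.1, d), 2))
# prints 35.08, 70.16, 140.32 for d = 16, 32, 64, matching the table
```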