
A note on the density of Gumbel-softmax

This note explicates some details of the discussion given in Appendix B of E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. ICLR, 2017.


Tomonari MASADA @ Nagasaki University
May 29, 2019

This note explicates some details of the discussion given in Appendix B of [1]. The Gumbel-softmax trick gives a $k$-dimensional sample vector $y = (y_1, \dots, y_k) \in \Delta^{k-1}$ whose entries are obtained as
\[
y_i = \frac{\exp((\log(\pi_i) + g_i)/\tau)}{\sum_{j=1}^{k} \exp((\log(\pi_j) + g_j)/\tau)} \quad \text{for } i = 1, \dots, k, \tag{1}
\]
by using $g_1, \dots, g_k$, which are i.i.d. samples drawn from $\mathrm{Gumbel}(0, 1)$. Define $x_i = \log(\pi_i)$. Then $y$ is rewritten as
\[
y_i = \frac{\exp((x_i + g_i)/\tau)}{\sum_{j=1}^{k} \exp((x_j + g_j)/\tau)} \quad \text{for } i = 1, \dots, k. \tag{2}
\]
Divide both numerator and denominator by $\exp((x_k + g_k)/\tau)$:
\[
y_i = \frac{\exp((x_i + g_i - (x_k + g_k))/\tau)}{\sum_{j=1}^{k} \exp((x_j + g_j - (x_k + g_k))/\tau)} \quad \text{for } i = 1, \dots, k. \tag{3}
\]
Define $u_i = x_i + g_i - (x_k + g_k)$ for $i = 1, \dots, k-1$, where $g_i \sim \mathrm{Gumbel}(0, 1)$. When $g_k$ is given, $g_i = u_i - x_i + (x_k + g_k)$, so $u_i$ can be regarded as a sample from the Gumbel distribution whose location parameter is $x_i - (x_k + g_k)$ and whose scale parameter is 1. Therefore,
\[
p(u_i \mid g_k) = e^{-\left\{ (u_i - x_i + (x_k + g_k)) + e^{-(u_i - x_i + (x_k + g_k))} \right\}}.
\]
Consequently, the density $p(u_1, \dots, u_{k-1})$ is given as follows:
\[
\begin{aligned}
p(u_1, \dots, u_{k-1})
&= \int_{-\infty}^{\infty} dg_k \, p(u_1, \dots, u_{k-1} \mid g_k) \, p(g_k)
 = \int_{-\infty}^{\infty} dg_k \, p(g_k) \prod_{i=1}^{k-1} p(u_i \mid g_k) \\
&= \int_{-\infty}^{\infty} dg_k \, e^{-g_k - e^{-g_k}} \prod_{i=1}^{k-1} e^{x_i - u_i - x_k - g_k - e^{x_i - u_i - x_k - g_k}}
\end{aligned} \tag{4}
\]
Perform a change of variables with $v = e^{-g_k}$. Then $\frac{dv}{dg_k} = -e^{-g_k}$. Therefore, $dv = -e^{-g_k} \, dg_k$ and $dg_k = -dv \, e^{g_k} = -dv/v$.
\[
\begin{aligned}
p(u_1, \dots, u_{k-1})
&= \int_{\infty}^{0} (-dv) \, e^{-v} \prod_{i=1}^{k-1} v \, e^{x_i - u_i - x_k - v e^{x_i - u_i - x_k}} \\
&= \prod_{i=1}^{k-1} e^{x_i - u_i - x_k} \int_{0}^{\infty} dv \, e^{-v} v^{k-1} \prod_{i=1}^{k-1} e^{-v e^{x_i - u_i - x_k}} \\
&= e^{-(k-1)x_k} \prod_{i=1}^{k-1} e^{x_i - u_i} \int_{0}^{\infty} dv \, v^{k-1} e^{-v \left( 1 + e^{-x_k} \sum_{i=1}^{k-1} e^{x_i - u_i} \right)}
\end{aligned} \tag{5}
\]
Recall the following fact related to the Gamma integral:
\[
\int_{0}^{\infty} x^{z-1} e^{-ax} \, dx
= \int_{0}^{\infty} \left( \frac{y}{a} \right)^{z-1} e^{-y} \, \frac{dy}{a}
= \frac{1}{a^{z}} \int_{0}^{\infty} y^{z-1} e^{-y} \, dy
= a^{-z} \, \Gamma(z) \tag{6}
\]
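As a quick numerical sanity check of Eqs. (1) to (3), the following sketch draws $g_1, \dots, g_k$ from Gumbel(0, 1), computes $y$ directly via Eq. (2), and confirms that the same $y$ is recovered as the tempered softmax of $u_i = x_i + g_i - (x_k + g_k)$ with $u_k = 0$. It is not part of the original note; it assumes NumPy is available, and the variable and function names are illustrative.

```python
import numpy as np

# Check that Eqs. (2) and (3) produce the same sample y:
# softmax((x + g)/tau) == softmax(u/tau), where u_i = x_i + g_i - (x_k + g_k) and u_k = 0.

rng = np.random.default_rng(0)
k, tau = 5, 0.5
pi = rng.dirichlet(np.ones(k))               # class probabilities pi_1, ..., pi_k
x = np.log(pi)                               # x_i = log(pi_i)
g = rng.gumbel(loc=0.0, scale=1.0, size=k)   # i.i.d. Gumbel(0, 1) samples

def softmax(z):
    z = z - z.max()                          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

y_direct = softmax((x + g) / tau)            # Eq. (2)
u = (x + g) - (x[-1] + g[-1])                # u_i of Eq. (3); u_k = 0 by construction
y_via_u = softmax(u / tau)                   # Eq. (3)

print(np.allclose(y_direct, y_via_u))        # expected: True
```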
Applying the Gamma-integral identity (6) with $z = k$ and $a = 1 + e^{-x_k} \sum_{i=1}^{k-1} e^{x_i - u_i}$ to the last line of (5), we obtain
\[
\begin{aligned}
p(u_1, \dots, u_{k-1})
&= e^{-(k-1)x_k} \prod_{i=1}^{k-1} e^{x_i - u_i} \int_{0}^{\infty} dv \, v^{k-1} e^{-v \left( 1 + e^{-x_k} \sum_{i=1}^{k-1} e^{x_i - u_i} \right)} \\
&= e^{-k x_k} e^{x_k} \prod_{i=1}^{k-1} e^{x_i - u_i} \left( 1 + e^{-x_k} \sum_{i=1}^{k-1} e^{x_i - u_i} \right)^{-k} \Gamma(k) \\
&= e^{x_k} \prod_{i=1}^{k-1} e^{x_i - u_i} \left( e^{x_k} + \sum_{i=1}^{k-1} e^{x_i - u_i} \right)^{-k} \Gamma(k) \\
&= \exp\left( x_k + \sum_{i=1}^{k-1} (x_i - u_i) \right) \left( e^{x_k} + \sum_{i=1}^{k-1} e^{x_i - u_i} \right)^{-k} \Gamma(k)
\end{aligned} \tag{7}
\]
Define $u_k = 0$. Then
\[
p(u_1, \dots, u_{k-1}) = \Gamma(k) \prod_{i=1}^{k} \exp(x_i - u_i) \left( \sum_{i=1}^{k} \exp(x_i - u_i) \right)^{-k} \tag{8}
\]
A $k$-dimensional sample vector $y = (y_1, \dots, y_k) \in \Delta^{k-1}$ is obtained from $u_1, \dots, u_{k-1}$ by applying a deterministic transformation $h$:
\[
h_i(u_1, \dots, u_{k-1}) = \frac{\exp(u_i/\tau)}{1 + \sum_{j=1}^{k-1} \exp(u_j/\tau)} \quad \text{for } i = 1, \dots, k-1 \tag{9}
\]
as follows:
\[
y_i = h_i(u_1, \dots, u_{k-1}) \quad \text{for } i = 1, \dots, k-1. \tag{10}
\]
Note that $y_k$ is fixed given $y_1, \dots, y_{k-1}$:
\[
y_k = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(u_j/\tau)} = 1 - \sum_{j=1}^{k-1} y_j \tag{11}
\]
By using the change of variables we can obtain the density function for $y$:
\[
p(y) = p\bigl( h^{-1}(y_1, \dots, y_{k-1}) \bigr) \left| \det \frac{\partial \bigl( h_1^{-1}(y_1, \dots, y_{k-1}), \dots, h_{k-1}^{-1}(y_1, \dots, y_{k-1}) \bigr)}{\partial (y_1, \dots, y_{k-1})} \right| \tag{12}
\]
The inverse of $h$ is obtained as follows:
\[
\begin{aligned}
y_i &= \frac{\exp(u_i/\tau)}{1 + \sum_{j=1}^{k-1} \exp(u_j/\tau)} \\
y_i &= y_k \exp(u_i/\tau) \quad \text{from } y_k = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(u_j/\tau)} \\
\log y_i &= \log y_k + u_i/\tau \\
\therefore \quad h_i^{-1}(y_1, \dots, y_{k-1}) = u_i &= \tau \times (\log y_i - \log y_k) = \tau \times \left( \log y_i - \log\left( 1 - \sum_{j=1}^{k-1} y_j \right) \right)
\end{aligned} \tag{13}
\]
Therefore, we obtain the Jacobian:
\[
\frac{\partial h_i^{-1}(y_1, \dots, y_{k-1})}{\partial y_i}
= \tau \times \left( \frac{1}{y_i} - \frac{1}{y_k} \frac{\partial y_k}{\partial y_i} \right)
= \tau \times \left( \frac{1}{y_i} + \frac{1}{y_k} \right) \tag{14}
\]
\[
\frac{\partial h_i^{-1}(y_1, \dots, y_{k-1})}{\partial y_j}
= \tau \times \left( - \frac{1}{y_k} \frac{\partial y_k}{\partial y_j} \right)
= \tau \times \frac{1}{y_k} \quad \text{for } j \neq i, \tag{15}
\]
where $\partial y_k / \partial y_i = -1$ follows from $y_k = 1 - \sum_{j=1}^{k-1} y_j$ in Eq. (11). Eqs. (24) to (28) in Appendix B of [1] are then easy to understand.
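As a closing sanity check of Eqs. (13) to (15), the sketch below compares the analytic Jacobian entries with central finite differences of the inverse map $h^{-1}$. It is not part of the original note; it assumes NumPy is available, and the helper names are illustrative.

```python
import numpy as np

# Finite-difference check of the Jacobian of Eq. (13):
# u_i = tau * (log y_i - log y_k), with y_k = 1 - sum_{j<k} y_j.
# Eq. (14): diagonal entries tau * (1/y_i + 1/y_k); Eq. (15): off-diagonal entries tau / y_k.

def h_inv(y_head, tau):
    """u_i for i = 1, ..., k-1, where y_head = (y_1, ..., y_{k-1})."""
    y_k = 1.0 - y_head.sum()
    return tau * (np.log(y_head) - np.log(y_k))

def analytic_jacobian(y_head, tau):
    """Jacobian from Eqs. (14) and (15): tau/y_k everywhere, plus tau/y_i on the diagonal."""
    y_k = 1.0 - y_head.sum()
    m = y_head.size                          # m = k - 1
    J = np.full((m, m), tau / y_k)
    J[np.diag_indices(m)] += tau / y_head
    return J

rng = np.random.default_rng(0)
tau = 0.7
y = rng.dirichlet(np.ones(4))                # a random point on the simplex, k = 4
y_head = y[:-1]                              # (y_1, ..., y_{k-1})

m, eps = y_head.size, 1e-6
J_num = np.zeros((m, m))
for j in range(m):                           # central differences, one column per y_j
    e = np.zeros(m)
    e[j] = eps
    J_num[:, j] = (h_inv(y_head + e, tau) - h_inv(y_head - e, tau)) / (2 * eps)

print(np.max(np.abs(J_num - analytic_jacobian(y_head, tau))))   # expected: a tiny value
```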

References

[1] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. ICLR, 2017.