REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models

Sangwoo Mo

Slides for "REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models", NIPS 2017.

REBAR: Low-variance, unbiased gradient estimates
for discrete latent variable models
Sangwoo Mo
KAIST AI Lab.
November 29, 2017
Sangwoo Mo (KAIST AI Lab.) REBAR November 29, 2017 1 / 16

General Problem
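Let $z \sim p(z \mid \theta)$. We want to maximize
$$L(\theta) = \mathbb{E}_{p(z)}[f(z)],$$
where $f(z)$ is assumed not to depend on $\theta$.
Examples:
ELBO (KL term omitted): $L(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$
Policy gradient: $L(\theta) = \mathbb{E}_{p_\theta(\tau)}[R(\tau)]$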

General Problem
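Let $z \sim p(z \mid \theta)$. We want to maximize
$$L(\theta) = \mathbb{E}_{p(z)}[f(z)].$$
We want to optimize by gradient ascent, assuming $f(z)$ is differentiable. This requires computing
$$\frac{d}{d\theta} L(\theta) = \frac{d}{d\theta} \mathbb{E}_{p(z)}[f(z)].$$
Caveat: we cannot simply move $\frac{d}{d\theta}$ inside the expectation, since the distribution of $z$ depends on $\theta$.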

Background
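REINFORCE:
$$
\begin{aligned}
\frac{d}{d\theta} \mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta} \int f(z)\, p(z)\, dz \\
&= \int f(z)\, \frac{\partial}{\partial\theta} p(z)\, dz \\
&= \int f(z)\, \frac{\frac{\partial}{\partial\theta} p(z)}{p(z)}\, p(z)\, dz \\
&= \int f(z)\, \frac{\partial}{\partial\theta} \log p(z)\; p(z)\, dz \\
&= \mathbb{E}_{p(z)}\!\left[ f(z)\, \frac{\partial}{\partial\theta} \log p(z) \right]
\end{aligned}
$$
The estimator is unbiased, but its variance is too high.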
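As a rough illustration of the REINFORCE identity, the sketch below estimates the gradient for a single Bernoulli variable. The choices here are assumptions made for the example, not part of the derivation: $z \sim \mathrm{Bernoulli}(\theta)$ with $\theta = 0.3$, the toy objective $f(z) = (z - 0.45)^2$, and the helper name `reinforce_grad_samples`. For this setup the exact gradient is $f(1) - f(0) = 0.1$.

```python
import numpy as np

def f(z):
    # Toy objective (assumed for illustration).
    return (z - 0.45) ** 2

def reinforce_grad_samples(theta, n, rng):
    """Per-sample REINFORCE estimates f(z) * d/dtheta log p(z)."""
    z = (rng.uniform(size=n) < theta).astype(float)   # z ~ Bernoulli(theta)
    score = z / theta - (1.0 - z) / (1.0 - theta)     # d/dtheta log p(z)
    return f(z) * score

rng = np.random.default_rng(0)
samples = reinforce_grad_samples(theta=0.3, n=200_000, rng=rng)

grad_mean = samples.mean()     # Monte Carlo gradient estimate
grad_var = samples.var()       # per-sample variance of the estimator
true_grad = f(1.0) - f(0.0)    # exact gradient of theta*f(1) + (1-theta)*f(0)
print(grad_mean, true_grad, grad_var)
```

The sample mean should land close to the exact value, while the per-sample variance is several times larger than the squared gradient, which is the problem the rest of the slides address.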

Background
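Control variate: subtract a baseline $c$.
$$
\begin{aligned}
\frac{d}{d\theta} \mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta} \left( \mathbb{E}_{p(z,c)}[f(z) - c] + \mathbb{E}_{p(z,c)}[c] \right) \\
&= \mathbb{E}_{p(z,c)}\!\left[ (f(z) - c)\, \frac{\partial}{\partial\theta} \log p(z) \right] + \frac{\partial}{\partial\theta} \mathbb{E}_{p(z,c)}[c]
\end{aligned}
$$
Question: how to choose a proper $c$? A proper $c$ should (i) be correlated with $f(z)$, and (ii) if $c$ is independent of $\theta$ ($c \perp \theta$), the second term is eliminated.
Candidates:
a constant value, e.g. $\mathbb{E}_{p(z)}[f(z)]$
a linear approximation of $f$ around $\mathbb{E}_{p(z)}[z]$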
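The effect of a constant baseline can be checked numerically. The sketch below is an illustration under assumptions that are not in the slides: a Bernoulli variable with $\theta = 0.3$, the toy objective $f(z) = (z - 0.45)^2$, and a baseline $c$ fixed to the number $\mathbb{E}_{p(z)}[f(z)]$ and treated as a constant (so the second term of the identity drops out).

```python
import numpy as np

def f(z):
    # Toy objective (assumed for illustration).
    return (z - 0.45) ** 2

theta = 0.3
rng = np.random.default_rng(0)
z = (rng.uniform(size=200_000) < theta).astype(float)   # z ~ Bernoulli(theta)
score = z / theta - (1.0 - z) / (1.0 - theta)           # d/dtheta log p(z)

# Constant baseline: the exact mean of f, held fixed (not differentiated).
c = theta * f(1.0) + (1.0 - theta) * f(0.0)

plain = f(z) * score              # REINFORCE
with_baseline = (f(z) - c) * score  # REINFORCE with control variate c

mean_plain, var_plain = plain.mean(), plain.var()
mean_base, var_base = with_baseline.mean(), with_baseline.var()
print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators target the same gradient (0.1 for this toy problem); only the variance changes, because $\mathbb{E}_{p(z)}[c\, \partial_\theta \log p(z)] = 0$ for a constant $c$.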

Background
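Reparametrization trick: assume $z = g(\theta, \epsilon)$.
$$
\begin{aligned}
\frac{d}{d\theta} \mathbb{E}_{p(z)}[f(z)]
&= \frac{d}{d\theta} \int f(z)\, p(z)\, dz \\
&= \frac{d}{d\theta} \int f(g(\theta, \epsilon))\, p(\epsilon)\, d\epsilon \\
&= \int \frac{\partial f}{\partial g} \frac{\partial g}{\partial\theta}\, p(\epsilon)\, d\epsilon \\
&= \mathbb{E}_{p(\epsilon)}\!\left[ \frac{\partial f}{\partial g} \frac{\partial g}{\partial\theta} \right]
\end{aligned}
$$
It is unbiased and low variance, and it has been successful for continuous $z$. For example, a VAE assumes $z \sim \mathcal{N}(\mu, \sigma^2)$ and reparametrizes it as $z = \mu + \sigma\epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$.
However, it is not directly applicable to the discrete case.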
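A minimal sketch of the pathwise (reparametrization) estimator for the Gaussian case follows. The test function $f(z) = z^2$ and the values of $\mu$ and $\sigma$ are assumptions chosen so the exact answers are known: $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, hence the gradients are $2\mu$ and $2\sigma$.

```python
import numpy as np

mu, sigma = 1.5, 0.8              # assumed parameter values
rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)  # eps ~ N(0, 1)

z = mu + sigma * eps              # z = g(theta, eps)
df_dz = 2.0 * z                   # derivative of f(z) = z**2

# Chain rule: df/dg * dg/dtheta, with dz/dmu = 1 and dz/dsigma = eps.
grad_mu_samples = df_dz * 1.0
grad_sigma_samples = df_dz * eps

grad_mu = grad_mu_samples.mean()        # should be near 2 * mu
grad_sigma = grad_sigma_samples.mean()  # should be near 2 * sigma
print(grad_mu, grad_sigma)
```

The same construction fails for a discrete $z$ because $g$ would then be piecewise constant in $\theta$, so $\partial g / \partial\theta$ is zero almost everywhere.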

Background
Gumbel-softmax trick:
It is well known that $z \sim \mathrm{Cat}(\theta)$ is equivalent to
$$z = H(w) = \arg\max_i \left[ \log\theta_i - \log(-\log\epsilon_i) \right]$$
where $H$ is the hard argmax, $w = g(\theta, \epsilon)$, and $\epsilon_i \sim \mathrm{Uniform}(0, 1)$.
Instead of $H$, use the softmax $\sigma_\lambda(w)$ (with temperature $\lambda$).
Then $\sigma_\lambda(g(\theta, \epsilon))$ is a differentiable reparametrization of $z$.
It is low variance, but biased.
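The Gumbel-max construction and its softmax relaxation can be sketched as follows. The probability vector, the temperature $\lambda = 0.5$, and the function names are assumptions for illustration. The check is that hard samples $H(w)$ follow $\mathrm{Cat}(\theta)$ empirically and that the relaxed sample is a point on the simplex with the same argmax.

```python
import numpy as np

def gumbel_logits(theta, eps):
    """w_i = log theta_i - log(-log eps_i), i.e. log-probabilities plus Gumbel noise."""
    return np.log(theta) - np.log(-np.log(eps))

def softmax_temp(w, lam):
    """Softmax with temperature lam, computed stably along the last axis."""
    y = (w - w.max(axis=-1, keepdims=True)) / lam
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.3])                      # assumed category probabilities
eps = rng.uniform(1e-12, 1.0, size=(100_000, 3))       # eps_i ~ Uniform(0, 1)

w = gumbel_logits(theta, eps)
z = np.argmax(w, axis=-1)                              # hard sample H(w)
freq = np.bincount(z, minlength=3) / len(z)            # empirical category frequencies
relaxed = softmax_temp(w, lam=0.5)                     # differentiable surrogate
print(freq)
```

As $\lambda \to 0$ the relaxed sample approaches a one-hot vector; a larger $\lambda$ gives smoother gradients at the cost of more bias.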

REBAR
Motivation:
Gumbel-softmax gives a biased but highly correlated estimator.
Use Gumbel-softmax as a control variate for REINFORCE.
However, we can do better than naively applying this idea.

REBAR
Observation:
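We can reduce the variance of REINFORCE by marginalizing over $w$ given $z$.
$$
\begin{aligned}
\frac{\partial}{\partial\theta} \mathbb{E}_{p(w)}[f(\sigma_\lambda(w))]
&= \mathbb{E}_{p(w)}\!\left[ f(\sigma_\lambda(w))\, \frac{\partial}{\partial\theta} \log p(w) \right] \\
&= \mathbb{E}_{p(z)} \mathbb{E}_{p(w \mid z)}\!\left[ f(\sigma_\lambda(w))\, \frac{\partial}{\partial\theta} \left( \log p(w \mid z) + \log p(z) \right) \right] \\
&= \mathbb{E}_{p(z)}\!\left[ \frac{\partial}{\partial\theta} \mathbb{E}_{p(w \mid z)}[f(\sigma_\lambda(w))] \right]
 + \mathbb{E}_{p(z)}\!\left[ \mathbb{E}_{p(w \mid z)}[f(\sigma_\lambda(w))]\, \frac{\partial}{\partial\theta} \log p(z) \right]
\end{aligned}
$$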
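The final equality in this derivation compresses one step. A sketch of it, assuming that $z = H(w)$ is a deterministic function of $w$ (so that $p(w) = p(z)\, p(w \mid z)$) and that differentiation and integration can be exchanged: the $\log p(w \mid z)$ part is the REINFORCE identity applied to the conditional distribution, read from right to left,

$$
\mathbb{E}_{p(w \mid z)}\!\left[ f(\sigma_\lambda(w))\, \frac{\partial}{\partial\theta} \log p(w \mid z) \right]
= \int f(\sigma_\lambda(w))\, \frac{\partial}{\partial\theta} p(w \mid z)\, dw
= \frac{\partial}{\partial\theta} \mathbb{E}_{p(w \mid z)}[f(\sigma_\lambda(w))],
$$

while $\frac{\partial}{\partial\theta} \log p(z)$ does not depend on $w$ and can be pulled out of the inner expectation, which gives the second term.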

REBAR
Observation:
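Here, the first term can be reparametrized as
$$
\mathbb{E}_{p(z)}\!\left[ \frac{\partial}{\partial\theta} \mathbb{E}_{p(w \mid z)}[f(\sigma_\lambda(w))] \right]
= \mathbb{E}_{p(z)} \mathbb{E}_{p(\delta)}\!\left[ \frac{\partial}{\partial\theta} f(\sigma_\lambda(\tilde{w})) \right]
$$
where $\tilde{w} = \tilde{g}(\theta, z, \delta)$ is a reparametrization of the conditional distribution of $w$ given $z$, and $\delta_i \sim \mathrm{Uniform}(0, 1)$.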
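For a single Bernoulli variable the conditional reparametrization can be written down explicitly. The sketch below is one way to do it under assumptions not stated on the slide: the binary form $w = \mathrm{logit}(\theta) + \mathrm{logit}(u)$ with $z = 1$ exactly when $w > 0$, which is equivalent to $u > 1 - \theta$. Conditioning on $z$ then amounts to drawing $u$ uniformly from the matching sub-interval. The function names are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def sample_w(theta, u):
    """Unconditional sample w = g(theta, u); z = 1 iff w > 0, i.e. iff u > 1 - theta."""
    return logit(theta) + logit(u)

def sample_w_given_z(theta, z, v):
    """Conditional sample w~ = g~(theta, z, v): map v to a uniform on the interval implied by z."""
    u_cond = np.where(z == 1,
                      (1.0 - theta) + v * theta,   # uniform on (1 - theta, 1)
                      v * (1.0 - theta))           # uniform on (0, 1 - theta)
    return logit(theta) + logit(u_cond)

rng = np.random.default_rng(0)
theta, n = 0.3, 200_000
u = rng.uniform(1e-9, 1.0 - 1e-9, size=n)
v = rng.uniform(1e-9, 1.0 - 1e-9, size=n)

w = sample_w(theta, u)
z = (w > 0).astype(int)                  # z = H(w)
w_tilde = sample_w_given_z(theta, z, v)  # fresh draw of w consistent with z
print(z.mean(), w.mean(), w_tilde.mean())
```

By construction $H(\tilde{w}) = z$ for every sample, and $\tilde{w}$ has the same marginal distribution as $w$ once $z$ is averaged out, while remaining differentiable in $\theta$ for fixed $z$ and $v$.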

REBAR
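Putting it all together,
$$
\frac{\partial}{\partial\theta} \mathbb{E}_{p(z)}[f(z)]
= \mathbb{E}_{\epsilon, \delta}\!\left[ \left[ f(H(w)) - \eta f(\sigma_\lambda(\tilde{w})) \right] \frac{\partial}{\partial\theta} \log p(z) \Big|_{z = H(w)}
 + \eta \frac{\partial}{\partial\theta} f(\sigma_\lambda(w)) - \eta \frac{\partial}{\partial\theta} f(\sigma_\lambda(\tilde{w})) \right]
$$
where $w = g(\theta, \epsilon)$, $\tilde{w} = \tilde{g}(\theta, H(w), \delta)$, and $\epsilon_i, \delta_i \sim \mathrm{Uniform}(0, 1)$.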
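The combined estimator can be sketched for a single Bernoulli variable with hand-written derivatives. Everything specific here is an assumption for illustration rather than the authors' implementation: the toy objective $f(z) = (z - 0.45)^2$, the binary relaxation $\sigma_\lambda(w) = \mathrm{sigmoid}(w / \lambda)$, the parametrization $w = \mathrm{logit}(\theta) + \mathrm{logit}(u)$, and the values $\theta = 0.3$, $\eta = 1$, $\lambda = 0.5$. The exact gradient for this toy problem is 0.1.

```python
import numpy as np

TARGET = 0.45  # toy objective f(s) = (s - TARGET)^2 (assumed)

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(s):
    return (s - TARGET) ** 2

def relaxed_grad(w, dw_dtheta, lam):
    """d/dtheta f(sigmoid(w / lam)) via the chain rule."""
    s = sigmoid(w / lam)
    return 2.0 * (s - TARGET) * s * (1.0 - s) / lam * dw_dtheta

def rebar_samples(theta, eta, lam, n, rng):
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=n)   # noise for w
    v = rng.uniform(1e-9, 1.0 - 1e-9, size=n)   # noise for w~ given z
    base = 1.0 / (theta * (1.0 - theta))        # d logit(theta) / dtheta

    w = logit(theta) + logit(u)
    z = (w > 0).astype(float)                   # z = H(w)

    # Conditional noise: uniform on (1 - theta, 1) if z = 1, on (0, 1 - theta) if z = 0.
    u_cond = np.where(z == 1.0, (1.0 - theta) + v * theta, v * (1.0 - theta))
    w_tilde = logit(theta) + logit(u_cond)
    # d w~ / dtheta, including the dependence of u_cond on theta (simplified by hand).
    dw_tilde = np.where(z == 1.0,
                        base - 1.0 / (theta * u_cond),
                        base - 1.0 / ((1.0 - theta) * (1.0 - u_cond)))

    score = z / theta - (1.0 - z) / (1.0 - theta)   # d/dtheta log p(z)
    reinforce = f(z) * score
    control = (-f(sigmoid(w_tilde / lam)) * score
               + relaxed_grad(w, base, lam)
               - relaxed_grad(w_tilde, dw_tilde, lam))
    return reinforce, reinforce + eta * control

rng = np.random.default_rng(0)
reinforce, rebar = rebar_samples(theta=0.3, eta=1.0, lam=0.5, n=400_000, rng=rng)
print("REINFORCE mean/var:", reinforce.mean(), reinforce.var())
print("REBAR     mean/var:", rebar.mean(), rebar.var())
```

The control term has zero expectation for any $\eta$ and $\lambda$, so both sample means should agree with the exact gradient up to Monte Carlo error; how much the variance changes depends on the chosen $\eta$ and $\lambda$.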

Hyperparameter Optimization
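Let $r(\eta, \lambda)$ be the Monte Carlo REBAR estimator.
Since $r$ is unbiased, $\mathbb{E}[r]$ does not depend on $\eta$ and $\lambda$. Thus,
$$
\frac{\partial}{\partial\eta} \mathrm{Var}(r)
= \frac{\partial}{\partial\eta} \left( \mathbb{E}[r^2] - \mathbb{E}[r]^2 \right)
= \mathbb{E}\!\left[ 2r\, \frac{\partial r}{\partial\eta} \right].
$$
Now we can optimize $\eta$ (and $\lambda$) to minimize the variance.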
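A small sketch of tuning $\eta$ with the gradient $\mathbb{E}[2r\, \partial r / \partial\eta]$, on the same assumed Bernoulli toy problem ($f(z) = (z - 0.45)^2$, $\theta = 0.3$, $\lambda = 0.5$). Writing $r = a + \eta b$, where $a$ is the REINFORCE term and $b$ the zero-mean control term, gives $\partial r / \partial\eta = b$. Only $\eta$ is adapted here; $\lambda$ is held fixed, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

TARGET = 0.45  # toy objective f(s) = (s - TARGET)^2 (assumed)

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(s):
    return (s - TARGET) ** 2

def relaxed_grad(w, dw_dtheta, lam):
    """d/dtheta f(sigmoid(w / lam)) via the chain rule."""
    s = sigmoid(w / lam)
    return 2.0 * (s - TARGET) * s * (1.0 - s) / lam * dw_dtheta

def reinforce_and_control(theta, lam, n, rng):
    """Return (a, b) such that the REBAR sample is r = a + eta * b."""
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=n)
    v = rng.uniform(1e-9, 1.0 - 1e-9, size=n)
    base = 1.0 / (theta * (1.0 - theta))
    w = logit(theta) + logit(u)
    z = (w > 0).astype(float)
    u_cond = np.where(z == 1.0, (1.0 - theta) + v * theta, v * (1.0 - theta))
    w_tilde = logit(theta) + logit(u_cond)
    dw_tilde = np.where(z == 1.0,
                        base - 1.0 / (theta * u_cond),
                        base - 1.0 / ((1.0 - theta) * (1.0 - u_cond)))
    score = z / theta - (1.0 - z) / (1.0 - theta)
    a = f(z) * score
    b = (-f(sigmoid(w_tilde / lam)) * score
         + relaxed_grad(w, base, lam)
         - relaxed_grad(w_tilde, dw_tilde, lam))
    return a, b

rng = np.random.default_rng(0)
a, b = reinforce_and_control(theta=0.3, lam=0.5, n=100_000, rng=rng)

eta = 1.0
second_moment_before = np.mean((a + eta * b) ** 2)
for _ in range(500):
    r = a + eta * b
    grad_eta = np.mean(2.0 * r * b)   # Monte Carlo estimate of E[2 r dr/deta]
    eta -= 0.01 * grad_eta            # gradient step on E[r^2]
second_moment_after = np.mean((a + eta * b) ** 2)
print(eta, second_moment_before, second_moment_after)
```

Because $\mathbb{E}[r]$ is the same for every $\eta$, lowering $\mathbb{E}[r^2]$ lowers the variance by the same amount. In this sketch the objective is quadratic in $\eta$, so a closed-form minimizer exists, but the gradient form is the one that extends to $\lambda$ and to stochastic training.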

Experiments
Minimize $\mathbb{E}_{p(z)}[(z - 0.45)^2]$ where $z \sim \mathrm{Bernoulli}(\theta)$.
Figure: left, log variance; right, loss.
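For reference, this toy objective has a closed form, which makes it a convenient check for any gradient estimator. Treating $\theta$ as the probability of $z = 1$ (the usual convention, assumed here):

$$
\mathbb{E}_{p(z)}[(z - 0.45)^2] = \theta\,(1 - 0.45)^2 + (1 - \theta)\,(0 - 0.45)^2 = 0.2025 + 0.1\,\theta,
\qquad
\frac{\partial}{\partial\theta} \mathbb{E}_{p(z)}[(z - 0.45)^2] = 0.1 .
$$

The exact gradient is small and constant, so the loss is minimized by driving $\theta$ toward 0, and a high-variance estimator can easily swamp this signal.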

Experiments
Maximize the ELBO of a sigmoid belief network:
$$\log p(x \mid \theta) \ge \mathbb{E}_{q(z \mid x, \theta)}[\log p(x, z \mid \theta) - \log q(z \mid x, \theta)]$$
Figure (log variance): left, 2-layer linear; right, 1-layer nonlinear.

Experiments
Maximize the ELBO of a sigmoid belief network:
$$\log p(x \mid \theta) \ge \mathbb{E}_{q(z \mid x, \theta)}[\log p(x, z \mid \theta) - \log q(z \mid x, \theta)]$$
Figure (objective): left, 2-layer linear; right, 1-layer nonlinear.

Questions?
