Robust information bottleneck
Po-Ling Loh
University of Wisconsin
Columbia University
Department of Statistics
Program on deep learning opening workshop
SAMSI
August 12, 2019
Joint work with Varun Jog (UW-Madison)
With thanks to the Simons Institute!
The problem
Despite remarkable success in prediction tasks, trained neural
networks are vulnerable to small, imperceptible perturbations
(Szegedy et al., ’13)
The remedy(?)
Data augmentation (Simard et al., ’03)
Adversarial training using the fast gradient sign method (FGSM):
x_adv = x + ε · sign(∇_x L_θ(x, y))   (Goodfellow et al., ’14)
Gradients of the loss with respect to x are obtained by backpropagating
through the network (see the sketch below)
Randomized smoothing (Lecuyer et al., ’19)
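A minimal sketch of the FGSM update in PyTorch (the model, loss, and ε below are placeholders, not objects from the talk):

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, eps):
    """Return x_adv = x + eps * sign(grad_x L(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)            # forward pass
    loss.backward()                        # backpropagation yields dL/dx
    return (x + eps * x.grad.sign()).detach()

# Toy usage with an arbitrary linear model
model = torch.nn.Linear(4, 3)
x, y = torch.randn(2, 4), torch.tensor([0, 2])
x_adv = fgsm_perturb(model, torch.nn.functional.cross_entropy, x, y, eps=0.1)
```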
Tradeoffs between robustness and accuracy
Robustness: Want |f_rob(x_i) − f_rob(x′)| < δ whenever ‖x_i − x′‖ < ε
Prediction accuracy: Want E[L(f_rob(x_i), y_i)] to be small
Although classifier becomes more robust, accuracy may be
compromised for points near boundary
How to quantify this?
One idea: Robust feature extraction
Try to base predictions on features that are relatively insensitive to
perturbations on inputs
Example: Logistic regression classifier projects covariates to x_i^T β,
but some β vectors may be more robust than others
One idea: Robust feature extraction
Example: Taken from “Robustness may be at odds with accuracy,”
Tsipras et al. (’18)
Y_i ∼ {−1, +1},   x_1 = +y with prob. q, −y with prob. 1 − q,
x_2, . . . , x_{p−1} i.i.d. ∼ N(ηy, 1)
Another interesting method due to Garg et al. (’18) extracts features
using eigenvalues of graph Laplacian constructed over input space
Information bottleneck
Framework introduced by Tishby et al. (’99) for extracting features
that are both succinct and relevant
inf_{T ⊥⊥ Y | X} { I(T; X) − β I(T; Y) }
Recall: I(T; X) = ∫∫ p(x, t) log [ p(x, t) / (p_X(x) p_T(t)) ] dx dt
Experiments reveal a memorization phase where I(T; X) and I(T; Y)
initially increase, and a compression phase where I(T; X) decreases and
I(T; Y) increases (Shwartz-Ziv & Tishby, ’17)
Information bottleneck
Rigorous theory for form of extracted features derived when (X, Y )
are jointly Gaussian (Chechik et al., ’05)
Optimal set of features corresponds to a linear projection T = AX + ε
Rows of A are eigenvectors of Σ_x^{−1} Σ_{xy} Σ_y^{−1} Σ_{yx}, rescaled by functions
of the eigenvalues and the tradeoff parameter
Fisher information regularization
Consider T as parametrized by X, and define
Φ(T|X) = ∫∫ ‖∇_x log p_{T|X}(t|x)‖_2^2 p_{T|X}(t|x) dt p_X(x) dx
Small Φ =⇒ distribution of T is not too sensitive to changes in X
(on average)
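As a quick numerical illustration of the definition (a sketch, not taken from the talk): for a linear-Gaussian conditional T | X = x ∼ N(Ax, I), the score ∇_x log p_{T|X}(t|x) = A^T(t − Ax) has a closed form, and the Monte Carlo average below comes out near ‖A‖_F^2; the matrix A and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])                 # toy feature map, k x p

def phi_monte_carlo(x_samples, n_t=2000):
    """Estimate Phi(T|X) = E_X E_{T|X} ||grad_x log p(T|x)||_2^2 by Monte Carlo."""
    total = 0.0
    for x in x_samples:
        t = A @ x + rng.standard_normal((n_t, A.shape[0]))    # draws from p(t|x)
        score = (t - A @ x) @ A                               # rows are A^T (t - A x)
        total += np.mean(np.sum(score**2, axis=1))
    return total / len(x_samples)

x_samples = rng.standard_normal((200, 2))                     # draws from p_X
print(phi_monte_carlo(x_samples))          # approx ||A||_F^2 = 5.25
```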
Robust information bottleneck
Mutual information formulation:
inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
Simpler(?) formulation using Φ(T|X) for regularization:
inf_{T ⊥⊥ Y | X} { mmse(Y | T) + β Φ(T|X) }
Here, define
mmse(Y | T) = E[(Y − E(Y | T))^2]
Remarks
MMSE formulation may be preferable when Y is continuous, to
quantify closeness between predicted values and outputs
Notably, both objective functions are invariant to scalings/bijective
transformations of T
Thus, if we optimize over features of the form T = AX + ε, where ε
is Gaussian, we can assume ε ∼ N(0, I)
Properties of regularizer
Perturbations in mutual information:
I(X; T) − I(X + √δ Z; T) = (δ/2) Φ(T|X) + o(δ),   where Z ∼ N(0, I)
Perturbations in KL divergence:
sup_{‖∆‖_2 ≤ δ} ∫ KL(p_{T|X=x+∆} ‖ p_{T|X=x}) p_X(x) dx ≤ δ Φ(T|X)
∴ small Φ(T|X) =⇒ small KL divergence between perturbed
distributions =⇒ small TV distance, Wasserstein distance, etc.
Properties of regularizer
When T = AX + ξ, where ξ ∼ N(0, I), we have
Φ(T|X) = ‖A‖_F^2,
so lower SNR =⇒ more robust features
When T ∈ {+1, −1} with P(T = 1 | X = x) = 1 / (1 + exp(−x^T w)), we have
Φ(T|X = x) = ‖w‖_2^2 · P(T = 1 | X = x) · P(T = −1 | X = x),
so the regularizer encourages more confident predictions (see the check below)
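A short numerical check of the logistic-case formula (a sketch; the vectors w and x below are arbitrary):

```python
import numpy as np

w = np.array([0.7, -1.2, 0.3])
x = np.array([0.5, 0.1, -2.0])

p1 = 1.0 / (1.0 + np.exp(-x @ w))          # P(T = 1 | X = x)
grad_log_p1 = (1 - p1) * w                 # grad_x log P(T = 1 | X = x)
grad_log_m1 = -p1 * w                      # grad_x log P(T = -1 | X = x)

phi_direct = p1 * np.sum(grad_log_p1**2) + (1 - p1) * np.sum(grad_log_m1**2)
phi_formula = np.sum(w**2) * p1 * (1 - p1)
print(phi_direct, phi_formula)             # the two values agree
```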
Gaussian variables
Can be shown (in both RIB formulations) that when (X, Y ) are
jointly Gaussian, optimal T is also Gaussian
Consequently, assume T = AX + ε, where ε ∼ N(0, I) and A ∈ R^{k×p},
and optimize over A
For the MMSE formulation, the objective can be rewritten as
min_A { tr( Σ_y − Σ_{yx} A^T (A Σ_x A^T + I)^{−1} A Σ_{xy} ) + β tr(A^T A) }
When Σ_x = σ_x^2 I, can show that the optimal projection is A = (1/σ_x) D^{1/2} U^T,
where U has columns equal to the top eigenvectors of Σ_{xy} Σ_{yx} and the entries
of D are
d_i = λ_i/β − 1 if λ_i ≥ β, and 0 otherwise (see the sketch below)
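A sketch of how this closed form could be evaluated numerically, assuming the formula stated above; Σ_{xy}, σ_x, β, and k below are toy values, not quantities from the experiments.

```python
import numpy as np

sigma_x, beta, k = 1.0, 0.5, 2
Sigma_xy = np.array([[1.0, 0.2],
                     [0.3, 0.8],
                     [0.0, 0.1]])            # p x dim(y), toy values

M = Sigma_xy @ Sigma_xy.T                    # Sigma_xy Sigma_yx
lam, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
top = np.argsort(lam)[::-1][:k]
lam_top, U = lam[top], vecs[:, top]          # columns of U are top eigenvectors

d = np.maximum(lam_top / beta - 1.0, 0.0)    # d_i = lambda_i/beta - 1 if lambda_i >= beta
A = (np.diag(np.sqrt(d)) @ U.T) / sigma_x    # k x p optimal projection
print(A)
```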
Gaussian variables
For mutual information formulation, optimal choice of T corresponds
to
A = (1/σ_x) (D^{−1} − I)^{1/2} U^T,
where U consists of the top eigenvectors of (Σ_y − Σ_{yx} Σ_{xy})^{−1/2} Σ_{yx} and the
entries of D are
d_i = argmin_{0 ≤ d ≤ 1} { (1/2) log(d/λ_i + 1) − γ log d + β/d }
Again, larger β encourages d → 1, so the scalings in A → 0 (a small 1-D search is sketched below)
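Each d_i only requires a one-dimensional minimization, so a simple grid search suffices; a sketch, assuming the scalar objective above (λ_i, γ, β below are illustrative values):

```python
import numpy as np

def d_opt(lam_i, gamma, beta, grid=np.linspace(1e-4, 1.0, 10_000)):
    """Minimize 0.5*log(d/lam_i + 1) - gamma*log(d) + beta/d over d in (0, 1]."""
    obj = 0.5 * np.log(grid / lam_i + 1.0) - gamma * np.log(grid) + beta / grid
    return grid[np.argmin(obj)]

print(d_opt(lam_i=2.0, gamma=0.1, beta=0.05))
```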
Variational optimization strategy
How to optimize robust information bottleneck objective, in general?
inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
“Deep variational information bottleneck,” Alemi et al. (’17)
Idea is to restrict optimization to certain class of conditional
distributions pT|X
Find minimizer of surrogate function which upper-bounds objective,
using SGD
Builds upon idea of “variational autoencoders” (Kingma & Welling,
’13)
Variational optimization strategy
Train encoder neural network to extract features
T | X = x ∼ N(µ(x; θ), Σ(x; θ))
For the Φ(T|X) term, we can compute gradients explicitly:
∇_x log p_{T|X}(t|x) = ∇_x log [ 1/√((2π)^k ∏_{j=1}^k σ_j^2(x)) · exp( −∑_{j=1}^k (t_j − µ_j(x))^2 / (2σ_j(x)^2) ) ]
= −∑_{j=1}^k ∇_x σ_j(x) / σ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x)) / σ_j(x)^2] ∇_x µ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x))^2 / σ_j(x)^3] ∇_x σ_j(x)
Importantly, Φ(T|X) (and its gradients) can be computed in terms of
gradients of parameters θ = ({µj }, {σj }) with respect to input x
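One way to implement this term (a sketch, not necessarily the implementation used in the experiments): averaging the squared gradient over t ∼ N(µ(x), diag(σ(x)^2)) gives Φ(T|X = x) = ∑_j (‖∇_x µ_j(x)‖_2^2 + 2‖∇_x σ_j(x)‖_2^2)/σ_j(x)^2, which can be evaluated with automatic differentiation; the tiny encoder below is a placeholder, not the architecture from the talk.

```python
import torch

class Encoder(torch.nn.Module):
    def __init__(self, p=4, k=2, width=16):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(p, width), torch.nn.ReLU())
        self.mu_head = torch.nn.Linear(width, k)
        self.log_sigma_head = torch.nn.Linear(width, k)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), torch.exp(self.log_sigma_head(h))

def phi_regularizer(encoder, x_batch):
    """Batch average of sum_j (||grad_x mu_j||^2 + 2*||grad_x sigma_j||^2) / sigma_j^2."""
    phi = 0.0
    for x in x_batch:
        x = x.clone().requires_grad_(True)
        mu, sigma = encoder(x)
        for j in range(mu.shape[0]):
            g_mu = torch.autograd.grad(mu[j], x, retain_graph=True, create_graph=True)[0]
            g_sig = torch.autograd.grad(sigma[j], x, retain_graph=True, create_graph=True)[0]
            phi = phi + (g_mu.pow(2).sum() + 2 * g_sig.pow(2).sum()) / sigma[j] ** 2
    return phi / x_batch.shape[0]

enc = Encoder()
print(phi_regularizer(enc, torch.randn(8, 4)))   # differentiable in the encoder parameters
```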
Variational optimization strategy
For mmse, we have the following upper bound for any f̃:
mmse(Y | T) ≤ E[(Y − f̃(T))^2],
and we will train a decoder network f̃ parametrized by φ
Importantly, all terms in variational objective can be approximated
from samples and the sum optimized using SGD
[Architecture diagram: an encoder network with parameters θ maps x to (µ_1, µ_2, σ_1, σ_2); a feature t ∼ N(µ, Σ) is fed to a decoder with parameters φ to produce ŷ; a copy of the encoder supplies ∇_x µ and ∇_x σ, which enter the loss ℓ(θ, φ, x, y, ∇_x µ, ∇_x σ)]
Variational optimization strategy
More details:
E[(Y − f̃(T))^2] ≈ (1/n) ∑_{i=1}^n ∫ (y_i − f̃(t; φ))^2 p_{T|X}(t|x_i; θ) dt
Furthermore, we can use a reparametrization trick (Kingma & Welling,
’13) to write the latter integral as
∫ (y_i − f̃(τ(x_i, ε; θ); φ))^2 p_ε(ε) dε,
where τ(x, ε; θ) = µ(x; θ) + Σ(x; θ)^{1/2} ε and ε ∼ N(0, I)
Thus, we can easily generate stochastic gradients by plugging in
standard normal variables and input data
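A minimal sketch of this reparametrized MMSE term; the throwaway encoder and decoder below are placeholders, not the networks used in the experiments.

```python
import torch

k = 2
W_mu, W_sig = torch.randn(k, 3), torch.randn(k, 3)           # toy "encoder" weights
encoder = lambda x: (W_mu @ x, torch.exp(W_sig @ x))          # mu(x) and sigma(x) > 0
decoder = lambda t: t.sum(dim=-1)                             # toy f~(t; phi)

def mmse_term(x_batch, y_batch, n_eps=8):
    total = 0.0
    for x, y in zip(x_batch, y_batch):
        mu, sigma = encoder(x)
        eps = torch.randn(n_eps, k)
        t = mu + sigma * eps                                   # reparametrization trick
        total = total + ((y - decoder(t)) ** 2).mean()         # Monte Carlo over eps
    return total / len(x_batch)

x_batch, y_batch = torch.randn(16, 3), torch.randn(16)
print(mmse_term(x_batch, y_batch))
```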
Variational optimization strategy
Variational approximation for −I(T; Y ) + γI(T; X) is more
complicated, but also involves training a decoder function
Experiments (Gaussian mixture)
Generate Y = ±1 with probability 1/2 each, and
X | Y = 1 ∼ N( (1, 1)^T, diag(16, 1) ),   X | Y = −1 ∼ N( (−1, −1)^T, diag(16, 1) )
Extract linear features: T = w^T X + ε, where ε ∼ N(0, 1)
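For concreteness, a short sketch of this data-generating process and the linear feature map (the sample size and the particular w are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.choice([-1, 1], size=n)                           # Y = +/-1 with probability 1/2
cov = np.array([[16.0, 0.0], [0.0, 1.0]])
X = np.stack([rng.multivariate_normal(yi * np.ones(2), cov) for yi in y])

w = np.array([0.1, 0.5])                                  # an arbitrary linear feature
T = X @ w + rng.standard_normal(n)                        # T = w^T X + eps, eps ~ N(0, 1)
```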
Experiments (Gaussian mixture)
MMSE formulation: inf_w { mmse(Y | T) + β Φ(T|X) }
As β increases, the slope of w tilts and ‖w‖_2 decreases
Experiments (MNIST)
≈ 50,000 training images, 10,000 test images, 28 × 28 pixels
Two hidden layers of width 1024, output feature dimension k = 2
Decoder is logistic classifier with softmax
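A hedged sketch of an encoder/decoder matching this description (ReLU activations and a log-σ head are assumptions, not details given in the talk):

```python
import torch

class MNISTEncoder(torch.nn.Module):
    def __init__(self, k=2):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.mu = torch.nn.Linear(1024, k)
        self.log_sigma = torch.nn.Linear(1024, k)

    def forward(self, x):                                   # x: (batch, 784)
        h = self.body(x)
        return self.mu(h), torch.exp(self.log_sigma(h))

decoder = torch.nn.Linear(2, 10)                            # logistic classifier; softmax applied in the loss

enc = MNISTEncoder()
mu, sigma = enc(torch.randn(5, 28 * 28))
logits = decoder(mu + sigma * torch.randn_like(sigma))      # reparametrized features
print(logits.shape)                                         # torch.Size([5, 10])
```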
Experiments (MNIST)
Mutual information formulation:
inf_w { −I(T; Y) + γ I(T; X) + β Φ(T|X) }, with γ = 0
As β increases, accuracy of classifier decreases and Fisher information
term decreases
Contributions
Presented new RIB method for extracting robust features
Derived rigorous theory for optimal features in Gaussian setting
Provided variational optimization method for extracting features in
general case
Preliminary experimental results are promising!
Future work
Show that this method is actually robust to adversarial (or
average-case) perturbations . . .
Explore pros and cons of mutual information vs. MMSE formulations
Alternative methods for robust feature extraction?
Thank you!