Robust information bottleneck
Po-Ling Loh
University of Wisconsin
Columbia University
Department of Statistics
Program on deep learning opening workshop
SAMSI
August 12, 2019
Joint work with Varun Jog (UW-Madison)
With thanks to the Simons Institute!
The problem
Despite remarkable success in prediction tasks, trained neural
networks are vulnerable to small, imperceptible perturbations
(Szegedy et al., ’13)
The remedy(?)
Data augmentation (Simard et al., ’03)
Adversarial training using the fast gradient sign method (FGSM):
x_adv = x + ε · sign(∇_x L_θ(x, y))   (Goodfellow et al., ’14)
Gradients of the loss with respect to x are obtained by backpropagating
through the network (see the sketch below)
Randomized smoothing (Lecuyer et al., ’19)
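A minimal sketch of the FGSM update in PyTorch (the model, loss, and ε below are placeholders, not objects from the talk):

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, eps):
    """Return x_adv = x + eps * sign(grad_x L(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)            # forward pass
    loss.backward()                        # backpropagation yields dL/dx
    return (x + eps * x.grad.sign()).detach()

# Toy usage with an arbitrary linear model
model = torch.nn.Linear(4, 3)
x, y = torch.randn(2, 4), torch.tensor([0, 2])
x_adv = fgsm_perturb(model, torch.nn.functional.cross_entropy, x, y, eps=0.1)
```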
Tradeoffs between robustness and accuracy
Robustness: Want |f_rob(x_i) − f_rob(x′)| < δ whenever ‖x_i − x′‖ < ε
Prediction accuracy: Want E[L(f_rob(x_i), y_i)] to be small
Although classifier becomes more robust, accuracy may be
compromised for points near boundary
How to quantify this?
One idea: Robust feature extraction
Try to base predictions on features that are relatively insensitive to
perturbations on inputs
Example: Logistic regression classifier projects covariates to x_i^T β,
but some β vectors may be more robust than others
One idea: Robust feature extraction
Example: Taken from “Robustness may be at odds with accuracy,”
Tsipras et al. (’18)
Y_i ∼ {−1, +1},   x_1 = +y with prob. q, −y with prob. 1 − q,
x_2, . . . , x_{p−1} i.i.d. ∼ N(ηy, 1)
Another interesting method due to Garg et al. (’18) extracts features
using eigenvalues of graph Laplacian constructed over input space
Information bottleneck
Framework introduced by Tishby et al. (’99) for extracting features
that are both succinct and relevant
inf_{T ⊥⊥ Y | X} { I(T; X) − β I(T; Y) }
Recall: I(T; X) = ∫∫ p(x, t) log [ p(x, t) / (p_X(x) p_T(t)) ] dx dt
Experiments reveal a memorization phase where I(T; X) and I(T; Y)
initially increase, and a compression phase where I(T; X) decreases and
I(T; Y) increases (Shwartz-Ziv & Tishby, ’17)
Information bottleneck
Rigorous theory for form of extracted features derived when (X, Y )
are jointly Gaussian (Chechik et al., ’05)
Optimal set of features corresponds to a linear projection T = AX + ε
Rows of A are eigenvectors of Σ_x^{−1} Σ_{xy} Σ_y^{−1} Σ_{yx}, rescaled by functions
of the eigenvalues and the tradeoff parameter
Fisher information regularization
Consider T as parametrized by X, and define
Φ(T|X) = ∫∫ ‖∇_x log p_{T|X}(t|x)‖_2^2 p_{T|X}(t|x) dt p_X(x) dx
Small Φ =⇒ distribution of T is not too sensitive to changes in X
(on average)
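As a quick numerical illustration of the definition (a sketch, not taken from the talk): for a linear-Gaussian conditional T | X = x ∼ N(Ax, I), the score ∇_x log p_{T|X}(t|x) = A^T(t − Ax) has a closed form, and the Monte Carlo average below comes out near ‖A‖_F^2; the matrix A and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])                 # toy feature map, k x p

def phi_monte_carlo(x_samples, n_t=2000):
    """Estimate Phi(T|X) = E_X E_{T|X} ||grad_x log p(T|x)||_2^2 by Monte Carlo."""
    total = 0.0
    for x in x_samples:
        t = A @ x + rng.standard_normal((n_t, A.shape[0]))    # draws from p(t|x)
        score = (t - A @ x) @ A                               # rows are A^T (t - A x)
        total += np.mean(np.sum(score**2, axis=1))
    return total / len(x_samples)

x_samples = rng.standard_normal((200, 2))                     # draws from p_X
print(phi_monte_carlo(x_samples))          # approx ||A||_F^2 = 5.25
```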
Robust information bottleneck
Mutual information formulation:
inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
Simpler(?) formulation using Φ(T|X) for regularization:
inf_{T ⊥⊥ Y | X} { mmse(Y | T) + β Φ(T|X) }
Here, define
mmse(Y | T) = E[(Y − E(Y | T))^2]
Remarks
MMSE formulation may be preferable when Y is continuous, to
quantify closeness between predicted values and outputs
Notably, both objective functions are invariant to scalings/bijective
transformations of T
Thus, if we optimize over features of the form T = AX + ε, where ε
is Gaussian, we can assume ε ∼ N(0, I)
Properties of regularizer
Perturbations in mutual information:
I(X; T) − I(X + √δ Z; T) = (δ/2) Φ(T|X) + o(δ),   where Z ∼ N(0, I)
Perturbations in KL divergence:
sup_{‖∆‖_2 ≤ δ} ∫ KL(p_{T|X=x+∆} ‖ p_{T|X=x}) p_X(x) dx ≤ δ Φ(T|X)
∴ small Φ(T|X) =⇒ small KL divergence between perturbed
distributions =⇒ small TV distance, Wasserstein distance, etc.
Properties of regularizer
When T = AX + ξ, where ξ ∼ N(0, I), we have
Φ(T|X) = ‖A‖_F^2,
so lower SNR =⇒ more robust features
When T ∈ {+1, −1} with P(T = 1 | X = x) = 1 / (1 + exp(−x^T w)), we have
Φ(T|X = x) = ‖w‖_2^2 · P(T = 1 | X = x) · P(T = −1 | X = x),
so the regularizer encourages more confident predictions (see the check below)
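A short numerical check of the logistic-case formula (a sketch; the vectors w and x below are arbitrary):

```python
import numpy as np

w = np.array([0.7, -1.2, 0.3])
x = np.array([0.5, 0.1, -2.0])

p1 = 1.0 / (1.0 + np.exp(-x @ w))          # P(T = 1 | X = x)
grad_log_p1 = (1 - p1) * w                 # grad_x log P(T = 1 | X = x)
grad_log_m1 = -p1 * w                      # grad_x log P(T = -1 | X = x)

phi_direct = p1 * np.sum(grad_log_p1**2) + (1 - p1) * np.sum(grad_log_m1**2)
phi_formula = np.sum(w**2) * p1 * (1 - p1)
print(phi_direct, phi_formula)             # the two values agree
```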
Gaussian variables
Can be shown (in both RIB formulations) that when (X, Y ) are
jointly Gaussian, optimal T is also Gaussian
Consequently, assume T = AX + ε, where ε ∼ N(0, I) and A ∈ R^{k×p},
and optimize over A
For the MMSE formulation, the objective can be rewritten as
min_A { tr( Σ_y − Σ_{yx} A^T (A Σ_x A^T + I)^{−1} A Σ_{xy} ) + β tr(A^T A) }
When Σ_x = σ_x^2 I, can show that the optimal projection is A = (1/σ_x) D^{1/2} U^T,
where U has columns equal to the top eigenvectors of Σ_{xy} Σ_{yx} and the entries
of D are
d_i = λ_i/β − 1 if λ_i ≥ β, and 0 otherwise (see the sketch below)
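A sketch of how this closed form could be evaluated numerically, assuming the formula stated above; Σ_{xy}, σ_x, β, and k below are toy values, not quantities from the experiments.

```python
import numpy as np

sigma_x, beta, k = 1.0, 0.5, 2
Sigma_xy = np.array([[1.0, 0.2],
                     [0.3, 0.8],
                     [0.0, 0.1]])            # p x dim(y), toy values

M = Sigma_xy @ Sigma_xy.T                    # Sigma_xy Sigma_yx
lam, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
top = np.argsort(lam)[::-1][:k]
lam_top, U = lam[top], vecs[:, top]          # columns of U are top eigenvectors

d = np.maximum(lam_top / beta - 1.0, 0.0)    # d_i = lambda_i/beta - 1 if lambda_i >= beta
A = (np.diag(np.sqrt(d)) @ U.T) / sigma_x    # k x p optimal projection
print(A)
```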
Gaussian variables
For mutual information formulation, optimal choice of T corresponds
to
A = (1/σ_x) (D^{−1} − I)^{1/2} U^T,
where U consists of the top eigenvectors of (Σ_y − Σ_{yx} Σ_{xy})^{−1/2} Σ_{yx} and the
entries of D are
d_i = argmin_{0 ≤ d ≤ 1} { (1/2) log(d/λ_i + 1) − γ log d + β/d }
Again, larger β encourages d → 1, so the scalings in A → 0 (a small 1-D search is sketched below)
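Each d_i only requires a one-dimensional minimization, so a simple grid search suffices; a sketch, assuming the scalar objective above (λ_i, γ, β below are illustrative values):

```python
import numpy as np

def d_opt(lam_i, gamma, beta, grid=np.linspace(1e-4, 1.0, 10_000)):
    """Minimize 0.5*log(d/lam_i + 1) - gamma*log(d) + beta/d over d in (0, 1]."""
    obj = 0.5 * np.log(grid / lam_i + 1.0) - gamma * np.log(grid) + beta / grid
    return grid[np.argmin(obj)]

print(d_opt(lam_i=2.0, gamma=0.1, beta=0.05))
```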
Variational optimization strategy
How to optimize robust information bottleneck objective, in general?
inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
“Deep variational information bottleneck,” Alemi et al. (’17)
Idea is to restrict optimization to certain class of conditional
distributions pT|X
Find minimizer of surrogate function which upper-bounds objective,
using SGD
Builds upon idea of “variational autoencoders” (Kingma & Welling,
’13)
Variational optimization strategy
Train encoder neural network to extract features
T | X = x ∼ N(µ(x; θ), Σ(x; θ))
For the Φ(T|X) term, we can compute gradients explicitly:
∇_x log p_{T|X}(t|x) = ∇_x log [ 1/√((2π)^k ∏_{j=1}^k σ_j^2(x)) · exp( −∑_{j=1}^k (t_j − µ_j(x))^2 / (2σ_j(x)^2) ) ]
= −∑_{j=1}^k ∇_x σ_j(x) / σ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x)) / σ_j(x)^2] ∇_x µ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x))^2 / σ_j(x)^3] ∇_x σ_j(x)
Importantly, Φ(T|X) (and its gradients) can be computed in terms of
gradients of parameters θ = ({µj }, {σj }) with respect to input x
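One way to implement this term (a sketch, not necessarily the implementation used in the experiments): averaging the squared gradient over t ∼ N(µ(x), diag(σ(x)^2)) gives Φ(T|X = x) = ∑_j (‖∇_x µ_j(x)‖_2^2 + 2‖∇_x σ_j(x)‖_2^2)/σ_j(x)^2, which can be evaluated with automatic differentiation; the tiny encoder below is a placeholder, not the architecture from the talk.

```python
import torch

class Encoder(torch.nn.Module):
    def __init__(self, p=4, k=2, width=16):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(p, width), torch.nn.ReLU())
        self.mu_head = torch.nn.Linear(width, k)
        self.log_sigma_head = torch.nn.Linear(width, k)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), torch.exp(self.log_sigma_head(h))

def phi_regularizer(encoder, x_batch):
    """Batch average of sum_j (||grad_x mu_j||^2 + 2*||grad_x sigma_j||^2) / sigma_j^2."""
    phi = 0.0
    for x in x_batch:
        x = x.clone().requires_grad_(True)
        mu, sigma = encoder(x)
        for j in range(mu.shape[0]):
            g_mu = torch.autograd.grad(mu[j], x, retain_graph=True, create_graph=True)[0]
            g_sig = torch.autograd.grad(sigma[j], x, retain_graph=True, create_graph=True)[0]
            phi = phi + (g_mu.pow(2).sum() + 2 * g_sig.pow(2).sum()) / sigma[j] ** 2
    return phi / x_batch.shape[0]

enc = Encoder()
print(phi_regularizer(enc, torch.randn(8, 4)))   # differentiable in the encoder parameters
```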
Variational optimization strategy
For mmse, we have the following upper bound for any f̃:
mmse(Y | T) ≤ E[(Y − f̃(T))^2],
and we will train a decoder network f̃ parametrized by φ
Importantly, all terms in variational objective can be approximated
from samples and the sum optimized using SGD
[Architecture diagram: an encoder network with parameters θ maps x to (µ_1, µ_2, σ_1, σ_2); a feature t ∼ N(µ, Σ) is fed to a decoder with parameters φ to produce ŷ; a copy of the encoder supplies ∇_x µ and ∇_x σ, which enter the loss ℓ(θ, φ, x, y, ∇_x µ, ∇_x σ)]
Variational optimization strategy
More details:
E[(Y − f̃(T))^2] ≈ (1/n) ∑_{i=1}^n ∫ (y_i − f̃(t; φ))^2 p_{T|X}(t|x_i; θ) dt
Furthermore, we can use a reparametrization trick (Kingma & Welling,
’13) to write the latter integral as
∫ (y_i − f̃(τ(x_i, ε; θ); φ))^2 p_ε(ε) dε,
where τ(x, ε; θ) = µ(x; θ) + Σ(x; θ)^{1/2} ε and ε ∼ N(0, I)
Thus, we can easily generate stochastic gradients by plugging in
standard normal variables and input data
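A minimal sketch of this reparametrized MMSE term; the throwaway encoder and decoder below are placeholders, not the networks used in the experiments.

```python
import torch

k = 2
W_mu, W_sig = torch.randn(k, 3), torch.randn(k, 3)           # toy "encoder" weights
encoder = lambda x: (W_mu @ x, torch.exp(W_sig @ x))          # mu(x) and sigma(x) > 0
decoder = lambda t: t.sum(dim=-1)                             # toy f~(t; phi)

def mmse_term(x_batch, y_batch, n_eps=8):
    total = 0.0
    for x, y in zip(x_batch, y_batch):
        mu, sigma = encoder(x)
        eps = torch.randn(n_eps, k)
        t = mu + sigma * eps                                   # reparametrization trick
        total = total + ((y - decoder(t)) ** 2).mean()         # Monte Carlo over eps
    return total / len(x_batch)

x_batch, y_batch = torch.randn(16, 3), torch.randn(16)
print(mmse_term(x_batch, y_batch))
```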
Variational optimization strategy
Variational approximation for −I(T; Y ) + γI(T; X) is more
complicated, but also involves training a decoder function
Experiments (Gaussian mixture)
Generate Y = ±1 with probability 1/2 each, and
X | Y = 1 ∼ N( (1, 1)^T, diag(16, 1) ),   X | Y = −1 ∼ N( (−1, −1)^T, diag(16, 1) )
Extract linear features: T = w^T X + ε, where ε ∼ N(0, 1)
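For concreteness, a short sketch of this data-generating process and the linear feature map (the sample size and the particular w are arbitrary choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.choice([-1, 1], size=n)                           # Y = +/-1 with probability 1/2
cov = np.array([[16.0, 0.0], [0.0, 1.0]])
X = np.stack([rng.multivariate_normal(yi * np.ones(2), cov) for yi in y])

w = np.array([0.1, 0.5])                                  # an arbitrary linear feature
T = X @ w + rng.standard_normal(n)                        # T = w^T X + eps, eps ~ N(0, 1)
```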
Experiments (Gaussian mixture)
MMSE formulation: inf_w { mmse(Y | T) + β Φ(T|X) }
As β increases, the slope of w tilts and ‖w‖_2 decreases
Experiments (MNIST)
≈ 50,000 training images, 10,000 test images, 28 × 28 pixels
Two hidden layers of width 1024, output feature dimension k = 2
Decoder is logistic classifier with softmax
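A hedged sketch of an encoder/decoder matching this description (ReLU activations and a log-σ head are assumptions, not details given in the talk):

```python
import torch

class MNISTEncoder(torch.nn.Module):
    def __init__(self, k=2):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 1024), torch.nn.ReLU(),
            torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.mu = torch.nn.Linear(1024, k)
        self.log_sigma = torch.nn.Linear(1024, k)

    def forward(self, x):                                   # x: (batch, 784)
        h = self.body(x)
        return self.mu(h), torch.exp(self.log_sigma(h))

decoder = torch.nn.Linear(2, 10)                            # logistic classifier; softmax applied in the loss

enc = MNISTEncoder()
mu, sigma = enc(torch.randn(5, 28 * 28))
logits = decoder(mu + sigma * torch.randn_like(sigma))      # reparametrized features
print(logits.shape)                                         # torch.Size([5, 10])
```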
Experiments (MNIST)
Mutual information formulation:
inf_w { −I(T; Y) + γ I(T; X) + β Φ(T|X) }, with γ = 0
As β increases, accuracy of classifier decreases and Fisher information
term decreases
Contributions
Presented new RIB method for extracting robust features
Derived rigorous theory for optimal features in Gaussian setting
Provided variational optimization method for extracting features in
general case
Preliminary experimental results are promising!
Future work
Show that this method is actually robust to adversarial (or
average-case) perturbations . . .
Explore pros and cons of mutual information vs. MMSE formulations
Alternative methods for robust feature extraction?
Thank you!