Robust information bottleneck

Po-Ling Loh
Department of Statistics, University of Wisconsin / Columbia University
Deep Learning Opening Workshop, SAMSI
August 12, 2019
Joint work with Varun Jog (UW-Madison)
With thanks to the Simons Institute!

The problem

Despite remarkable success in prediction tasks, trained neural networks are vulnerable to small, imperceptible perturbations (Szegedy et al., '13)

The remedy(?)

Data augmentation (Simard et al., '03)
Adversarial training using the fast gradient sign method (FGSM), sketched below:
    x_adv = x + ε · sign(∇_x L_θ(x, y))   (Goodfellow et al., '14)
Gradients of the loss with respect to x can be extracted from the backpropagation operation on the network
Randomized smoothing (Lecuyer et al., '19)

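As a concrete illustration of the FGSM update above, here is a minimal PyTorch sketch; `model`, `loss_fn`, and `eps` are placeholders for whatever network, loss, and perturbation budget are in use.

```python
import torch

def fgsm_example(model, loss_fn, x, y, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x L_theta(x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)      # forward pass through the network
    loss.backward()                      # backprop also yields gradients w.r.t. the input
    return (x_adv + eps * x_adv.grad.sign()).detach()
```
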
Tradeoffs between robustness and accuracy

Robustness: Want |f_rob(x_i) − f_rob(x)| < δ whenever ‖x_i − x‖ < ε
Prediction accuracy: Want E[L(f_rob(x_i), y_i)] to be small
Although the classifier becomes more robust, accuracy may be compromised for points near the boundary
How to quantify this?

One idea: Robust feature extraction

Try to base predictions on features that are relatively insensitive to perturbations of the inputs
Example: A logistic regression classifier projects the covariates to x_i^T β, but some β vectors may be more robust than others

One idea: Robust feature extraction

Example (taken from "Robustness may be at odds with accuracy," Tsipras et al., '18; sampled in the sketch below):
    Y ∼ Unif{−1, +1},   x_1 = +Y with prob. q and −Y with prob. 1 − q,   x_2, …, x_{p−1} i.i.d. ∼ N(ηY, 1)
Another interesting method, due to Garg et al. ('18), extracts features using eigenvalues of a graph Laplacian constructed over the input space

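A small NumPy sketch of the toy distribution above; the values of q and η are illustrative, not from the slide.

```python
import numpy as np

def sample_tsipras(n, p, q=0.9, eta=0.2, seed=0):
    """Toy model of Tsipras et al. ('18): a robust-but-noisy first feature and
    many weakly informative Gaussian features (q and eta are illustrative)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < q, y, -y)                    # x1 = +y w.p. q, -y otherwise
    rest = rng.normal(eta * y[:, None], 1.0, size=(n, p - 2))  # x2,...,x_{p-1} i.i.d. N(eta*y, 1)
    return np.column_stack([x1, rest]), y
```
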
Information bottleneck

Framework introduced by Tishby et al. ('99) for extracting features that are both succinct and relevant:
    inf_{T ⊥⊥ Y | X} { I(T; X) − β I(T; Y) }
Recall: I(T; X) = ∫∫ p(x, t) log [ p(x, t) / (p_X(x) p_T(t)) ] dx dt
Experiments reveal a memorization phase, where I(T; X) and I(T; Y) initially increase, and a compression phase, where I(T; X) decreases and I(T; Y) increases (Shwartz-Ziv & Tishby, '17)

Information bottleneck

Rigorous theory for the form of the extracted features has been derived when (X, Y) are jointly Gaussian (Chechik et al., '05)
The optimal set of features corresponds to a linear projection T = AX + ξ
Rows of A are eigenvectors of Σ_x^{−1} Σ_{xy} Σ_y^{−1} Σ_{yx}, rescaled by functions of the eigenvalues and the tradeoff parameter

Fisher information regularization

Consider T as parametrized by X, and define
    Φ(T|X) = ∫ [ ∫ ‖∇_x log p_{T|X}(t|x)‖₂² p_{T|X}(t|x) dt ] p_X(x) dx
Small Φ =⇒ the distribution of T is not too sensitive to changes in X (on average)

Robust information bottleneck

Mutual information formulation:
    inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
Simpler(?) formulation using Φ(T|X) for regularization:
    inf_{T ⊥⊥ Y | X} { mmse(Y|T) + β Φ(T|X) }
Here, define mmse(Y|T) = E[(Y − E(Y|T))²]

Remarks

The MMSE formulation may be preferable when Y is continuous, to quantify closeness between predicted values and outputs
Notably, both objective functions are invariant to scalings/bijective transformations of T
Thus, if we optimize over features of the form T = AX + ξ, where ξ is Gaussian, we can assume ξ ∼ N(0, I)

Properties of regularizer

Perturbations in mutual information:
    I(X; T) − I(X + √δ Z; T) = (δ/2) Φ(T|X) + o(δ),   where Z ∼ N(0, I)
Perturbations in KL divergence:
    sup_{‖∆‖₂ ≤ δ} ∫ KL( p_{T|X=x+∆} ‖ p_{T|X=x} ) p_X(x) dx ≤ δ Φ(T|X)
∴ Small Φ(T|X) =⇒ small KL divergence between perturbed distributions =⇒ small TV distance, Wasserstein distance, etc.

Properties of regularizer

When T = AX + ξ, where ξ ∼ N(0, I), we have Φ(T|X) = ‖A‖_F², so lower SNR =⇒ more robust features (checked numerically below)
When T ∈ {+1, −1} with P(T = 1 | X = x) = 1 / (1 + exp(−x^T w)), we have
    Φ(T|X = x) = ‖w‖₂² · P(T = 1 | X = x) · P(T = −1 | X = x),
so the regularizer encourages more confident predictions

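To make the first claim concrete, here is a short NumPy check (a sketch with arbitrary dimensions): for T = AX + ξ, one has ∇_x log p_{T|X}(t|x) = A^T (t − Ax), and averaging its squared norm recovers ‖A‖_F².

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 3, 5, 200_000
A = rng.normal(size=(k, p))
x = rng.normal(size=(n, p))                   # any distribution on X works here
xi = rng.normal(size=(n, k))
t = x @ A.T + xi                              # T = A X + xi, xi ~ N(0, I)
grad = (t - x @ A.T) @ A                      # rows are grad_x log p(t|x) = A^T (t - A x)
phi_mc = np.mean(np.sum(grad ** 2, axis=1))   # Monte Carlo estimate of Phi(T|X)
print(phi_mc, np.sum(A ** 2))                 # both are approximately ||A||_F^2
```
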
Gaussian variables

It can be shown (in both RIB formulations) that when (X, Y) are jointly Gaussian, the optimal T is also Gaussian
Consequently, assume T = AX + ξ, where ξ ∼ N(0, I) and A ∈ R^{k×p}, and optimize over A
For the MMSE formulation, the objective can be rewritten as
    min_A  tr( Σ_y − Σ_{yx} A^T (A Σ_x A^T + I)^{−1} A Σ_{xy} ) + β tr(A^T A)
When Σ_x = σ_x² I, one can show that the optimal projection is A = (1/σ_x) D^{1/2} U^T, where U has columns equal to the top eigenvectors of Σ_{xy} Σ_{yx} and the entries of D are
    d_i = λ_i/β − 1 if λ_i ≥ β,   and d_i = 0 otherwise

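The closed form above is easy to compute numerically. A NumPy sketch, assuming Σ_x = σ_x² I and that the cross-covariance Σ_xy is given (the function name and interface are hypothetical):

```python
import numpy as np

def optimal_projection_mmse(Sigma_xy, sigma_x, beta, k):
    """Closed-form MMSE-RIB projection when Sigma_x = sigma_x^2 I:
    A = (1/sigma_x) D^{1/2} U^T, with d_i = lambda_i/beta - 1 if lambda_i >= beta, else 0."""
    M = Sigma_xy @ Sigma_xy.T                  # Sigma_xy Sigma_yx  (p x p, symmetric PSD)
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]
    lam, U = evals[top], evecs[:, top]         # top-k eigenvalues / eigenvectors
    d = np.maximum(lam / beta - 1.0, 0.0)      # eigenvalues below beta are dropped entirely
    return (np.sqrt(d)[:, None] * U.T) / sigma_x
```
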
Gaussian variables

For the mutual information formulation, the optimal choice of T corresponds to
    A = (1/σ_x) (D^{−1} − I)^{1/2} U^T,
where U consists of the top eigenvectors of (Σ_y − Σ_{yx} Σ_{xy})^{−1/2} Σ_{yx} and the entries of D are
    d_i = argmin_{0 ≤ d ≤ 1} { (1/2) log(d/λ_i + 1) − γ log d + β/d }
Again, larger β encourages d_i → 1, so the scalings → 0

Variational optimization strategy

How to optimize the robust information bottleneck objective, in general?
    inf_{T ⊥⊥ Y | X} { −I(T; Y) + γ I(T; X) + β Φ(T|X) }
"Deep variational information bottleneck," Alemi et al. ('17)
The idea is to restrict the optimization to a certain class of conditional distributions p_{T|X}
Find the minimizer of a surrogate function which upper-bounds the objective, using SGD
Builds upon the idea of "variational autoencoders" (Kingma & Welling, '13)

Variational optimization strategy

Train an encoder neural network to extract features T | X = x ∼ N(µ(x; θ), Σ(x; θ))
For the Φ(T|X) term, we can compute the gradients explicitly:
    ∇_x log p_{T|X}(t|x) = ∇_x log [ ((2π)^k ∏_{j=1}^k σ_j²(x))^{−1/2} exp( −∑_{j=1}^k (t_j − µ_j(x))² / (2σ_j(x)²) ) ]
    = −∑_{j=1}^k ∇_x σ_j(x) / σ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x)) / σ_j(x)²] ∇_x µ_j(x) + ∑_{j=1}^k [(t_j − µ_j(x))² / σ_j(x)³] ∇_x σ_j(x)
Importantly, Φ(T|X) (and its gradients) can be computed in terms of the gradients of the encoder outputs ({µ_j(x)}, {σ_j(x)}) with respect to the input x

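Equivalently, Φ(T|X) can be estimated by differentiating log p_{T|X}(t|x) with respect to the input via automatic differentiation. A PyTorch sketch; the `encoder` interface returning (µ(x), σ(x)) for a batch is an assumption.

```python
import torch

def phi_estimate(encoder, x, n_samples=8):
    """Monte Carlo estimate of Phi(T|X) for T | X = x ~ N(mu(x), diag(sigma(x)^2))."""
    x = x.clone().requires_grad_(True)
    mu, sigma = encoder(x)                              # assumed encoder outputs, shape [batch, k]
    total = 0.0
    for _ in range(n_samples):
        t = (mu + sigma * torch.randn_like(mu)).detach()            # t ~ p(.|x), held fixed
        log_p = (-((t - mu) ** 2) / (2 * sigma ** 2) - torch.log(sigma)).sum()
        (grad,) = torch.autograd.grad(log_p, x, create_graph=True)  # per-example grad_x log p(t|x)
        total = total + (grad ** 2).flatten(1).sum(dim=1).mean()
    return total / n_samples                            # differentiable, so usable as a regularizer
```
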
Variational optimization strategy

For the mmse term, we have the following upper bound for any f̃:
    mmse(Y|T) ≤ E[(Y − f̃(T))²],
and we will train a decoder network f̃ parametrized by φ
Importantly, all terms in the variational objective can be approximated from samples, and the sum can be optimized using SGD
[Figure: the encoder maps x to (µ, Σ); t ∼ N(µ, Σ) is passed to the decoder, which outputs ŷ; a copy of the encoder supplies the input gradients ∇_x µ and ∇_x Σ to the loss ℓ(θ, φ, x, y, ∇_x µ, ∇_x Σ).]

Variational optimization strategy

More details:
    E[(Y − f̃(T))²] ≈ (1/n) ∑_{i=1}^n ∫ (y_i − f̃(t; φ))² p_{T|X}(t | x_i; θ) dt
Furthermore, we can use the reparametrization trick (Kingma & Welling, '13) to write the latter integral as
    ∫ (y_i − f̃(τ(x_i, ε; θ); φ))² p_ε(ε) dε,
where τ(x, ε; θ) = µ(x; θ) + Σ(x; θ)^{1/2} ε and ε ∼ N(0, I)
Thus, we can easily generate stochastic gradients by plugging in standard normal variables and the input data

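A minimal PyTorch sketch of this reparametrized Monte Carlo estimate, using one ε sample per data point; `decoder`, `mu`, and `sigma` are placeholders for the decoder network and the encoder outputs.

```python
import torch

def reparam_sample(mu, sigma):
    """tau(x, eps; theta) = mu(x; theta) + Sigma(x; theta)^{1/2} eps, with diagonal Sigma."""
    eps = torch.randn_like(mu)            # eps ~ N(0, I)
    return mu + sigma * eps               # differentiable in the encoder parameters

def mmse_term(decoder, mu, sigma, y):
    """One-sample estimate of (1/n) sum_i (y_i - f~(tau(x_i, eps; theta); phi))^2."""
    t = reparam_sample(mu, sigma)
    return ((y - decoder(t).squeeze(-1)) ** 2).mean()
```
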
Variational optimization strategy

The variational approximation for −I(T; Y) + γ I(T; X) is more complicated, but it also involves training a decoder function

Experiments (Gaussian mixture)

Generate Y = ±1 with probability 1/2 each, and
    X | Y = 1 ∼ N( (1, 1), diag(16, 1) ),   X | Y = −1 ∼ N( (−1, −1), diag(16, 1) )
Extract linear features: T = w^T X + ξ, where ξ ∼ N(0, 1)

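A NumPy sketch of this data-generating process; the particular w used for the linear feature is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n):
    """Y = +/-1 with probability 1/2; X | Y = y ~ N((y, y), diag(16, 1))."""
    y = rng.choice([-1, 1], size=n)
    x = np.column_stack([y + 4.0 * rng.normal(size=n),   # variance 16 in coordinate 1
                         y + 1.0 * rng.normal(size=n)])  # variance 1 in coordinate 2
    return x, y

x, y = sample_mixture(5000)
w = np.array([0.1, 1.0])                     # an illustrative choice of linear feature
t = x @ w + rng.normal(size=len(y))          # T = w^T X + xi, xi ~ N(0, 1)
```
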
Experiments (Gaussian mixture)

MMSE formulation: inf_w { mmse(Y|T) + β Φ(T|X) }
As β increases, the slope of w tilts and ‖w‖₂ decreases

Experiments (MNIST)

≈ 50,000 training images, 10,000 test images, 28 × 28 pixels
Two hidden layers of width 1024, output feature dimension k = 2
Decoder is a logistic classifier with softmax

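A PyTorch sketch of the architecture described above; the choice of ReLU activations and of a log-std output head are assumptions not stated on the slide.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Two hidden layers of width 1024; outputs mean and diagonal std for k = 2 features."""
    def __init__(self, k=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.mu = nn.Linear(1024, k)
        self.log_sigma = nn.Linear(1024, k)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_sigma(h).exp()

decoder = nn.Sequential(nn.Linear(2, 10))   # logistic (softmax) classifier over the 10 digits
```
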
Experiments (MNIST)

Mutual information formulation: inf_w { −I(T; Y) + γ I(T; X) + β Φ(T|X) } with γ = 0
As β increases, the accuracy of the classifier decreases and the Fisher information term decreases

Contributions

Presented new RIB method for extracting robust features
Derived rigorous theory for optimal features in the Gaussian setting
Provided variational optimization method for extracting features in the general case
Preliminary experimental results are promising!

Future work

Show that this method is actually robust to adversarial (or average-case) perturbations . . .
Explore the pros and cons of the mutual information vs. MMSE formulations
Alternative methods for robust feature extraction?

Thank you!