Statistical inference using
stochastic gradient descent
Constantine Caramanis1
Liu Liu1 Anastasios (Tasos) Kyrillidis2 Tianyang Li1
1The University of Texas at Austin
2IBM T.J. Watson Research Center, Yorktown Heights → Rice University
Statistical inference is important
Quantifying uncertainty
Signal? Noise?
Skill? Luck?
Frequentist inference
confidence interval
hypothesis testing
Confidence intervals can be used to detect adversarial
attacks.
Outline of This Work
(a) Large-scale problems: point estimates computed via SGD.
(b) Confidence intervals computed by bootstrap: too expensive.
(c) This talk: we can compute them using SGD.
(d) Application to adversarial attacks: implicitly learning the manifold.
SGD in ERM – mini batch SGD
To solve empirical risk minimization (ERM)
\[ f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta), \]
where $f_i(\theta) = \ell_\theta(Z_i)$ is the loss on sample $Z_i$.
At each step:
Draw $S$ i.i.d. uniformly random indices $I_t$ from $[n]$ (with replacement)
Compute the stochastic gradient $g_s(\theta_t) = \frac{1}{S} \sum_{i \in I_t} \nabla f_i(\theta_t)$
Update $\theta_{t+1} = \theta_t - \eta\, g_s(\theta_t)$
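The update above can be sketched in a few lines. A minimal mini-batch SGD run on a toy least-squares ERM instance; the data, step size, and batch size here are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ERM instance: least squares, f_i(theta) = 0.5 * (x_i . theta - y_i)^2.
n, p = 1000, 5
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=n)

eta, S, T = 0.05, 10, 2000
theta = np.zeros(p)
for _ in range(T):
    # Draw S i.i.d. uniformly random indices from [n], with replacement
    idx = rng.integers(0, n, size=S)
    # Stochastic gradient: batch average of per-sample gradients
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / S
    # Update: theta_{t+1} = theta_t - eta * g_s(theta_t)
    theta = theta - eta * g

print(np.linalg.norm(theta - theta_true))  # close to 0
```

With a fixed step size the iterates converge to a noise ball around the minimizer rather than to it exactly; that residual fluctuation is precisely what the inference procedure later exploits.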
Asymptotic normality – classical results
M-estimator – statistics
When the number of samples $n \to \infty$,
\[ \sqrt{n}\,(\hat\theta - \theta^*) \rightsquigarrow N(0, H^{*-1} G^* H^{*-1}), \]
where $G^* = \mathbb{E}_Z[\nabla_\theta \ell_{\theta^*}(Z)\, \nabla_\theta \ell_{\theta^*}(Z)^\top]$ and $H^* = \mathbb{E}_Z[\nabla^2_\theta \ell_{\theta^*}(Z)]$.
Stochastic approximation – optimization
When the number of steps $t \to \infty$,
\[ \sqrt{t}\left(\frac{1}{t}\sum_{i=1}^{t} \theta_i - \hat\theta\right) \rightsquigarrow N(0, H^{-1} G H^{-1}), \]
where $G = \mathbb{E}[g_s(\theta)g_s(\theta)^\top \mid \theta = \hat\theta]$ and $H = \nabla^2 f(\hat\theta)$.
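The M-estimator sandwich covariance $H^{*-1} G^* H^{*-1}$ can be estimated by plugging in empirical moments. A sketch for ordinary least squares on toy data (this illustrates the classical estimator itself, not this talk's SGD method; all data and constants below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# OLS: ell_theta(Z_i) = 0.5 * (x_i . theta - y_i)^2, so the plug-in moments are
# H* ≈ X^T X / n and G* ≈ (1/n) * sum_i grad_i grad_i^T at theta_hat.
n, p = 2000, 3
X = rng.normal(size=(n, p))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = X @ theta_hat - y
grads = X * resid[:, None]                     # per-sample gradients
H = X.T @ X / n
G = grads.T @ grads / n
sandwich = np.linalg.inv(H) @ G @ np.linalg.inv(H)
se = np.sqrt(np.diag(sandwich) / n)            # standard errors of theta_hat
# 95% normal confidence intervals: theta_hat[j] ± 1.96 * se[j]
```

Computing $H$ and its inverse is exactly the step that becomes expensive in high dimensions, which motivates getting the same covariance out of SGD instead.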
SGD not only useful for optimization,
but also useful for statistical inference!
Statistical inference using mini batch SGD
Burn in: $\theta_{-b}, \theta_{-b+1}, \cdots, \theta_{-1}, \theta_0$. Per-replicate average: $\bar\theta_t^{(i)} = \frac{1}{t} \sum_{j=1}^{t} \theta_j^{(i)}$.

Replicate 1: $\theta_1^{(1)}, \theta_2^{(1)}, \cdots, \theta_t^{(1)}$; discarded: $\theta_{t+1}^{(1)}, \theta_{t+2}^{(1)}, \cdots, \theta_{t+d}^{(1)}$
Replicate 2: $\theta_1^{(2)}, \theta_2^{(2)}, \cdots, \theta_t^{(2)}$; discarded: $\theta_{t+1}^{(2)}, \theta_{t+2}^{(2)}, \cdots, \theta_{t+d}^{(2)}$
...
Replicate R: $\theta_1^{(R)}, \theta_2^{(R)}, \cdots, \theta_t^{(R)}$; discarded: $\theta_{t+1}^{(R)}, \theta_{t+2}^{(R)}, \cdots, \theta_{t+d}^{(R)}$

At each step:
Draw $S$ i.i.d. uniformly random indices $I_t$ from $[n]$ (with replacement)
Compute the stochastic gradient $g_s(\theta_t) = \frac{1}{S} \sum_{i \in I_t} \nabla f_i(\theta_t)$
Update $\theta_{t+1} = \theta_t - \eta\, g_s(\theta_t)$

Use an ensemble of $i = 1, 2, \ldots, R$ estimators for statistical inference:
\[ \theta^{(i)} = \hat\theta + \frac{\sqrt{S}\,\sqrt{t}}{\sqrt{n}} \left(\bar\theta_t^{(i)} - \hat\theta\right). \]
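The replicate scheme above can be sketched end to end on a toy problem: estimating a Gaussian mean with $f_i(\theta) = \frac{1}{2}(\theta - Z_i)^2$. The burn-in length, step size, and replicate parameters below are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 1-D Gaussian mean estimation; the ERM minimizer is the sample mean.
n = 1000
Z = rng.normal(loc=2.0, size=n)
theta_hat = Z.mean()

def sgd_inference(Z, theta_hat, eta=0.1, S=5, t=200, d=50, R=100, burn_in=100):
    n = len(Z)
    theta = 0.0
    # burn in: theta_{-b}, ..., theta_0
    for _ in range(burn_in):
        theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
    reps = []
    for _ in range(R):
        # average t consecutive iterates for this replicate
        avg = 0.0
        for _ in range(t):
            theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
            avg += theta / t
        reps.append(avg)
        # discard d iterates between replicates to reduce correlation
        for _ in range(d):
            theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
    # rescale: theta^(i) = theta_hat + sqrt(S) sqrt(t) / sqrt(n) * (avg - theta_hat)
    return theta_hat + np.sqrt(S * t / n) * (np.array(reps) - theta_hat)

samples = sgd_inference(Z, theta_hat)
lo, hi = np.percentile(samples, [2.5, 97.5])   # 95% confidence interval
```

The quantiles of the ensemble `samples` serve the same role as quantiles of bootstrap replicates, but each replicate costs only $t + d$ mini-batch SGD steps instead of a full re-optimization.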
Advantages of SGD inference
empirically no more expensive, and uses many fewer operations than the bootstrap
can be used when training neural networks with SGD
easy to plug into existing SGD code
Other statistical inference methods
directly computing the inverse Fisher information matrix
resampling: bootstrap, subsampling
Too computationally expensive,
not suited for “big data”!
Intuition – Ornstein-Uhlenbeck process approximation
In SGD, denote $\Delta_t = \theta_t - \hat\theta$, so that
\[ \Delta_{t+1} = \Delta_t - \eta\, g_s(\hat\theta + \Delta_t). \]
$\Delta_t$ can be approximated by the Ornstein-Uhlenbeck process
\[ d\Delta(T) = -H \Delta(T)\, dT + \sqrt{\eta}\, G^{1/2}\, dB(T), \]
where $B(T)$ is a standard Brownian motion.
Intuition – Ornstein-Uhlenbeck process approximation
Denote $\bar\theta_t = \frac{1}{t} \sum_{i=1}^{t} \theta_i$. Then $\sqrt{t}\,(\bar\theta_t - \hat\theta)$ can be approximated as
\[ \sqrt{t}\,(\bar\theta_t - \hat\theta) = \frac{1}{\sqrt{t}} \sum_{i=1}^{t} (\theta_i - \hat\theta) = \frac{1}{\eta\sqrt{t}} \sum_{i=1}^{t} (\theta_i - \hat\theta)\,\eta \approx \frac{1}{\eta\sqrt{t}} \int_0^{t\eta} \Delta(T)\, dT, \tag{1} \]
where we use the approximation $\eta \approx dT$. Rearranging terms in the stochastic differential equation and multiplying both sides by $H^{-1}$, we can rewrite it as $\Delta(T)\, dT = -H^{-1}\, d\Delta(T) + \sqrt{\eta}\, H^{-1} G^{1/2}\, dB(T)$. Thus, we have
\[ \int_0^{t\eta} \Delta(T)\, dT = -H^{-1}\left(\Delta(t\eta) - \Delta(0)\right) + \sqrt{\eta}\, H^{-1} G^{1/2} B(t\eta). \tag{2} \]
After plugging (2) into (1) we have
\[ \sqrt{t}\left(\bar\theta_t - \hat\theta\right) \approx -\frac{1}{\eta\sqrt{t}}\, H^{-1}\left(\Delta(t\eta) - \Delta(0)\right) + \frac{1}{\sqrt{t\eta}}\, H^{-1} G^{1/2} B(t\eta). \]
When $\Delta(0) = 0$, the variance $\mathrm{Var}\!\left[-\frac{1}{\eta\sqrt{t}}\, H^{-1}\left(\Delta(t\eta) - \Delta(0)\right)\right] = O(1/t\eta)$. Since $\frac{1}{\sqrt{t\eta}}\, H^{-1} G^{1/2} B(t\eta) \sim N(0, H^{-1} G H^{-1})$, when $\eta \to 0$ and $\eta t \to \infty$ we conclude that
\[ \sqrt{t}\,(\bar\theta_t - \hat\theta) \rightsquigarrow N(0, H^{-1} G H^{-1}). \]
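This conclusion can be sanity-checked numerically. A sketch on a one-dimensional quadratic $f(\theta) = \frac{1}{2} H \theta^2$ with additive gradient noise of variance $G$ (a toy setting chosen here; all constants are illustrative), where the limit covariance $H^{-1} G H^{-1}$ reduces to $G/H^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D quadratic with additive gradient noise; theta_hat = 0, Delta(0) = 0.
# The averaged iterate should satisfy Var(sqrt(t) * theta_bar_t) ≈ G / H^2.
H, G = 2.0, 0.5
eta, t, reps = 0.02, 20000, 500        # small eta, large eta*t, as required

theta = np.zeros(reps)                  # run `reps` independent chains at once
avg = np.zeros(reps)
for _ in range(t):
    g = H * theta + np.sqrt(G) * rng.normal(size=reps)  # stochastic gradient
    theta -= eta * g
    avg += theta / t

devs = np.sqrt(t) * avg
print(devs.var(), G / H**2)             # both should be near 0.125
```

The empirical variance of the scaled averaged iterates matches the sandwich value, even though each individual iterate $\theta_t$ has variance of order $\eta G / H$, far from it.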
Theoretical guarantee
Theorem
For a differentiable convex function $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$ with gradient $\nabla f(\theta)$, let $\hat\theta \in \mathbb{R}^p$ be its minimizer, and denote its Hessian at $\hat\theta$ by $H := \nabla^2 f(\hat\theta)$. Assume that $\forall \theta \in \mathbb{R}^p$, $f$ satisfies:
(F1) Weak strong convexity: $(\theta - \hat\theta)^\top \nabla f(\theta) \ge \alpha \|\theta - \hat\theta\|_2^2$, for constant $\alpha > 0$;
(F2) Lipschitz gradient continuity: $\|\nabla f(\theta)\|_2 \le L \|\theta - \hat\theta\|_2$, for constant $L > 0$;
(F3) Bounded Taylor remainder: $\|\nabla f(\theta) - H(\theta - \hat\theta)\|_2 \le E \|\theta - \hat\theta\|_2^2$, for constant $E > 0$;
(F4) Bounded Hessian spectrum at $\hat\theta$: $0 < \lambda_L \le \lambda_i(H) \le \lambda_U < \infty$, $\forall i$.
Furthermore, let $g_s(\theta)$ be a stochastic gradient of $f$, satisfying:
(G1) $\mathbb{E}[g_s(\theta) \mid \theta] = \nabla f(\theta)$;
(G2) $\mathbb{E}[\|g_s(\theta)\|_2^2 \mid \theta] \le A \|\theta - \hat\theta\|_2^2 + B$;
(G3) $\mathbb{E}[\|g_s(\theta)\|_2^4 \mid \theta] \le C \|\theta - \hat\theta\|_2^4 + D$;
(G4) $\|\mathbb{E}[g_s(\theta) g_s(\theta)^\top \mid \theta] - G\|_2 \le A_1 \|\theta - \hat\theta\|_2 + A_2 \|\theta - \hat\theta\|_2^2 + A_3 \|\theta - \hat\theta\|_2^3 + A_4 \|\theta - \hat\theta\|_2^4$;
for positive, data-dependent constants $A, B, C, D, A_i$, for $i = 1, \ldots, 4$. Assume that $\|\theta_1 - \hat\theta\|_2^2 = O(\eta)$; then for sufficiently small step size $\eta > 0$, the averaged SGD sequence $\bar\theta_t = \frac{1}{t} \sum_{i=1}^{t} \theta_i$ satisfies:
\[ \left\| t\, \mathbb{E}\!\left[(\bar\theta_t - \hat\theta)(\bar\theta_t - \hat\theta)^\top\right] - H^{-1} G H^{-1} \right\|_2 \lesssim \sqrt{\eta} + \frac{1}{t\eta} + t\eta^2, \]
where $G = \mathbb{E}[g_s(\hat\theta) g_s(\hat\theta)^\top \mid \hat\theta]$.
Proof idea: $H^{-1} = \eta \sum_{i \ge 0} (I - \eta H)^i$
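The series in the proof idea can be checked numerically: for $0 < \eta < 1/\lambda_{\max}(H)$ the truncated Neumann sum converges to $H^{-1}$. A sketch with a small random positive definite matrix (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small symmetric positive definite H; eta chosen safely below 1/lambda_max(H)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)
eta = 0.5 / np.linalg.eigvalsh(H).max()

# Accumulate eta * sum_{i>=0} (I - eta*H)^i, truncating the infinite sum
approx = np.zeros_like(H)
power = np.eye(4)
for _ in range(2000):
    approx += eta * power
    power = power @ (np.eye(4) - eta * H)

print(np.linalg.norm(approx - np.linalg.inv(H)))  # ~0
```

Since each SGD step applies $(I - \eta H)$ to the error (to first order), the averaged iterates effectively accumulate this series, which is how $H^{-1}$ appears in the limiting covariance without ever being computed explicitly.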
Comparison with bootstrap
Univariate model estimation

Figure: densities of $\theta_{\mathrm{SGD}} - \bar\theta_{\mathrm{SGD}}$ and $\theta_{\mathrm{bootstrap}} - \bar\theta_{\mathrm{bootstrap}}$, overlaid on $N(0, 1/n)$, for three univariate models:
(a) Normal: $\frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2}\right)$, $\mu = 0$
(b) Exponential: $\mu e^{-\mu x}$, $\mu = 1$
(c) Poisson: $\frac{\mu^x e^{-\mu}}{x!}$, $\mu = 1$
95% confidence interval coverage simulation
Each table cell reports (coverage probability, confidence interval width).

η       t = 100         t = 500         t = 2500
0.1     (0.957, 4.41)   (0.955, 4.51)   (0.960, 4.53)
0.02    (0.869, 3.30)   (0.923, 3.77)   (0.918, 3.87)
0.004   (0.634, 2.01)   (0.862, 3.20)   (0.916, 3.70)
(a) Bootstrap (0.941, 4.14), normal approximation (0.928, 3.87)

η       t = 100         t = 500         t = 2500
0.1     (0.949, 4.74)   (0.962, 4.91)   (0.963, 4.94)
0.02    (0.845, 3.37)   (0.916, 4.01)   (0.927, 4.17)
0.004   (0.616, 2.00)   (0.832, 3.30)   (0.897, 3.93)
(b) Bootstrap (0.938, 4.47), normal approximation (0.925, 4.18)

Table 1: Linear regression: dimension = 10, 100 samples. (a) diagonal covariance; (b) non-diagonal covariance.

η       t = 100          t = 500          t = 2500
0.1     (0.872, 0.204)   (0.937, 0.249)   (0.939, 0.258)
0.02    (0.610, 0.112)   (0.871, 0.196)   (0.926, 0.237)
0.004   (0.312, 0.051)   (0.596, 0.111)   (0.86, 0.194)
(a) Bootstrap (0.932, 0.253), normal approximation (0.957, 0.264)

η       t = 100          t = 500          t = 2500
0.1     (0.859, 0.206)   (0.931, 0.255)   (0.947, 0.266)
0.02    (0.600, 0.112)   (0.847, 0.197)   (0.931, 0.244)
0.004   (0.302, 0.051)   (0.583, 0.111)   (0.851, 0.195)
(b) Bootstrap (0.932, 0.245), normal approximation (0.954, 0.256)

Table 2: Logistic regression: dimension = 10, 1000 samples. (a) diagonal covariance; (b) non-diagonal covariance.

Coverage is better when:
each replicate's average uses a longer consecutive sequence (larger t)
the step size η is larger
Adversarial Attacks
Neural network classifiers with very high accuracy on test sets are
extremely susceptible to nearly imperceptible adversarial attacks.
Confidence intervals for mitigating adversarial examples
MNIST – logistic regression

Figure 1: MNIST adversarial perturbation (scaled for display).
(b) Original “0”: $P\{0 \mid \text{image}\} \approx 1 - e^{-46}$, CI $\approx (1 - e^{-28}, 1 - e^{-64})$
(c) Adversarial “0”: $P\{0 \mid \text{image}\} \approx e^{-17}$, CI $\approx (e^{-31}, 1 - e^{-11})$
Adversarial examples produced by the gradient attack have large confidence intervals!