Statistical inference using
stochastic gradient descent
Constantine Caramanis1
Liu Liu1 Anastasios (Tasos) Kyrillidis2 Tianyang Li1
1The University of Texas at Austin
2IBM T.J. Watson Research Center, Yorktown Heights → Rice University
Statistical inference is important
Quantifying uncertainty
Signal? Noise?
Skill? Luck?
Frequentist inference
confidence interval
hypothesis testing
Confidence intervals can be used to detect adversarial
attacks.
Outline of This Work
(a) Large-scale problems: point estimates computed via SGD.
(b) Confidence intervals computed by bootstrap: too expensive.
(c) This talk: we can compute them using SGD.
(d) Application to adversarial attacks: implicitly learning the manifold.
SGD in ERM – mini batch SGD
To solve empirical risk minimization (ERM)
\[ f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta), \]
where $f_i(\theta) = \ell_\theta(Z_i)$ is the loss on sample $Z_i$.
At each step:
Draw $S$ i.i.d. uniformly random indices $I_t$ from $[n]$ (with replacement)
Compute the stochastic gradient $g_s(\theta_t) = \frac{1}{S} \sum_{i \in I_t} \nabla f_i(\theta_t)$
Update $\theta_{t+1} = \theta_t - \eta\, g_s(\theta_t)$
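The update above can be sketched in a few lines. A minimal mini-batch SGD run on a toy least-squares ERM instance; the data, step size, and batch size here are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ERM instance: least squares, f_i(theta) = 0.5 * (x_i . theta - y_i)^2.
n, p = 1000, 5
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=n)

eta, S, T = 0.05, 10, 2000
theta = np.zeros(p)
for _ in range(T):
    # Draw S i.i.d. uniformly random indices from [n], with replacement
    idx = rng.integers(0, n, size=S)
    # Stochastic gradient: batch average of per-sample gradients
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / S
    # Update: theta_{t+1} = theta_t - eta * g_s(theta_t)
    theta = theta - eta * g

print(np.linalg.norm(theta - theta_true))  # close to 0
```

With a fixed step size the iterates converge to a noise ball around the minimizer rather than to it exactly; that residual fluctuation is precisely what the inference procedure later exploits.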
Asymptotic normality – classical results
M-estimator – statistics
When the number of samples $n \to \infty$,
\[ \sqrt{n}\,(\hat\theta - \theta^*) \rightsquigarrow N(0, H^{*-1} G^* H^{*-1}), \]
where $G^* = \mathbb{E}_Z[\nabla_\theta \ell_{\theta^*}(Z)\, \nabla_\theta \ell_{\theta^*}(Z)^\top]$ and $H^* = \mathbb{E}_Z[\nabla^2_\theta \ell_{\theta^*}(Z)]$.
Stochastic approximation – optimization
When the number of steps $t \to \infty$,
\[ \sqrt{t}\left(\frac{1}{t}\sum_{i=1}^{t} \theta_i - \hat\theta\right) \rightsquigarrow N(0, H^{-1} G H^{-1}), \]
where $G = \mathbb{E}[g_s(\theta)g_s(\theta)^\top \mid \theta = \hat\theta]$ and $H = \nabla^2 f(\hat\theta)$.
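The M-estimator sandwich covariance $H^{*-1} G^* H^{*-1}$ can be estimated by plugging in empirical moments. A sketch for ordinary least squares on toy data (this illustrates the classical estimator itself, not this talk's SGD method; all data and constants below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# OLS: ell_theta(Z_i) = 0.5 * (x_i . theta - y_i)^2, so the plug-in moments are
# H* ≈ X^T X / n and G* ≈ (1/n) * sum_i grad_i grad_i^T at theta_hat.
n, p = 2000, 3
X = rng.normal(size=(n, p))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = X @ theta_hat - y
grads = X * resid[:, None]                     # per-sample gradients
H = X.T @ X / n
G = grads.T @ grads / n
sandwich = np.linalg.inv(H) @ G @ np.linalg.inv(H)
se = np.sqrt(np.diag(sandwich) / n)            # standard errors of theta_hat
# 95% normal confidence intervals: theta_hat[j] ± 1.96 * se[j]
```

Computing $H$ and its inverse is exactly the step that becomes expensive in high dimensions, which motivates getting the same covariance out of SGD instead.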
SGD not only useful for optimization,
but also useful for statistical inference!
Statistical inference using mini batch SGD
Burn in: $\theta_{-b}, \theta_{-b+1}, \cdots, \theta_{-1}, \theta_0$. Per-replicate average: $\bar\theta_t^{(i)} = \frac{1}{t} \sum_{j=1}^{t} \theta_j^{(i)}$.

Replicate 1: $\theta_1^{(1)}, \theta_2^{(1)}, \cdots, \theta_t^{(1)}$; discarded: $\theta_{t+1}^{(1)}, \theta_{t+2}^{(1)}, \cdots, \theta_{t+d}^{(1)}$
Replicate 2: $\theta_1^{(2)}, \theta_2^{(2)}, \cdots, \theta_t^{(2)}$; discarded: $\theta_{t+1}^{(2)}, \theta_{t+2}^{(2)}, \cdots, \theta_{t+d}^{(2)}$
...
Replicate R: $\theta_1^{(R)}, \theta_2^{(R)}, \cdots, \theta_t^{(R)}$; discarded: $\theta_{t+1}^{(R)}, \theta_{t+2}^{(R)}, \cdots, \theta_{t+d}^{(R)}$

At each step:
Draw $S$ i.i.d. uniformly random indices $I_t$ from $[n]$ (with replacement)
Compute the stochastic gradient $g_s(\theta_t) = \frac{1}{S} \sum_{i \in I_t} \nabla f_i(\theta_t)$
Update $\theta_{t+1} = \theta_t - \eta\, g_s(\theta_t)$

Use an ensemble of $i = 1, 2, \ldots, R$ estimators for statistical inference:
\[ \theta^{(i)} = \hat\theta + \frac{\sqrt{S}\,\sqrt{t}}{\sqrt{n}} \left(\bar\theta_t^{(i)} - \hat\theta\right). \]
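The replicate scheme above can be sketched end to end on a toy problem: estimating a Gaussian mean with $f_i(\theta) = \frac{1}{2}(\theta - Z_i)^2$. The burn-in length, step size, and replicate parameters below are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 1-D Gaussian mean estimation; the ERM minimizer is the sample mean.
n = 1000
Z = rng.normal(loc=2.0, size=n)
theta_hat = Z.mean()

def sgd_inference(Z, theta_hat, eta=0.1, S=5, t=200, d=50, R=100, burn_in=100):
    n = len(Z)
    theta = 0.0
    # burn in: theta_{-b}, ..., theta_0
    for _ in range(burn_in):
        theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
    reps = []
    for _ in range(R):
        # average t consecutive iterates for this replicate
        avg = 0.0
        for _ in range(t):
            theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
            avg += theta / t
        reps.append(avg)
        # discard d iterates between replicates to reduce correlation
        for _ in range(d):
            theta -= eta * (theta - Z[rng.integers(0, n, size=S)].mean())
    # rescale: theta^(i) = theta_hat + sqrt(S) sqrt(t) / sqrt(n) * (avg - theta_hat)
    return theta_hat + np.sqrt(S * t / n) * (np.array(reps) - theta_hat)

samples = sgd_inference(Z, theta_hat)
lo, hi = np.percentile(samples, [2.5, 97.5])   # 95% confidence interval
```

The quantiles of the ensemble `samples` serve the same role as quantiles of bootstrap replicates, but each replicate costs only $t + d$ mini-batch SGD steps instead of a full re-optimization.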
Advantages of SGD inference
empirically no more expensive, and uses many fewer operations than the bootstrap
can be used when training neural networks with SGD
easy to plug into existing SGD code
Other statistical inference methods
directly computing the inverse Fisher information matrix
resampling: bootstrap, subsampling
Too computationally expensive,
not suited for “big data”!
Intuition – Ornstein-Uhlenbeck process approximation
In SGD, denote $\Delta_t = \theta_t - \hat\theta$, so that
\[ \Delta_{t+1} = \Delta_t - \eta\, g_s(\hat\theta + \Delta_t). \]
$\Delta_t$ can be approximated by the Ornstein-Uhlenbeck process
\[ d\Delta(T) = -H \Delta(T)\, dT + \sqrt{\eta}\, G^{1/2}\, dB(T), \]
where $B(T)$ is a standard Brownian motion.
Intuition – Ornstein-Uhlenbeck process approximation
Denote $\bar\theta_t = \frac{1}{t} \sum_{i=1}^{t} \theta_i$. Then $\sqrt{t}\,(\bar\theta_t - \hat\theta)$ can be approximated as
\[ \sqrt{t}\,(\bar\theta_t - \hat\theta) = \frac{1}{\sqrt{t}} \sum_{i=1}^{t} (\theta_i - \hat\theta) = \frac{1}{\eta\sqrt{t}} \sum_{i=1}^{t} (\theta_i - \hat\theta)\,\eta \approx \frac{1}{\eta\sqrt{t}} \int_0^{t\eta} \Delta(T)\, dT, \tag{1} \]
where we use the approximation $\eta \approx dT$. Rearranging terms in the stochastic differential equation and multiplying both sides by $H^{-1}$, we can rewrite it as $\Delta(T)\, dT = -H^{-1}\, d\Delta(T) + \sqrt{\eta}\, H^{-1} G^{1/2}\, dB(T)$. Thus, we have
\[ \int_0^{t\eta} \Delta(T)\, dT = -H^{-1}\left(\Delta(t\eta) - \Delta(0)\right) + \sqrt{\eta}\, H^{-1} G^{1/2} B(t\eta). \tag{2} \]
After plugging (2) into (1) we have
\[ \sqrt{t}\left(\bar\theta_t - \hat\theta\right) \approx -\frac{1}{\eta\sqrt{t}}\, H^{-1}\left(\Delta(t\eta) - \Delta(0)\right) + \frac{1}{\sqrt{t\eta}}\, H^{-1} G^{1/2} B(t\eta). \]
When $\Delta(0) = 0$, the variance $\mathrm{Var}\!\left[-\frac{1}{\eta\sqrt{t}}\, H^{-1}\left(\Delta(t\eta) - \Delta(0)\right)\right] = O(1/t\eta)$. Since $\frac{1}{\sqrt{t\eta}}\, H^{-1} G^{1/2} B(t\eta) \sim N(0, H^{-1} G H^{-1})$, when $\eta \to 0$ and $\eta t \to \infty$ we conclude that
\[ \sqrt{t}\,(\bar\theta_t - \hat\theta) \rightsquigarrow N(0, H^{-1} G H^{-1}). \]
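This conclusion can be sanity-checked numerically. A sketch on a one-dimensional quadratic $f(\theta) = \frac{1}{2} H \theta^2$ with additive gradient noise of variance $G$ (a toy setting chosen here; all constants are illustrative), where the limit covariance $H^{-1} G H^{-1}$ reduces to $G/H^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D quadratic with additive gradient noise; theta_hat = 0, Delta(0) = 0.
# The averaged iterate should satisfy Var(sqrt(t) * theta_bar_t) ≈ G / H^2.
H, G = 2.0, 0.5
eta, t, reps = 0.02, 20000, 500        # small eta, large eta*t, as required

theta = np.zeros(reps)                  # run `reps` independent chains at once
avg = np.zeros(reps)
for _ in range(t):
    g = H * theta + np.sqrt(G) * rng.normal(size=reps)  # stochastic gradient
    theta -= eta * g
    avg += theta / t

devs = np.sqrt(t) * avg
print(devs.var(), G / H**2)             # both should be near 0.125
```

The empirical variance of the scaled averaged iterates matches the sandwich value, even though each individual iterate $\theta_t$ has variance of order $\eta G / H$, far from it.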
Theoretical guarantee
Theorem
For a differentiable convex function $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$ with gradient $\nabla f(\theta)$, let $\hat\theta \in \mathbb{R}^p$ be its minimizer, and denote its Hessian at $\hat\theta$ by $H := \nabla^2 f(\hat\theta)$. Assume that $\forall \theta \in \mathbb{R}^p$, $f$ satisfies:
(F1) Weak strong convexity: $(\theta - \hat\theta)^\top \nabla f(\theta) \ge \alpha \|\theta - \hat\theta\|_2^2$, for constant $\alpha > 0$;
(F2) Lipschitz gradient continuity: $\|\nabla f(\theta)\|_2 \le L \|\theta - \hat\theta\|_2$, for constant $L > 0$;
(F3) Bounded Taylor remainder: $\|\nabla f(\theta) - H(\theta - \hat\theta)\|_2 \le E \|\theta - \hat\theta\|_2^2$, for constant $E > 0$;
(F4) Bounded Hessian spectrum at $\hat\theta$: $0 < \lambda_L \le \lambda_i(H) \le \lambda_U < \infty$, $\forall i$.
Furthermore, let $g_s(\theta)$ be a stochastic gradient of $f$, satisfying:
(G1) $\mathbb{E}[g_s(\theta) \mid \theta] = \nabla f(\theta)$;
(G2) $\mathbb{E}[\|g_s(\theta)\|_2^2 \mid \theta] \le A \|\theta - \hat\theta\|_2^2 + B$;
(G3) $\mathbb{E}[\|g_s(\theta)\|_2^4 \mid \theta] \le C \|\theta - \hat\theta\|_2^4 + D$;
(G4) $\|\mathbb{E}[g_s(\theta) g_s(\theta)^\top \mid \theta] - G\|_2 \le A_1 \|\theta - \hat\theta\|_2 + A_2 \|\theta - \hat\theta\|_2^2 + A_3 \|\theta - \hat\theta\|_2^3 + A_4 \|\theta - \hat\theta\|_2^4$;
for positive, data-dependent constants $A, B, C, D, A_i$, for $i = 1, \ldots, 4$. Assume that $\|\theta_1 - \hat\theta\|_2^2 = O(\eta)$; then for sufficiently small step size $\eta > 0$, the averaged SGD sequence $\bar\theta_t = \frac{1}{t} \sum_{i=1}^{t} \theta_i$ satisfies:
\[ \left\| t\, \mathbb{E}\!\left[(\bar\theta_t - \hat\theta)(\bar\theta_t - \hat\theta)^\top\right] - H^{-1} G H^{-1} \right\|_2 \lesssim \sqrt{\eta} + \frac{1}{t\eta} + t\eta^2, \]
where $G = \mathbb{E}[g_s(\hat\theta) g_s(\hat\theta)^\top \mid \hat\theta]$.
Proof idea: $H^{-1} = \eta \sum_{i \ge 0} (I - \eta H)^i$
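The series in the proof idea can be checked numerically: for $0 < \eta < 1/\lambda_{\max}(H)$ the truncated Neumann sum converges to $H^{-1}$. A sketch with a small random positive definite matrix (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small symmetric positive definite H; eta chosen safely below 1/lambda_max(H)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)
eta = 0.5 / np.linalg.eigvalsh(H).max()

# Accumulate eta * sum_{i>=0} (I - eta*H)^i, truncating the infinite sum
approx = np.zeros_like(H)
power = np.eye(4)
for _ in range(2000):
    approx += eta * power
    power = power @ (np.eye(4) - eta * H)

print(np.linalg.norm(approx - np.linalg.inv(H)))  # ~0
```

Since each SGD step applies $(I - \eta H)$ to the error (to first order), the averaged iterates effectively accumulate this series, which is how $H^{-1}$ appears in the limiting covariance without ever being computed explicitly.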
Comparison with bootstrap
Univariate model estimation

Figure: densities of $\theta_{\mathrm{SGD}} - \bar\theta_{\mathrm{SGD}}$ and $\theta_{\mathrm{bootstrap}} - \bar\theta_{\mathrm{bootstrap}}$, overlaid on $N(0, 1/n)$, for three univariate models:
(a) Normal: $\frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2}\right)$, $\mu = 0$
(b) Exponential: $\mu e^{-\mu x}$, $\mu = 1$
(c) Poisson: $\frac{\mu^x e^{-\mu}}{x!}$, $\mu = 1$
95% confidence interval coverage simulation
Each table cell reports (coverage probability, confidence interval width).

η       t = 100         t = 500         t = 2500
0.1     (0.957, 4.41)   (0.955, 4.51)   (0.960, 4.53)
0.02    (0.869, 3.30)   (0.923, 3.77)   (0.918, 3.87)
0.004   (0.634, 2.01)   (0.862, 3.20)   (0.916, 3.70)
(a) Bootstrap (0.941, 4.14), normal approximation (0.928, 3.87)

η       t = 100         t = 500         t = 2500
0.1     (0.949, 4.74)   (0.962, 4.91)   (0.963, 4.94)
0.02    (0.845, 3.37)   (0.916, 4.01)   (0.927, 4.17)
0.004   (0.616, 2.00)   (0.832, 3.30)   (0.897, 3.93)
(b) Bootstrap (0.938, 4.47), normal approximation (0.925, 4.18)

Table 1: Linear regression: dimension = 10, 100 samples. (a) diagonal covariance; (b) non-diagonal covariance.

η       t = 100          t = 500          t = 2500
0.1     (0.872, 0.204)   (0.937, 0.249)   (0.939, 0.258)
0.02    (0.610, 0.112)   (0.871, 0.196)   (0.926, 0.237)
0.004   (0.312, 0.051)   (0.596, 0.111)   (0.86, 0.194)
(a) Bootstrap (0.932, 0.253), normal approximation (0.957, 0.264)

η       t = 100          t = 500          t = 2500
0.1     (0.859, 0.206)   (0.931, 0.255)   (0.947, 0.266)
0.02    (0.600, 0.112)   (0.847, 0.197)   (0.931, 0.244)
0.004   (0.302, 0.051)   (0.583, 0.111)   (0.851, 0.195)
(b) Bootstrap (0.932, 0.245), normal approximation (0.954, 0.256)

Table 2: Logistic regression: dimension = 10, 1000 samples. (a) diagonal covariance; (b) non-diagonal covariance.

Coverage is better when:
each replicate's average uses a longer consecutive sequence (larger t)
the step size η is larger
Adversarial Attacks
Neural network classifiers with very high accuracy on test sets are
extremely susceptible to nearly imperceptible adversarial attacks.
Confidence intervals for mitigating adversarial examples
MNIST – logistic regression

Figure 1: MNIST adversarial perturbation (scaled for display).
(b) Original “0”: $P\{0 \mid \text{image}\} \approx 1 - e^{-46}$, CI $\approx (1 - e^{-28}, 1 - e^{-64})$
(c) Adversarial “0”: $P\{0 \mid \text{image}\} \approx e^{-17}$, CI $\approx (e^{-31}, 1 - e^{-11})$
Adversarial examples produced by the gradient attack have large confidence intervals!