Deep Learning Theory Seminar (Chap 1-2, part 1)
1. Deep Learning Theory Lecture Note
Chapter 1-2 (part 1)
2022.03.02.
KAIST ALIN-LAB
Sangwoo Mo
2. Overview of the lecture
• Setup
  • Consider feedforward networks
    • Often, only shallow networks are considered
    • Omit advanced architectures (e.g., ResNet, RNN, Transformer) for simplicity
  • Consider supervised learning
    • Given training samples $\{(x_i, y_i)\}_{i=1}^n$, minimize the empirical risk (see the sketch below)
    • We aim to minimize the (population) risk
    • Consider (1-dim) binary classification (𝑦 ∈ {+1, −1}) with squared loss
    • Omit advanced learning schemes (e.g., self-supervised) for simplicity
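For reference, the two objectives in symbols (a minimal sketch; the sample count $n$ and data distribution $\mathcal{D}$ are notational assumptions, not from the slides):

$$
\widehat{\mathcal{R}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),
\qquad
\mathcal{R}(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell\big(f(x), y\big)\big],
$$

where the squared loss for $y \in \{+1, -1\}$ is $\ell(f(x), y) = (f(x) - y)^2$.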
3. Overview of the lecture
• Topics of deep learning theory
  • Recall that we aim to minimize the risk $\mathcal{R}(\hat{f})$
  • It can be decomposed into 3 terms, where $\bar{f} \in \mathcal{F}$ is some reference solution (see the decomposition below):
    1. Approximation: the hypothesis space ℱ is expressive enough
      • The risk of the global optimum $\bar{f}$ is small
    2. Optimization: SGD can find the (near-)global optimum
      • The empirical risk of the learned model $\hat{f}$ ≈ that of the global optimum $\bar{f}$
    3. Generalization: the learned model can also predict unseen samples
      • The empirical risk $\widehat{\mathcal{R}}$ ≈ the population risk ℛ
      • (2 and 3 are treated together)
• Notation
  • ℱ: hypothesis space
  • 𝑓 ∈ ℱ: hypothesis (NN in our case)
  • $\hat{f} \in \mathcal{F}$: hypothesis that minimizes the empirical risk $\widehat{\mathcal{R}}(f)$
  • $\bar{f} \in \mathcal{F}$: hypothesis that minimizes the population risk ℛ(𝑓)
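In symbols, one standard way to carry out this decomposition (a sketch; grouping the terms under the slide's three labels is one common choice):

$$
\mathcal{R}(\hat{f})
= \underbrace{\big(\mathcal{R}(\hat{f}) - \widehat{\mathcal{R}}(\hat{f})\big) + \big(\widehat{\mathcal{R}}(\bar{f}) - \mathcal{R}(\bar{f})\big)}_{\text{3. generalization}}
+ \underbrace{\big(\widehat{\mathcal{R}}(\hat{f}) - \widehat{\mathcal{R}}(\bar{f})\big)}_{\text{2. optimization}}
+ \underbrace{\mathcal{R}(\bar{f})}_{\text{1. approximation}} .
$$

The identity telescopes: every term except $\mathcal{R}(\hat{f})$ is added and subtracted exactly once.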
4. Chap 1. Approximation
• Approximation → bound a function norm
  • NN can approximate an arbitrary smooth (Lipschitz) function on a compact domain 𝑆 (closed and bounded, e.g., [0,1])
  • ∀𝑔 ∈ 𝒞(𝑆) (the space of smooth functions), ∃𝑓 ∈ ℱ such that $\mathcal{R}(f) - \mathcal{R}(g) < \epsilon$
  • We bound the gap of risks by a function norm, assuming the loss ℓ is 𝜌-Lipschitz (see the sketch below)
  • Specifically, consider two function norms:
    • Uniform norm (worst case)
    • $L_1$ norm (average case)
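A sketch of why a function-norm bound suffices, assuming the loss $\ell(\cdot, y)$ is $\rho$-Lipschitz in its first argument:

$$
\mathcal{R}(f) - \mathcal{R}(g)
= \mathbb{E}\big[\ell(f(x), y) - \ell(g(x), y)\big]
\le \rho\, \mathbb{E}\,|f(x) - g(x)|
= \rho\, \|f - g\|_{L_1}
\le \rho\, \|f - g\|_{\infty},
$$

where $\|f - g\|_\infty = \sup_{x \in S} |f(x) - g(x)|$ is the uniform norm (worst case) and $\|f - g\|_{L_1} = \mathbb{E}\,|f(x) - g(x)|$ is the $L_1$ norm w.r.t. the input distribution (average case).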
5. Chap 2. Approximation of finite-width NN
• Overview of the chapter
  • In this chapter, we prove the approximation ability of finite-width NNs
    • (2.1) Constructive proof for specific activations
    • (2.2) Universal approximation for general activations
  • Here, carefully check the assumption on the activation function 𝜎 (e.g., sigmoid, ReLU)
• Spoiler
  • Chap 3. Define the NN as an infinite-width NN 𝑓 = ∫ 𝜎(⋯) (a.k.a. Barron's construction)
    • Sampling finitely many nodes to approximate the integral ⇒ the error goes to 0
  • Chap 4. An infinite-width NN near initialization has an analytic representation (a.k.a. NTK)
    • The corresponding hypothesis space (an RKHS) is a universal approximator
6-7. Chap 2.1 Constructive proof
• Univariate case
  • A smooth function can be approximated by a piecewise-constant function
  • Key logic: a sum of steps $f(x) = \sum_i a_i \, \mathbf{1}[x \ge b_i]$ can approximate an arbitrary function
  • It is a 2-layer MLP with an indicator activation $\mathbf{1}[x \ge 0]$
  • # of nodes $m \propto 1/\text{error}$ (see the sketch below)
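A minimal numpy sketch of this construction (the function name `step_network`, the grid choice, and g = sin are illustrative assumptions, not from the slides): the network matches 𝑔 at the left end of each of the $m$ grid cells, so the error scales like $1/m$ for Lipschitz 𝑔.

```python
import numpy as np

def step_network(g, m):
    """2-layer MLP with indicator activation 1[x >= b]:
    f(x) = sum_i a_i * 1[x >= b_i], piecewise constant on [0, 1)."""
    b = np.linspace(0.0, 1.0, m, endpoint=False)  # breakpoints b_0 = 0, ..., b_{m-1}
    a = np.diff(g(b), prepend=0.0)                # a_i = g(b_i) - g(b_{i-1}); a_0 = g(b_0)
    def f(x):
        x = np.asarray(x, dtype=float)
        # one hidden node per breakpoint; output layer sums them with weights a_i
        return (a * (x[..., None] >= b)).sum(axis=-1)
    return f

g = np.sin                                        # any 1-Lipschitz g on [0, 1]
xs = np.linspace(0.0, 1.0, 10_000)
for m in [10, 100, 1000]:
    f = step_network(g, m)
    print(m, np.abs(f(xs) - g(xs)).max())         # sup-norm error shrinks like 1/m
```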
8-10. Chap 2.1 Constructive proof
• Multivariate case
  • This logic can be extended to the multivariate case
  • A compact set $U \subset \mathbb{R}^d$ can be approximated by a partition of rectangles (in symbols below)
    • # of nodes $\propto (1/\delta)^d$ (curse of dimension)
  • Similar to before, a 2-layer MLP can approximate an arbitrary 𝑔
  • However, the indicator activation $\mathbf{1}_{R}$ (of a rectangle $R$) is an uncommon choice
  • Instead, we approximate $\mathbf{1}_{R}$ with a 2-layer ReLU composition
  ⇒ A 3-layer MLP with ReLU activation can approximate an arbitrary multivariate 𝑔
  • Caveat: this only guarantees the $L_1$ norm (not the uniform norm)
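In symbols (a sketch; the grid spacing $\delta$ is from the slide's annotation, while the notation $R_i$, $x_i$ and the exact error bound are assumptions): cover $U \subseteq [0,1]^d$ with cubes $R_i$ of side length $\delta$ and copy 𝑔 at one point $x_i$ per cube,

$$
f(x) = \sum_i g(x_i)\, \mathbf{1}_{R_i}(x),
\qquad
\sup_{x} |f(x) - g(x)| \le \rho\, \delta
\quad (g \text{ is } \rho\text{-Lipschitz w.r.t. } \|\cdot\|_\infty),
$$

which uses $(1/\delta)^d$ nodes, hence the curse of dimension.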
11-12. Chap 2.1 Constructive proof
• Multivariate case
  • A 3-layer MLP with ReLU activation can approximate an arbitrary multivariate 𝑔
  • Proof. The only remaining step is approximating $\mathbf{1}_R$ with ReLU (key logic; see the sketch below):
    • first, build the indicator for a 1-dim interval,
    • then, combine these into the indicator for a 𝑑-dim rectangle
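A minimal numpy sketch of that remaining step (the names `ramp`, `interval`, `rectangle`, and the sharpness parameter `gamma` are illustrative assumptions): the first ReLU layer builds per-coordinate interval indicators as differences of ramps, and a second ReLU ANDs them into the rectangle indicator. The approximation is exact except on boundary strips of width gamma, which is why only the $L_1$ guarantee survives.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ramp(x, t, gamma):
    """ReLU approx of the step 1[x >= t]: 0 below t, 1 above t + gamma."""
    return (relu(x - t) - relu(x - t - gamma)) / gamma

def interval(x, a, b, gamma):
    """ReLU approx of the 1-dim interval indicator 1[a <= x < b]."""
    return ramp(x, a, gamma) - ramp(x, b, gamma)

def rectangle(x, lo, hi, gamma):
    """ReLU approx of the d-dim rectangle indicator prod_j 1[lo_j <= x_j < hi_j].
    The second ReLU layer fires only when all d coordinate indicators are ~1."""
    u = sum(interval(x[..., j], lo[j], hi[j], gamma) for j in range(len(lo)))
    return relu(u - (len(lo) - 1))

# 2-dim example: indicator of [0.2, 0.6) x [0.3, 0.7)
lo, hi, gamma = np.array([0.2, 0.3]), np.array([0.6, 0.7]), 1e-3
pts = np.array([[0.4, 0.5],   # inside  -> ~1
                [0.1, 0.5],   # outside -> ~0
                [0.4, 0.9]])  # outside -> ~0
print(rectangle(pts, lo, hi, gamma))
```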
13. Summary
• (2.1) Constructive proof
  • Approximate a univariate function 𝑔: ℝ → ℝ with a 2-layer MLP with an indicator activation (in uniform norm)
  • Approximate a multivariate function $g: \mathbb{R}^d \to \mathbb{R}$ with a 3-layer MLP with ReLU activation (in $L_1$ norm)
  • Also note that this construction requires exponentially many (in the dimension 𝑑) nodes