Deep Learning Basics
Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization
• Specifically: additional terms in the training optimization objective to
prevent overfitting or help the optimization
Review: overfitting
$t = \sin(2\pi x) + \epsilon$
Figure from Machine Learning
and Pattern Recognition, Bishop
Overfitting example: regression using polynomials
Overfitting example: regression using polynomials
Figure from Machine Learning
and Pattern Recognition, Bishop
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the
difference between the two
• The result: small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps
• Classical regularization: some principled ways to constrain hypotheses
• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
$$\min_f \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i) \quad \text{subject to: } f \in \mathcal{H}$$
• When parametrized
$$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } \theta \in \Omega$$
Regularization as hard constraint
• When $\Omega$ is specified by some quantity $R$
$$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } R(\theta) \le r$$
• Example: $l_2$ regularization
$$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } \|\theta\|_2^2 \le r^2$$
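One standard way to train under such a hard constraint is projected gradient descent: take a gradient step, then project back onto the feasible set. A minimal sketch for the $l_2$ ball, using a made-up quadratic loss (the matrix `A`, vector `b`, step size, and radius are illustrative, not from the lecture):

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

# Hypothetical quadratic loss L(theta) = ||A theta - b||^2 / n, for illustration only.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)

def grad(theta):
    return 2 * A.T @ (A @ theta - b) / len(b)

theta = np.zeros(5)
r = 1.0
for _ in range(500):
    # gradient step on the unconstrained loss, then projection back into the set
    theta = project_l2_ball(theta - 0.05 * grad(theta), r)
```

After every iteration the iterate satisfies the constraint $\|\theta\|_2 \le r$ exactly, which is the defining property of the hard-constraint view.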
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint version
$$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$$
for some regularization parameter $\lambda^* > 0$
• Example: $l_2$ regularization
$$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
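As a concrete sketch (hypothetical data, not from the slides), the soft-constraint objective is just the empirical loss plus the penalty; two parameter vectors that fit the data identically are then distinguished by the penalty term alone:

```python
import numpy as np

def l2_regularized_loss(theta, X, y, lam):
    """Empirical squared loss plus the soft l2 penalty lam * ||theta||_2^2."""
    r = X @ theta - y
    return np.mean(r ** 2) + lam * np.sum(theta ** 2)

# Two parameter vectors with identical data fit (both give X @ theta = y exactly) ...
X = np.array([[1.0, 1.0]])
y = np.array([1.0])
theta_small = np.array([0.5, 0.5])
theta_big = np.array([1.5, -0.5])

# ... but the penalty breaks the tie in favor of the smaller-norm solution.
loss_small = l2_regularized_loss(theta_small, X, y, lam=0.1)
loss_big = l2_regularized_loss(theta_big, X, y, lam=0.1)
```

This is the mechanism by which the soft constraint "throws away" hypotheses: among equally good fits, it prefers the one with smaller $R(\theta)$.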
Regularization as soft constraint
• Shown via the Lagrange multiplier method
$$\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$$
• Suppose $\theta^*$ is optimal for the hard-constraint problem; then
$$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \hat{L}(\theta) + \lambda [R(\theta) - r]$$
• Suppose $\lambda^*$ is the corresponding optimal multiplier for the inner max; then
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$$
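A tiny 1-D numeric check of this correspondence (a made-up example, not from the lecture): take $\hat{L}(\theta) = (\theta - 2)^2$, $R(\theta) = \theta^2$, and $r = 1$. The hard-constraint minimizer is $\theta^* = 1$, and the soft-constraint problem with the matching multiplier $\lambda^* = 1$ (closed form $\theta = 2/(1+\lambda)$) recovers the same point:

```python
import numpy as np

# Hard problem:  min (theta - 2)^2   subject to  theta^2 <= 1
# Soft problem:  min (theta - 2)^2 + lam * theta^2
thetas = np.linspace(-3.0, 3.0, 600001)

# Hard-constraint solution: minimize the loss over the feasible set only.
feasible = thetas[thetas ** 2 <= 1.0]
theta_hard = feasible[np.argmin((feasible - 2.0) ** 2)]

# Soft-constraint solution with the matching multiplier lam* = 1.
lam_star = 1.0
theta_soft = thetas[np.argmin((thetas - 2.0) ** 2 + lam_star * thetas ** 2)]
```

Both grid searches land on $\theta = 1$, the point where the constraint is active and the Lagrangian stationarity condition $2(\theta - 2) + 2\lambda^*\theta = 0$ holds.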
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
Regularization as Bayesian prior
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
• Maximum a posteriori (MAP):
$$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \big[ \underbrace{\log p(\theta)}_{\text{regularization}} + \underbrace{\log p(\{x_i, y_i\} \mid \theta)}_{\text{MLE loss}} \big]$$
Regularization as Bayesian prior
• Example: $l_2$ loss with $l_2$ regularization
$$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2 + \lambda^* \|\theta\|_2^2$$
• Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
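For this linear case the MAP estimate has a closed form (ridge regression): $\theta = (X^T X + n\lambda^* I)^{-1} X^T y$. A quick sanity check on synthetic data (the data and $\lambda$ here are made up for illustration) that the closed form indeed makes the gradient of the regularized objective vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lam = 0.3
n = len(y)

# Closed-form minimizer of (1/n)||X theta - y||^2 + lam * ||theta||^2.
theta_map = np.linalg.solve(X.T @ X + n * lam * np.eye(3), X.T @ y)

# At the minimizer, the gradient of the regularized objective is (numerically) zero.
grad = 2 * X.T @ (X @ theta_map - y) / n + 2 * lam * theta_map
```

Under the Bayesian reading, the same `theta_map` is the posterior mode for a Gaussian likelihood with a zero-mean Gaussian prior whose variance is inversely related to $\lambda$.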
Three views
• Typical choice for optimization: soft-constraint
$$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$$
• Hard constraint and Bayesian view: conceptual; or used for derivation
Three views
• Hard constraint preferred if:
• The explicit bound $R(\theta) \le r$ is known
• The soft constraint gets trapped in local minima with small $\theta$
• Projection back onto the feasible set improves stability
• Bayesian view preferred if:
• The prior distribution is known
Some examples
Classical regularization
• Norm penalty
• 𝑙2 regularization
• 𝑙1 regularization
• Robustness to noise
𝑙2 regularization
$$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2} \|\theta\|_2^2$$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$$
• Gradient descent update
$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta\alpha)\,\theta - \eta \nabla \hat{L}(\theta)$$
• Terminology: weight decay
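The weight-decay identity is easy to verify numerically: a gradient step on the regularized objective equals shrinking $\theta$ by the factor $(1 - \eta\alpha)$ and then stepping on the unregularized gradient. A sketch with a stand-in gradient vector (the values are arbitrary):

```python
import numpy as np

rng = rng = np.random.default_rng(2)
theta = rng.normal(size=4)
grad_L = rng.normal(size=4)   # stand-in for the gradient of the unregularized loss
eta, alpha = 0.1, 0.01

# Step on the regularized objective.
step_regularized = theta - eta * (grad_L + alpha * theta)

# Equivalent "weight decay" form: shrink theta, then step on the plain gradient.
step_weight_decay = (1 - eta * alpha) * theta - eta * grad_L
```

The two updates agree exactly, which is why $l_2$ regularization under plain gradient descent is called weight decay.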
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
$$\nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$$
Effect on the optimal solution
• Gradient of the regularized objective
$$\nabla \hat{L}_R(\theta) \approx H (\theta - \theta^*) + \alpha \theta$$
• At the regularized optimum $\theta_R^*$
$$0 = \nabla \hat{L}_R(\theta_R^*) \approx H (\theta_R^* - \theta^*) + \alpha \theta_R^*$$
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
Effect on the optimal solution
• The optimum
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
• Suppose $H$ has eigendecomposition $H = Q \Lambda Q^T$; then
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$$
• Effect: rescaling along the eigenvectors of $H$
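A numeric check of the rescaling formula on a random symmetric positive semidefinite $H$ (the matrix and $\alpha$ are arbitrary test values): solving $(H + \alpha I)\,\theta_R^* = H \theta^*$ directly agrees with the eigendecomposition form, in which the component of $\theta^*$ along eigenvector $i$ is scaled by $\lambda_i / (\lambda_i + \alpha)$.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
H = M @ M.T                      # symmetric positive semidefinite Hessian
theta_star = rng.normal(size=4)
alpha = 0.5

# Direct form: theta_R* = (H + alpha I)^{-1} H theta*.
theta_R = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Eigendecomposition form: scale each eigen-component by lam_i / (lam_i + alpha).
lam, Q = np.linalg.eigh(H)
theta_R_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ theta_star
```

Directions with large curvature ($\lambda_i \gg \alpha$) are barely shrunk, while directions with small curvature ($\lambda_i \ll \alpha$) are shrunk almost to zero, which matches the figure.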
Effect on the optimal solution
Figure from Deep Learning,
Goodfellow, Bengio and Courville
Notation: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$
𝑙1 regularization
$$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha \|\theta\|_1$$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective
$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\, \mathrm{sign}(\theta)$$
where $\mathrm{sign}$ applies elementwise to $\theta$
• Gradient descent update
$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha\, \mathrm{sign}(\theta)$$
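One such (sub)gradient step sketched in NumPy, with made-up values. Unlike the $l_2$ penalty, which shrinks each coordinate proportionally to its magnitude, the $l_1$ penalty pushes every nonzero coordinate toward zero by the constant amount $\eta\alpha$; and since `np.sign(0.0)` is `0`, coordinates that are exactly zero feel no penalty force, matching the usual subgradient choice at zero:

```python
import numpy as np

theta = np.array([0.8, -0.3, 0.0])
grad_L = np.array([0.1, 0.2, -0.05])  # stand-in for the unregularized gradient
eta, alpha = 0.1, 0.05

# l1 (sub)gradient step: constant-size pull toward zero on nonzero coordinates.
theta_new = theta - eta * (grad_L + alpha * np.sign(theta))
```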
Effect on the optimal solution
• Consider a quadratic approximation around $\theta^*$
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
• Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
Effect on the optimal solution
• Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
• not true in general, but assumed here to build intuition
• The regularized objective is (ignoring constants)
$$\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} (\theta_i - \theta_i^*)^2 + \alpha |\theta_i| \right]$$
• The optimal $\theta_R^*$
$$(\theta_R^*)_i \approx \begin{cases} \max\left\{\theta_i^* - \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* \ge 0 \\[4pt] \min\left\{\theta_i^* + \frac{\alpha}{H_{ii}},\, 0\right\} & \text{if } \theta_i^* < 0 \end{cases}$$
Effect on the optimal solution
• Effect: induces sparsity
[Figure: $(\theta_R^*)_i$ plotted against $(\theta^*)_i$; the output is exactly zero on the interval $[-\frac{\alpha}{H_{ii}}, \frac{\alpha}{H_{ii}}]$ and shifted toward zero outside it]
Effect on the optimal solution
• Further assume that 𝐻 is diagonal
• Compact expression for the optimal $\theta_R^*$
$$(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*)\, \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\, 0 \right\}$$
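This compact expression is the soft-thresholding operator. A sketch that implements it and cross-checks each coordinate against a brute-force grid minimization of the per-coordinate objective $\frac{1}{2} H_{ii} (t - \theta_i^*)^2 + \alpha |t|$ (the values of $H$, $\alpha$, and $\theta^*$ are chosen for illustration):

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Closed-form l1-regularized optimum for a diagonal quadratic loss."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

H_diag = np.array([2.0, 2.0, 2.0])
alpha = 1.0
theta_star = np.array([1.5, 0.3, -0.9])

# Coordinates with |theta*_i| <= alpha / H_ii are set exactly to zero.
theta_R = soft_threshold(theta_star, alpha, H_diag)

# Cross-check: minimize 0.5 * H_ii * (t - theta*_i)^2 + alpha * |t| on a fine grid.
grid = np.linspace(-2.0, 2.0, 400001)
brute = np.array([
    grid[np.argmin(0.5 * h * (grid - t) ** 2 + alpha * np.abs(grid))]
    for h, t in zip(H_diag, theta_star)
])
```

Here the middle coordinate ($|\theta_2^*| = 0.3 \le \alpha/H_{22} = 0.5$) is zeroed out exactly, while the others are shrunk by $0.5$; the grid search agrees.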
Bayesian view
• $l_1$ regularization corresponds to a Laplacian prior
$$p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$$
$$\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha \|\theta\|_1 + \text{constant}$$
so maximizing the log-posterior subtracts the $l_1$ penalty $\alpha \|\theta\|_1$ from the log-likelihood