Deep Learning Basics
Lecture 3: Regularization I
Princeton University COS 495
Instructor: Yingyu Liang
What is regularization?
• In general: any method to prevent overfitting or help the optimization
• Specifically: additional terms in the training optimization objective to
prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials, with data generated as
$$t = \sin(2\pi x) + \epsilon$$
Figures from Pattern Recognition and Machine Learning, Bishop
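The polynomial example above can be reproduced in a few lines. A minimal sketch, assuming NumPy and synthetic data $t = \sin(2\pi x) + \epsilon$ with illustrative sizes (10 training points, 200 held-out points): the high-degree fit drives the training error below the low-degree fit, while the held-out error typically moves the other way.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.15):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise, n)
    return x, t

x_train, t_train = make_data(10)    # small training set
x_test, t_test = make_data(200)     # held-out set approximates the expected loss

def errors(deg):
    coeffs = np.polyfit(x_train, t_train, deg)   # least-squares polynomial fit
    train = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    return train, test

train3, test3 = errors(3)
train9, test9 = errors(9)
# the degree-9 polynomial fits the 10 training points almost exactly,
# so its training error is lower, but it typically generalizes worse
```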
Overfitting
• Empirical loss and expected loss are different
• The smaller the data set, the larger the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the
difference between the two
• Thus the model has small training error but large test error (overfitting)
Prevent overfitting
• Larger data set helps
• Throwing away useless hypotheses also helps
• Classical regularization: some principled ways to constrain hypotheses
• Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint
• Training objective
$$\min_f \; \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i) \quad \text{subject to: } f \in \mathcal{H}$$
• When parametrized
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } \theta \in \Omega$$
Regularization as hard constraint
• When Ω is measured by some quantity R
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } R(\theta) \le r$$
• Example: $l_2$ regularization
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to: } \|\theta\|_2^2 \le r^2$$
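One way to enforce such a hard constraint in practice is projected gradient descent: take a gradient step, then project back onto the ball $\{\theta : \|\theta\|_2 \le r\}$. A minimal sketch on an illustrative quadratic loss (the target point, radius, and step size are made up for the example):

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

# projected gradient descent on L(theta) = 0.5 * ||theta - target||^2
target = np.array([3.0, 4.0])   # unconstrained optimum, norm 5
theta = np.zeros(2)
r, eta = 2.0, 0.1
for _ in range(500):
    grad = theta - target                       # gradient of the quadratic loss
    theta = project_l2_ball(theta - eta * grad, r)   # step, then project

# for this loss the constrained optimum is target rescaled onto the boundary:
# theta converges to (r / ||target||) * target = [1.2, 1.6]
```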
Regularization as soft constraint
• The hard-constraint optimization is equivalent to the soft-constraint version
$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$$
for some regularization parameter $\lambda^* > 0$
• Example: $l_2$ regularization
$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
Regularization as soft constraint
• Shown by the Lagrange multiplier method
$$\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda[R(\theta) - r]$$
• Suppose θ* is the optimum of the hard-constraint optimization
$$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \hat{L}(\theta) + \lambda[R(\theta) - r]$$
• Suppose λ* is the corresponding optimal λ for the inner max; then
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^*[R(\theta) - r]$$
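A one-dimensional sanity check of this equivalence, using the illustrative choices $\hat{L}(\theta) = (\theta - 3)^2$, $R(\theta) = \theta^2$, $r = 1$, for which the matching multiplier works out to $\lambda^* = 2$:

```python
import numpy as np

# Hard constraint: min (theta - 3)^2 subject to theta^2 <= 1.
# The constrained optimum sits on the boundary at theta* = 1.
feasible = np.linspace(-1, 1, 200001)          # grid over the feasible set
hard_opt = feasible[np.argmin((feasible - 3) ** 2)]

# Soft constraint with the matching multiplier lambda* = 2:
# (theta - 3)^2 + 2 * theta^2 = 3 theta^2 - 6 theta + 9, minimized at theta = 1.
lam = 2.0
grid = np.linspace(-5, 5, 1000001)             # unconstrained search grid
soft_opt = grid[np.argmin((grid - 3) ** 2 + lam * grid ** 2)]

# both formulations recover the same theta* = 1
```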
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: p(θ)
• Posterior over the hypotheses: p(θ | {xᵢ, yᵢ})
• Likelihood: p({xᵢ, yᵢ} | θ)
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
Regularization as Bayesian prior
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
• Maximum A Posteriori (MAP):
$$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[\log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)\right]$$
where the first term plays the role of the regularization and the second is the MLE loss
Regularization as Bayesian prior
• Example: $l_2$ loss with $l_2$ regularization
$$\min_\theta \; \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left(f_\theta(x_i) - y_i\right)^2 + \lambda^* \|\theta\|_2^2$$
• Corresponds to a normal likelihood p(x, y | θ) and a normal prior p(θ)
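For the squared loss, this $l_2$-regularized objective (ridge regression) has a closed-form minimizer. A sketch on hypothetical synthetic data; the problem sizes and the value of λ* are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)
lam = 0.1   # the regularization weight lambda*

# Minimizer of (1/n)||X theta - y||^2 + lam ||theta||^2 in closed form:
# setting the gradient (2/n) X^T (X theta - y) + 2 lam theta to zero gives
# (X^T X / n + lam I) theta = X^T y / n.
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# the gradient of the regularized objective vanishes at theta_ridge
grad = (2 / n) * X.T @ (X @ theta_ridge - y) + 2 * lam * theta_ridge
```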
Three views
• Typical choice for optimization: soft constraint
$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$$
• Hard-constraint and Bayesian views: conceptual, or used for derivation
Three views
• Hard constraint preferred if
  • the explicit bound R(θ) ≤ r is known
  • the soft constraint can leave optimization trapped in a local minimum with small θ
  • projecting back to the feasible set leads to stability
• Bayesian view preferred if
  • the prior distribution is known
Some examples
Classical regularization
• Norm penalty
• 𝑙2 regularization
• 𝑙1 regularization
• Robustness to noise
𝑙2 regularization
$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2}\|\theta\|_2^2$$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\theta$$
• Gradient descent update (step size η)
$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\alpha\theta = (1 - \eta\alpha)\theta - \eta \nabla \hat{L}(\theta)$$
• Terminology: weight decay
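The algebra behind the name can be checked directly: one step on the regularized objective equals shrinking θ by the factor (1 − ηα) and then applying the plain data-loss gradient. A sketch with made-up values for θ, the gradient, η, and α:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=5)
grad_L = rng.normal(size=5)   # stand-in for the data-loss gradient at theta
eta, alpha = 0.01, 0.1

# one gradient step on the regularized objective L_R = L + (alpha/2)||theta||^2
step_regularized = theta - eta * (grad_L + alpha * theta)

# the same step written as "weight decay": shrink theta, then apply the plain gradient
step_weight_decay = (1 - eta * alpha) * theta - eta * grad_L

# the two forms are algebraically identical
```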
Effect on the optimal solution
• Consider a quadratic approximation around θ*
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$$
• Since θ* is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$$
$$\nabla \hat{L}(\theta) \approx H(\theta - \theta^*)$$
Effect on the optimal solution
• Gradient of regularized objective
$$\nabla \hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha\theta$$
• At the regularized optimum $\theta_R^*$
$$0 = \nabla \hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha\theta_R^*$$
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
Effect on the optimal solution
• The optimal
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
• Suppose H has the eigendecomposition $H = Q\Lambda Q^T$; then
$$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T \theta^*$$
• Effect: rescale θ* along the eigenvectors of H, shrinking the i-th component by $\lambda_i / (\lambda_i + \alpha)$
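The rescaling claim is easy to verify numerically: solving $(H + \alpha I)\,\theta_R^* = H\theta^*$ agrees with shrinking each eigen-component of θ* by $\lambda_i / (\lambda_i + \alpha)$. A sketch with a random positive-definite H (sizes and α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 4, 0.5
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)          # symmetric positive definite "Hessian"
theta_star = rng.normal(size=d)

# regularized optimum via the linear system (H + alpha I) theta_R = H theta*
theta_R = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)

# same result via the eigendecomposition H = Q Lambda Q^T:
# each eigen-component of theta* is scaled by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
theta_R_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ theta_star
```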
Effect on the optimal solution
Figure from Deep Learning, Goodfellow, Bengio and Courville
Notation: $\theta^* = w^*$, $\theta_R^* = \widetilde{w}$
𝑙1 regularization
$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha\|\theta\|_1$$
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of regularized objective
$$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\,\mathrm{sign}(\theta)$$
where sign applies elementwise to θ
• Gradient descent update
$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\alpha\,\mathrm{sign}(\theta)$$
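In code the update looks like the following; note that, unlike the $l_2$ case, every nonzero coordinate is pulled toward zero by the same fixed amount ηα, independent of its magnitude (the values below are made up for illustration):

```python
import numpy as np

theta = np.array([0.8, -0.3, 0.0, 2.0])
grad_L = np.array([0.1, -0.2, 0.05, -0.1])   # stand-in data-loss gradient
eta, alpha = 0.1, 0.5

# l1 (sub)gradient step: the penalty contributes a constant pull of
# eta * alpha toward zero on each nonzero coordinate (np.sign(0) == 0,
# a standard subgradient choice at the kink)
theta_new = theta - eta * grad_L - eta * alpha * np.sign(theta)
```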
Effect on the optimal solution
• Consider a quadratic approximation around θ*
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$$
• Since θ* is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
$$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^T H (\theta - \theta^*)$$
Effect on the optimal solution
• Further assume that H is diagonal and positive ($H_{ii} > 0$, ∀i)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
$$\hat{L}_R(\theta) \approx \sum_i \left[\frac{1}{2} H_{ii} \left(\theta_i - \theta_i^*\right)^2 + \alpha|\theta_i|\right]$$
• The optimal $\theta_R^*$:
$$(\theta_R^*)_i \approx \begin{cases} \max\left\{\theta_i^* - \frac{\alpha}{H_{ii}},\; 0\right\} & \text{if } \theta_i^* \ge 0 \\ \min\left\{\theta_i^* + \frac{\alpha}{H_{ii}},\; 0\right\} & \text{if } \theta_i^* < 0 \end{cases}$$
Effect on the optimal solution
• Effect: induces sparsity
[Figure: plot of $(\theta_R^*)_i$ against $(\theta^*)_i$, flat at zero on the interval $[-\alpha/H_{ii},\, \alpha/H_{ii}]$]
Effect on the optimal solution
• Further assume that H is diagonal
• Compact expression for the optimal $\theta_R^*$ (soft thresholding):
$$(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*)\, \max\left\{|\theta_i^*| - \frac{\alpha}{H_{ii}},\; 0\right\}$$
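This is the soft-thresholding operator; a minimal sketch with illustrative values for θ*, α, and the diagonal of H:

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Per-coordinate l1-regularized optimum for a diagonal quadratic loss."""
    shift = alpha / H_diag
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - shift, 0.0)

theta_star = np.array([1.5, -0.2, 0.05, -3.0])
H_diag = np.array([1.0, 1.0, 1.0, 2.0])
theta_R = soft_threshold(theta_star, alpha=0.5, H_diag=H_diag)
# coordinates with |theta_i*| <= alpha / H_ii are zeroed out exactly,
# the rest are shifted toward zero: this is how l1 induces sparsity
```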
Bayesian view
• $l_1$ regularization corresponds to a Laplacian prior
$$p(\theta) \propto \exp\left(-\alpha \sum_i |\theta_i|\right)$$
$$\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha\|\theta\|_1 + \text{constant}$$
so maximizing the log prior in MAP is the same as penalizing $\|\theta\|_1$

More Related Content

What's hot

Coordinate Descent method
Coordinate Descent methodCoordinate Descent method
Coordinate Descent method
Sanghyuk Chun
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Omid Vahdaty
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
ankit_ppt
 
Neurally Controlled Robot That Learns
Neurally Controlled Robot That LearnsNeurally Controlled Robot That Learns
Neurally Controlled Robot That Learns
Benjamin Walther Büel
 
Neural Networks. Overview
Neural Networks. OverviewNeural Networks. Overview
Neural Networks. Overview
Oleksandr Baiev
 
Ml10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topicsMl10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topics
ankit_ppt
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hakky St
 
Image classification with neural networks
Image classification with neural networksImage classification with neural networks
Image classification with neural networks
Sepehr Rasouli
 
Survey on contrastive self supervised l earning
Survey on contrastive self supervised l earningSurvey on contrastive self supervised l earning
Survey on contrastive self supervised l earning
Anirudh Ganguly
 
Paper Study: A learning based iterative method for solving vehicle routing
Paper Study: A learning based iterative method for solving vehicle routingPaper Study: A learning based iterative method for solving vehicle routing
Paper Study: A learning based iterative method for solving vehicle routing
ChenYiHuang5
 
Lecture 5 machine learning updated
Lecture 5   machine learning updatedLecture 5   machine learning updated
Lecture 5 machine learning updated
Vajira Thambawita
 
Introduction To Neural Network
Introduction To Neural NetworkIntroduction To Neural Network
Introduction To Neural Network
Bangalore
 
Back propagation
Back propagationBack propagation
Back propagation
Bangalore
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-Learning
Lyft
 
A temporal classifier system using spiking neural networks
A temporal classifier system using spiking neural networksA temporal classifier system using spiking neural networks
A temporal classifier system using spiking neural networks
Daniele Loiacono
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Ahmed Yousry
 
Deep gradient compression
Deep gradient compressionDeep gradient compression
Deep gradient compression
David Tung
 
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
Wangyu Han
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
Venkata Reddy Konasani
 
Unsupervised visual representation learning overview: Toward Self-Supervision
Unsupervised visual representation learning overview: Toward Self-SupervisionUnsupervised visual representation learning overview: Toward Self-Supervision
Unsupervised visual representation learning overview: Toward Self-Supervision
LEE HOSEONG
 

What's hot (20)

Coordinate Descent method
Coordinate Descent methodCoordinate Descent method
Coordinate Descent method
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
 
Neurally Controlled Robot That Learns
Neurally Controlled Robot That LearnsNeurally Controlled Robot That Learns
Neurally Controlled Robot That Learns
 
Neural Networks. Overview
Neural Networks. OverviewNeural Networks. Overview
Neural Networks. Overview
 
Ml10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topicsMl10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topics
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Image classification with neural networks
Image classification with neural networksImage classification with neural networks
Image classification with neural networks
 
Survey on contrastive self supervised l earning
Survey on contrastive self supervised l earningSurvey on contrastive self supervised l earning
Survey on contrastive self supervised l earning
 
Paper Study: A learning based iterative method for solving vehicle routing
Paper Study: A learning based iterative method for solving vehicle routingPaper Study: A learning based iterative method for solving vehicle routing
Paper Study: A learning based iterative method for solving vehicle routing
 
Lecture 5 machine learning updated
Lecture 5   machine learning updatedLecture 5   machine learning updated
Lecture 5 machine learning updated
 
Introduction To Neural Network
Introduction To Neural NetworkIntroduction To Neural Network
Introduction To Neural Network
 
Back propagation
Back propagationBack propagation
Back propagation
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-Learning
 
A temporal classifier system using spiking neural networks
A temporal classifier system using spiking neural networksA temporal classifier system using spiking neural networks
A temporal classifier system using spiking neural networks
 
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousryHands on machine learning with scikit-learn and tensor flow by ahmed yousry
Hands on machine learning with scikit-learn and tensor flow by ahmed yousry
 
Deep gradient compression
Deep gradient compressionDeep gradient compression
Deep gradient compression
 
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Unsupervised visual representation learning overview: Toward Self-Supervision
Unsupervised visual representation learning overview: Toward Self-SupervisionUnsupervised visual representation learning overview: Toward Self-Supervision
Unsupervised visual representation learning overview: Toward Self-Supervision
 

Similar to DL_lecture3_regularization_I.pdf

2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx
ZhiwuGuo1
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
ChenYiHuang5
 
Intro to statistical signal processing
Intro to statistical signal processingIntro to statistical signal processing
Intro to statistical signal processing
Nadav Carmel
 
Optimum engineering design - Day 5. Clasical optimization methods
Optimum engineering design - Day 5. Clasical optimization methodsOptimum engineering design - Day 5. Clasical optimization methods
Optimum engineering design - Day 5. Clasical optimization methods
SantiagoGarridoBulln
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
Jinho Lee
 
Lec05.pptx
Lec05.pptxLec05.pptx
Lec05.pptx
HassanAhmad442087
 
Optimum Engineering Design - Day 2b. Classical Optimization methods
Optimum Engineering Design - Day 2b. Classical Optimization methodsOptimum Engineering Design - Day 2b. Classical Optimization methods
Optimum Engineering Design - Day 2b. Classical Optimization methods
SantiagoGarridoBulln
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章
Tsuyoshi Sakama
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Jongsu "Liam" Kim
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
taeseon ryu
 
Page rank - from theory to application
Page rank - from theory to applicationPage rank - from theory to application
Page rank - from theory to application
GAYO3
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Sanghyuk Chun
 
Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr
taeseon ryu
 
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural NetworksPaper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
ChenYiHuang5
 
Continuous control
Continuous controlContinuous control
Continuous control
Reiji Hatsugai
 
Stochastic Optimization
Stochastic OptimizationStochastic Optimization
Stochastic Optimization
Mohammad Reza Jabbari
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
Seungeon Baek
 
13Kernel_Machines.pptx
13Kernel_Machines.pptx13Kernel_Machines.pptx
13Kernel_Machines.pptx
KarasuLee
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion Tradeoff
Taeoh Kim
 
A machine learning method for efficient design optimization in nano-optics
A machine learning method for efficient design optimization in nano-optics A machine learning method for efficient design optimization in nano-optics
A machine learning method for efficient design optimization in nano-optics
JCMwave
 

Similar to DL_lecture3_regularization_I.pdf (20)

2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx2Multi_armed_bandits.pptx
2Multi_armed_bandits.pptx
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
 
Intro to statistical signal processing
Intro to statistical signal processingIntro to statistical signal processing
Intro to statistical signal processing
 
Optimum engineering design - Day 5. Clasical optimization methods
Optimum engineering design - Day 5. Clasical optimization methodsOptimum engineering design - Day 5. Clasical optimization methods
Optimum engineering design - Day 5. Clasical optimization methods
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
 
Lec05.pptx
Lec05.pptxLec05.pptx
Lec05.pptx
 
Optimum Engineering Design - Day 2b. Classical Optimization methods
Optimum Engineering Design - Day 2b. Classical Optimization methodsOptimum Engineering Design - Day 2b. Classical Optimization methods
Optimum Engineering Design - Day 2b. Classical Optimization methods
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章
 
Solving Poisson Equation using Conjugate Gradient Method and its implementation
Solving Poisson Equation using Conjugate Gradient Methodand its implementationSolving Poisson Equation using Conjugate Gradient Methodand its implementation
Solving Poisson Equation using Conjugate Gradient Method and its implementation
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Page rank - from theory to application
Page rank - from theory to applicationPage rank - from theory to application
Page rank - from theory to application
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr
 
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural NetworksPaper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
 
Continuous control
Continuous controlContinuous control
Continuous control
 
Stochastic Optimization
Stochastic OptimizationStochastic Optimization
Stochastic Optimization
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
 
13Kernel_Machines.pptx
13Kernel_Machines.pptx13Kernel_Machines.pptx
13Kernel_Machines.pptx
 
PR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion TradeoffPR 113: The Perception Distortion Tradeoff
PR 113: The Perception Distortion Tradeoff
 
A machine learning method for efficient design optimization in nano-optics
A machine learning method for efficient design optimization in nano-optics A machine learning method for efficient design optimization in nano-optics
A machine learning method for efficient design optimization in nano-optics
 

Recently uploaded

Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 

Recently uploaded (20)

Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 

DL_lecture3_regularization_I.pdf

θ* = argmin_θ max_{λ≥0} ℒ(θ, λ) = argmin_θ max_{λ≥0} [ L̂(θ) + λ(R(θ) − r) ]
• Suppose λ* is the corresponding optimal λ for the inner max; then
θ* = argmin_θ ℒ(θ, λ*) = argmin_θ [ L̂(θ) + λ*(R(θ) − r) ]
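The hard/soft equivalence can be checked numerically. Below is a minimal NumPy sketch assuming a toy ridge-regression problem (the data, step size, and iteration count are illustrative): the closed-form soft-constraint solution is compared against projected gradient descent on the hard-constraint problem, with the radius r set to the norm of that solution.

```python
import numpy as np

# Toy problem (assumed data): empirical loss L-hat(theta) = (1/n) ||X theta - y||^2
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Soft constraint with regularization parameter lambda*:
# minimize (1/n)||X theta - y||^2 + lam ||theta||_2^2  (closed form)
lam = 0.1
theta_soft = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Hard constraint with r = ||theta_soft||: solve by projected gradient descent
r = np.linalg.norm(theta_soft)
theta_hard = np.zeros(d)
eta = 0.01
for _ in range(20000):
    grad = (2.0 / n) * X.T @ (X @ theta_hard - y)  # gradient of the empirical loss
    theta_hard -= eta * grad
    norm = np.linalg.norm(theta_hard)
    if norm > r:                                   # project back onto the l2 ball
        theta_hard *= r / norm
```

With the radius chosen this way the constraint is active at the optimum, so the KKT conditions of the hard problem reproduce the soft-constraint stationarity condition and the two solutions coincide.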
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: p(θ)
• Posterior over the hypotheses: p(θ | {xᵢ, yᵢ})
• Likelihood: p({xᵢ, yᵢ} | θ)
• Bayes' rule:
p(θ | {xᵢ, yᵢ}) = p(θ) p({xᵢ, yᵢ} | θ) / p({xᵢ, yᵢ})
Regularization as Bayesian prior
• Bayes' rule:
p(θ | {xᵢ, yᵢ}) = p(θ) p({xᵢ, yᵢ} | θ) / p({xᵢ, yᵢ})
• Maximum a posteriori (MAP):
max_θ log p(θ | {xᵢ, yᵢ}) = max_θ [ log p(θ) + log p({xᵢ, yᵢ} | θ) ]
where log p(θ) gives the regularization term and log p({xᵢ, yᵢ} | θ) gives the MLE loss
Regularization as Bayesian prior
• Example: l2 loss with l2 regularization
min_θ L̂_R(θ) = (1/n) Σᵢ₌₁ⁿ (f_θ(xᵢ) − yᵢ)² + λ* ‖θ‖₂²
• Corresponds to a normal likelihood p(x, y | θ) and a normal prior p(θ)
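As a sanity check of this correspondence, the sketch below (toy data; the variances σ² and τ² are illustrative choices, not values from the lecture) computes the Gaussian MAP estimate for a linear model and the ridge solution, which coincide when λ* = σ²/(nτ²).

```python
import numpy as np

# Toy linear-regression data (assumed)
rng = np.random.default_rng(1)
n, d = 40, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

sigma2 = 0.5  # assumed likelihood variance: y_i | theta ~ N(x_i^T theta, sigma2)
tau2 = 2.0    # assumed prior variance:      theta ~ N(0, tau2 * I)

# MAP: maximize log p(theta) + sum_i log p(x_i, y_i | theta)
# = minimize ||X theta - y||^2 / (2 sigma2) + ||theta||^2 / (2 tau2)
theta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

# Ridge: minimize (1/n) sum_i (x_i^T theta - y_i)^2 + lam ||theta||_2^2
# The two coincide when lam = sigma2 / (n * tau2).
lam = sigma2 / (n * tau2)
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```

A larger prior variance τ² corresponds to a smaller λ*, matching the intuition that a flatter prior regularizes less.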
Three views
• Typical choice for optimization: soft constraint
min_θ L̂_R(θ) = L̂(θ) + λ R(θ)
• Hard-constraint and Bayesian views: conceptual, or used for derivation
Three views
• The hard-constraint view is preferred if:
  • the explicit bound R(θ) ≤ r is known
  • the soft constraint causes the optimization to get trapped in a local minimum with small θ
  • projecting back onto the feasible set leads to stability
• The Bayesian view is preferred if the prior distribution is known
Classical regularization
• Norm penalty
  • l2 regularization
  • l1 regularization
• Robustness to noise
l2 regularization
min_θ L̂_R(θ) = L̂(θ) + (α/2) ‖θ‖₂²
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective:
∇L̂_R(θ) = ∇L̂(θ) + αθ
• Gradient descent update:
θ ← θ − η ∇L̂_R(θ) = θ − η ∇L̂(θ) − ηα θ = (1 − ηα) θ − η ∇L̂(θ)
• Terminology: weight decay
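The weight-decay identity is just algebra, but a short NumPy sketch makes it concrete (the vectors below are arbitrary stand-ins for θ and ∇L̂(θ), not values from any real model):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.normal(size=5)  # stands in for the current parameters theta
grad = rng.normal(size=5)   # stands in for grad L-hat(theta)
eta, alpha = 0.1, 0.01      # illustrative step size and regularization strength

# One gradient step on the l2-regularized objective ...
step_regularized = theta - eta * (grad + alpha * theta)

# ... equals shrinking the weights by (1 - eta*alpha), then a plain gradient step
step_weight_decay = (1 - eta * alpha) * theta - eta * grad
```

This is why l2 regularization is called "weight decay": every update multiplicatively decays the weights toward zero before applying the data-driven gradient.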
Effect on the optimal solution
• Consider a quadratic approximation around θ*:
L̂(θ) ≈ L̂(θ*) + (θ − θ*)ᵀ ∇L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*)
• Since θ* is optimal, ∇L̂(θ*) = 0, so
L̂(θ) ≈ L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*)
∇L̂(θ) ≈ H (θ − θ*)
Effect on the optimal solution
• Gradient of the regularized objective:
∇L̂_R(θ) ≈ H (θ − θ*) + αθ
• At the regularized optimum θ_R*:
0 = ∇L̂_R(θ_R*) ≈ H (θ_R* − θ*) + α θ_R*
θ_R* ≈ (H + αI)⁻¹ H θ*
Effect on the optimal solution
• The optimum θ_R* ≈ (H + αI)⁻¹ H θ*
• Suppose H has the eigendecomposition H = Q Λ Qᵀ; then
θ_R* ≈ (H + αI)⁻¹ H θ* = Q (Λ + αI)⁻¹ Λ Qᵀ θ*
• Effect: rescales θ* along the eigenvectors of H, shrinking the component along the i-th eigenvector by the factor λᵢ / (λᵢ + α)
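The eigenvector rescaling can be verified numerically. The sketch below uses an arbitrary random positive semidefinite H and an arbitrary θ* (both illustrative) and checks that the direct formula and the eigendecomposition form agree:

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 4, 0.5
A = rng.normal(size=(d, d))
H = A @ A.T                      # symmetric PSD stand-in for the Hessian
theta_star = rng.normal(size=d)  # stand-in for the unregularized optimum

# Direct form: theta_R ~= (H + alpha I)^{-1} H theta*
theta_r_direct = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)

# Eigendecomposition form: theta_R ~= Q (Lambda + alpha I)^{-1} Lambda Q^T theta*
lams, Q = np.linalg.eigh(H)      # H = Q diag(lams) Q^T
shrink = lams / (lams + alpha)   # direction i is scaled by lam_i / (lam_i + alpha)
theta_r_eigen = Q @ (shrink * (Q.T @ theta_star))
```

Directions with large curvature (λᵢ ≫ α) are barely shrunk, while directions with small curvature (λᵢ ≪ α) are shrunk nearly to zero.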
Effect on the optimal solution
Figure from Deep Learning, Goodfellow, Bengio and Courville. Notation: θ* = w*, θ_R* = w̃.
l1 regularization
min_θ L̂_R(θ) = L̂(θ) + α ‖θ‖₁
• Effect on (stochastic) gradient descent
• Effect on the optimal solution
Effect on gradient descent
• Gradient of the regularized objective:
∇L̂_R(θ) = ∇L̂(θ) + α sign(θ)
where sign applies elementwise to θ
• Gradient descent update:
θ ← θ − η ∇L̂_R(θ) = θ − η ∇L̂(θ) − ηα sign(θ)
Effect on the optimal solution
• Consider a quadratic approximation around θ*:
L̂(θ) ≈ L̂(θ*) + (θ − θ*)ᵀ ∇L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*)
• Since θ* is optimal, ∇L̂(θ*) = 0, so
L̂(θ) ≈ L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*)
Effect on the optimal solution
• Further assume that H is diagonal with positive entries (Hᵢᵢ > 0 for all i)
  • not true in general, but assumed here to build intuition
• The regularized objective is then (ignoring constants)
L̂_R(θ) ≈ Σᵢ [ ½ Hᵢᵢ (θᵢ − θᵢ*)² + α |θᵢ| ]
• The optimum θ_R* satisfies
(θ_R*)ᵢ ≈ max{θᵢ* − α/Hᵢᵢ, 0} if θᵢ* ≥ 0
(θ_R*)ᵢ ≈ min{θᵢ* + α/Hᵢᵢ, 0} if θᵢ* < 0
Effect on the optimal solution
• Effect: induces sparsity; any coordinate with |θᵢ*| ≤ α/Hᵢᵢ is set exactly to zero
(Figure: (θ_R*)ᵢ plotted against (θ*)ᵢ, flat at zero on the interval [−α/Hᵢᵢ, α/Hᵢᵢ].)
Effect on the optimal solution
• Further assume that H is diagonal
• Compact expression for the optimum θ_R*:
(θ_R*)ᵢ ≈ sign(θᵢ*) max{ |θᵢ*| − α/Hᵢᵢ, 0 }
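This compact expression is the elementwise soft-thresholding operator. The NumPy sketch below (with illustrative values for θ*, α, and the diagonal of H) implements it and confirms it matches the piecewise max/min form given earlier:

```python
import numpy as np

def soft_threshold(theta_star, thresh):
    """Compact form: (theta_R)_i = sign(theta*_i) * max(|theta*_i| - thresh_i, 0)."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - thresh, 0.0)

# Illustrative values (not from the lecture)
theta_star = np.array([1.5, 0.2, -0.2, -1.5, 0.0])
H_diag = np.array([1.0, 1.0, 2.0, 2.0, 1.0])  # assumed diagonal of H
alpha = 0.5
thresh = alpha / H_diag

compact = soft_threshold(theta_star, thresh)

# Piecewise form: max(theta* - a/H, 0) if theta* >= 0, else min(theta* + a/H, 0)
piecewise = np.where(theta_star >= 0,
                     np.maximum(theta_star - thresh, 0.0),
                     np.minimum(theta_star + thresh, 0.0))
```

Coordinates whose magnitude is below the threshold α/Hᵢᵢ are zeroed out exactly, which is the sparsity-inducing behavior illustrated in the preceding figure.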
Bayesian view
• l1 regularization corresponds to a Laplacian prior:
p(θ) ∝ exp(−α Σᵢ |θᵢ|)
log p(θ) = −α Σᵢ |θᵢ| + constant = −α ‖θ‖₁ + constant