Output Activation and Loss Functions
✓ Every neural net specifies
▪ an activation rule for the output unit(s)
▪ a loss defined in terms of the output activation
✓ First a bit of review…
Cheat Sheet 1
✓ Perceptron
▪ Activation function: $z_j = \sum_i w_{ji} x_i$, $\quad y_j = \begin{cases} 1 & \text{if } z_j > 0 \\ 0 & \text{otherwise} \end{cases}$
▪ Weight update: $\Delta w_{ji} = (t_j - y_j)\,x_i$ with $t_j \in \{0,1\}$ (assumes minimizing the number of misclassifications)
✓ Linear associator (a.k.a. linear regression)
▪ Activation function: $y_j = \sum_i w_{ji} x_i$
▪ Weight update: $\Delta w_{ji} = \epsilon\,(t_j - y_j)\,x_i$ with $t_j \in \mathbb{R}$ (assumes minimizing squared-error loss)
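As a concrete illustration of the two update rules above, here is a minimal NumPy sketch for a single output unit (the function names and the learning rate `eps` are mine; the slides don't specify an implementation):

```python
import numpy as np

def perceptron_update(w, x, t):
    """One perceptron step: threshold activation, error-driven update."""
    y = 1.0 if w @ x > 0 else 0.0      # y = 1 if z > 0, else 0
    return w + (t - y) * x             # Delta w = (t - y) x

def linear_associator_update(w, x, t, eps=0.01):
    """One LMS step: linear activation, gradient of squared error."""
    y = w @ x                          # y = sum_i w_i x_i
    return w + eps * (t - y) * x       # Delta w = eps (t - y) x
```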
Cheat Sheet 2
✓ Two-layer net (a.k.a. logistic regression)
▪ Activation function: $z_j = \sum_i w_{ji} x_i$, $\quad y_j = \dfrac{1}{1 + \exp(-z_j)}$
▪ Weight update: $\Delta w_{ji} = \epsilon\,(t_j - y_j)\,y_j(1 - y_j)\,x_i$ with $t_j \in [0,1]$ (assumes minimizing squared-error loss)
✓ Deep(er) net
▪ Activation function: same as above
▪ Weight update: $\Delta w_{ji} = \epsilon\,\delta_j x_i$, where
$\delta_j = \begin{cases} (t_j - y_j)\,y_j(1 - y_j) & \text{for an output unit} \\ \left(\sum_k w_{kj}\delta_k\right) y_j(1 - y_j) & \text{for a hidden unit} \end{cases}$
(assumes minimizing squared-error loss)
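A minimal sketch of these delta rules for a net with one hidden layer, logistic units throughout, and squared-error loss (array names and shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eps=0.1):
    # Forward pass
    h = logistic(W_hid @ x)                  # hidden activations
    y = logistic(W_out @ h)                  # output activations
    # Deltas under squared-error loss
    d_out = (t - y) * y * (1 - y)            # output units
    d_hid = (W_out.T @ d_out) * h * (1 - h)  # hidden units
    # Weight updates: Delta w_ji = eps * delta_j * x_i
    W_out += eps * np.outer(d_out, h)
    W_hid += eps * np.outer(d_hid, x)
    return W_hid, W_out
```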
Squared Error Loss
✓ Sensible regardless of output range and output activation function
$E = \frac{1}{2}\sum_j (t_j - y_j)^2 \qquad \frac{\partial E}{\partial y_j} = y_j - t_j$
Remember: $\Delta\mathbf{w} = -\epsilon\,\dfrac{\partial E}{\partial \mathbf{y}}\,\cdots$
✓ Output activation functions and their derivatives
▪ with logistic output unit: $y_j = \dfrac{1}{1+\exp(-z_j)}$ and $\dfrac{\partial y_j}{\partial z_j} = y_j(1 - y_j)$
▪ with tanh output unit: $y_j = \tanh(z_j) = \dfrac{2}{1+\exp(-2z_j)} - 1$ and $\dfrac{\partial y_j}{\partial z_j} = (1 + y_j)(1 - y_j)$
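A quick finite-difference check of the two derivative identities above (this sketch is mine):

```python
import numpy as np

z = np.linspace(-3, 3, 7)
h = 1e-6

# Logistic: dy/dz should equal y(1 - y)
y = 1 / (1 + np.exp(-z))
num = (1 / (1 + np.exp(-(z + h))) - y) / h
assert np.allclose(num, y * (1 - y), atol=1e-4)

# Tanh: dy/dz should equal (1 + y)(1 - y)
y = np.tanh(z)
num = (np.tanh(z + h) - y) / h
assert np.allclose(num, (1 + y) * (1 - y), atol=1e-4)
```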
Logistic vs. Tanh
✓ Logistic
▪ Output = .5 when there is no input evidence and bias = 0
▪ Will trigger activation in the next layer
▪ Need large biases to neutralize; biases end up on a different scale than the other weights
▪ Does not satisfy the weight-initialization assumption of mean activation = 0
✓ Tanh
▪ Output = 0 when there is no input evidence and bias = 0
▪ Won't trigger activation in the next layer
▪ Don't need large biases
▪ Satisfies the weight-initialization assumption
Cross Entropy Loss
✓ Used when the target output represents a probability distribution
▪ e.g., a single output unit that indicates the classification decision (yes, no) for an input
▪ Output $y \in [0,1]$ denotes the Bernoulli likelihood of class membership
▪ Target $t$ indicates the true class probability (typically 0 or 1)
▪ Note: a single value represents a probability distribution over 2 alternatives
✓ Cross entropy, $H$, measures the distance (in nats, given the natural log) from the predicted distribution to the target distribution
$E = H = -t\ln y - (1 - t)\ln(1 - y) \qquad \dfrac{\partial E}{\partial y} = \dfrac{y - t}{y(1 - y)}$
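A small sketch of this loss and its gradient (mine; the clipping constant `tiny` is a standard guard against log(0), not part of the slide's math):

```python
import numpy as np

def xent(y, t, tiny=1e-12):
    """Cross entropy E = -t ln y - (1-t) ln(1-y), clipped to avoid log(0)."""
    y = np.clip(y, tiny, 1 - tiny)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

def dxent_dy(y, t, tiny=1e-12):
    """Gradient dE/dy = (y - t) / (y (1 - y))."""
    y = np.clip(y, tiny, 1 - tiny)
    return (y - t) / (y * (1 - y))

print(xent(0.9, 1.0))      # -ln(0.9) ~ 0.105 nats
print(dxent_dy(0.9, 1.0))  # (0.9 - 1) / (0.9 * 0.1) ~ -1.11
```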
Squared Error Versus Cross Entropy
✓ With a logistic output unit, $\dfrac{\partial y}{\partial z} = y(1 - y)$ in both cases, so:
▪ Squared error: $\dfrac{\partial E_\text{sqerr}}{\partial y} = y - t$ and $\dfrac{\partial E_\text{sqerr}}{\partial z} = (y - t)\,y(1 - y)$
▪ Cross entropy: $\dfrac{\partial E_\text{xentropy}}{\partial y} = \dfrac{y - t}{y(1 - y)}$ and $\dfrac{\partial E_\text{xentropy}}{\partial z} = y - t$
✓ Essentially, cross entropy does not suppress learning when the output is confident (near 0 or 1)
▪ the net devotes its efforts to fitting target values exactly
▪ e.g., consider the situation where $y = .99$ and $t = 1$, made concrete in the sketch below
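Comparing the two gradients with respect to $z$ at $y = .99$, $t = 1$, assuming a logistic output unit:

```python
y, t = 0.99, 1.0

dE_sqerr_dz = (y - t) * y * (1 - y)  # ~ -9.9e-05: learning nearly stalls
dE_xent_dz  = (y - t)                # = -0.01: ~100x larger error signal
print(dE_sqerr_dz, dE_xent_dz)
```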
Maximum Likelihood Estimation
✓ In statistics, many parameter estimation problems are formulated in terms of maximizing the likelihood of the data
▪ find model parameters that maximize the likelihood of the data under the model
▪ e.g., 10 coin flips producing 8 heads and 2 tails: what is the coin's bias? (see the sketch below)
✓ Likelihood formulation
▪ $\mathcal{L} = y^t (1 - y)^{1-t}$ for $t \in \{0,1\}$
▪ $\ell = \ln \mathcal{L} = t\ln y + (1 - t)\ln(1 - y)$
▪ What's the relationship between $\ell$ and $E_\text{xentropy}$?
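For the coin example, a grid search over candidate biases maximizes $\ell$ summed over the 10 flips and recovers the intuitive answer of 0.8; note that maximizing $\ell$ is the same as minimizing $E_\text{xentropy}$, since $\ell = -E_\text{xentropy}$ (sketch is mine):

```python
import numpy as np

flips = np.array([1.0] * 8 + [0.0] * 2)  # 8 heads, 2 tails
ys = np.linspace(0.01, 0.99, 99)         # candidate coin biases

# Log likelihood of the data for each candidate bias y
ll = [np.sum(flips * np.log(y) + (1 - flips) * np.log(1 - y)) for y in ys]
print(ys[np.argmax(ll)])                 # -> 0.8, the MLE
```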
Probabilistic Interpretation of Squared-Error Loss
✓ Consider a network output and target $y, t \in \mathbb{R}$
✓ Suppose that the output is corrupted by Gaussian observation noise
▪ $y = t + \eta$, where $\eta \sim \text{Gaussian}(0, 1)$
✓ We can define the likelihood of the target under this noise model
▪ $P(t \mid y) = \dfrac{1}{\sqrt{2\pi}}\exp\left(-\dfrac{1}{2}(t - y)^2\right)$
Probabilistic Interpretation of Squared-Error Loss
✓ For a set of training examples, $\alpha \in \{1, 2, 3, \ldots\}$, we can define the data set likelihood
▪ $\mathcal{L} = \prod_\alpha P(t^\alpha \mid y^\alpha)$
▪ $\ell = \ln\mathcal{L} = \sum_\alpha \ln P(t^\alpha \mid y^\alpha) = -\dfrac{1}{2}\sum_\alpha (t^\alpha - y^\alpha)^2 + \text{const}$, where $P(t \mid y) = \dfrac{1}{\sqrt{2\pi}}\exp\left(-\dfrac{1}{2}(t - y)^2\right)$
✓ Squared error can be viewed as a (negative log) likelihood under Gaussian observation noise
▪ $E_\text{sqerr} = -c\,\ell$ (up to the additive constant)
✓ Other noise distributions can motivate alternative losses
▪ e.g., Laplace-distributed noise and $E_\text{abserr} = |t - y|$
▪ What is $\dfrac{\partial E_\text{abserr}}{\partial y}$?
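A numerical sanity check of the claim (mine): under the unit-variance Gaussian noise model, $-\ell$ and $E_\text{sqerr}$ differ only by an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)               # network outputs
t = y + rng.normal(size=5)           # targets = outputs + unit Gaussian noise

logP = -0.5 * (t - y)**2 - 0.5 * np.log(2 * np.pi)  # per-example ln P(t|y)
ell = logP.sum()                     # data set log likelihood
sqerr = 0.5 * np.sum((t - y)**2)     # E_sqerr

# -ell and E_sqerr differ only by the constant (n/2) ln(2*pi), here n = 5
assert np.isclose(-ell, sqerr + 2.5 * np.log(2 * np.pi))
```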
Categorical Outputs
✓ We considered the case where the output $y$ denotes the probability of class membership
▪ belonging to class $A$ versus $\bar{A}$
✓ Instead of two possible categories, suppose there are $n$
▪ e.g., animal, vegetable, mineral
Categorical Outputs
✓ Each input can belong to exactly one category
▪ $y_j$ denotes the probability that the input's category is $j$
✓ To interpret $\mathbf{y}$ as a probability distribution over the alternatives
▪ $\sum_j y_j = 1$ and $0 \le y_j \le 1$
✓ Activation function
▪ $y_j = \dfrac{\exp(z_j)}{\sum_k \exp(z_k)}$
▪ exponentiation ensures nonnegative values; the denominator ensures they sum to 1
✓ Known as softmax, and formerly, the Luce choice rule
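A sketch of softmax (mine); subtracting the max before exponentiating is a standard stability trick, valid because adding a constant to every $z_k$ leaves the softmax unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()         # nonnegative, sums to 1

p = softmax(np.array([2.0, 1.0, 0.1]))  # e.g., animal, vegetable, mineral
print(p, p.sum())                       # probabilities summing to 1
```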
Derivatives For Categorical Outputs
✓ For the softmax output function
$z_j = \sum_i w_{ji} x_i \qquad y_j = \dfrac{\exp(z_j)}{\sum_k \exp(z_k)}$
✓ Weight update is the same as for the two-category case...
✓ ...when expressed in terms of $\mathbf{y}$:
$\Delta w_{ji} = \epsilon\,\delta_j x_i \quad\text{where}\quad \delta_j = \begin{cases} \dfrac{\partial E}{\partial y_j}\,y_j(1 - y_j) & \text{for an output unit} \\ \left(\sum_k w_{kj}\delta_k\right) y_j(1 - y_j) & \text{for a hidden unit} \end{cases}$
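One way to see why the update matches the two-category case: with the multi-category cross-entropy loss $E = -\sum_j t_j \ln y_j$ (an assumption here; the slide leaves $E$ generic), the gradient with respect to $z$ collapses to $y - t$, just as before. A finite-difference check (mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.5, -0.3, 0.8])
t = np.array([0.0, 1.0, 0.0])                  # one-hot target
E = lambda z: -np.sum(t * np.log(softmax(z)))  # multi-category cross entropy

h = 1e-6
grad = np.array([(E(z + h * np.eye(3)[j]) - E(z)) / h for j in range(3)])
assert np.allclose(grad, softmax(z) - t, atol=1e-4)  # dE/dz = y - t
```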
Rectified Linear Unit (ReLU)
✓ Activation function and derivative
▪ $y = \max(0, z)$, $\quad \dfrac{\partial y}{\partial z} = \begin{cases} 0 & z \le 0 \\ 1 & \text{otherwise} \end{cases}$
✓ Advantages
▪ fast to compute activation and derivatives
▪ no squashing of the back-propagated error signal as long as the unit is activated
▪ sparsity?
✓ Disadvantages
▪ discontinuity in the derivative at $z = 0$
▪ can potentially lead to exploding gradients and activations
▪ may waste units: units that are never activated above threshold won't learn
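A minimal sketch of the activation and its derivative (names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)       # y = max(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)    # 0 for z <= 0, 1 otherwise
```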
Leaky ReLU
✓ Activation function and derivative
▪ $y = \begin{cases} z & z > 0 \\ \alpha z & \text{otherwise} \end{cases}$, $\quad \dfrac{\partial y}{\partial z} = \begin{cases} 1 & z > 0 \\ \alpha & \text{otherwise} \end{cases}$
✓ Reduces to standard ReLU if $\alpha = 0$
✓ Trade-off
▪ $\alpha = 0$ leads to inefficient use of resources (underutilized units)
▪ $\alpha = 1$ loses the nonlinearity essential for interesting computation
Softplus
✓ Activation function and derivative
▪ $y = \ln(1 + e^z)$, $\quad \dfrac{\partial y}{\partial z} = \dfrac{1}{1 + e^{-z}} = \text{logistic}(z)$
✓ The derivative is
▪ defined everywhere
▪ zero only as $z \to -\infty$
Exponential Linear Unit (ELU)
✓ Activation function and derivative
▪ $y = \begin{cases} z & z > 0 \\ \alpha\,(e^z - 1) & \text{otherwise} \end{cases}$, $\quad \dfrac{\partial y}{\partial z} = \begin{cases} 1 & z > 0 \\ \alpha\,e^z & \text{otherwise} \end{cases}$
✓ Reduces to standard ReLU if $\alpha = 0$
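A finite-difference check of the Leaky ReLU, softplus, and ELU derivatives above (sketch is mine; $\alpha = 0.1$ is an arbitrary illustrative value):

```python
import numpy as np

alpha = 0.1   # illustrative slope/scale; the slides leave alpha unspecified

def leaky_relu(z): return np.where(z > 0, z, alpha * z)
def softplus(z):   return np.log1p(np.exp(z))
def elu(z):        return np.where(z > 0, z, alpha * (np.exp(z) - 1))

# Compare each stated derivative to a finite difference, away from z = 0
z, h = np.array([-2.0, -0.5, 0.5, 2.0]), 1e-6
for f, df in [(leaky_relu, np.where(z > 0, 1.0, alpha)),
              (softplus,   1 / (1 + np.exp(-z))),
              (elu,        np.where(z > 0, 1.0, alpha * np.exp(z)))]:
    assert np.allclose((f(z + h) - f(z)) / h, df, atol=1e-4)
```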
Radial Basis Functions
✓ Activation function
▪ $y = \exp\left(-\lVert \mathbf{x} - \mathbf{w} \rVert^2\right)$
✓ Sparse activation
▪ many units just don't learn
▪ same issue as with ReLUs
✓ Clever schemes to initialize weights
▪ e.g., set $\mathbf{w}$ near a cluster of $\mathbf{x}$'s
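A minimal sketch of an RBF unit (mine; the center $\mathbf{w}$ and test points are arbitrary):

```python
import numpy as np

def rbf(x, w):
    """Radial basis unit: response falls off with distance from center w."""
    return np.exp(-np.sum((x - w) ** 2))

w = np.array([1.0, 2.0])              # e.g., center set near a cluster of x's
print(rbf(np.array([1.1, 2.0]), w))   # near the center -> close to 1
print(rbf(np.array([4.0, 0.0]), w))   # far from the center -> near 0
```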
Image credits: www.dtreg.com
bio.felk.cvut.cz
playground.tensorflow.org