Output Activation and Loss Functions
✓ Every neural net specifies
▪ an activation rule for the output unit(s)
▪ a loss defined in terms of the output activation
✓ First a bit of review…
Cheat Sheet 1
✓ Perceptron
▪ Activation function: $z_j = \sum_i w_{ji} x_i$, $\quad y_j = \begin{cases} 1 & \text{if } z_j > 0 \\ 0 & \text{otherwise} \end{cases}$
▪ Weight update: $\Delta w_{ji} = (t_j - y_j)\, x_i$, with $t_j \in \{0, 1\}$ (assumes minimizing number of misclassifications)
✓ Linear associator (a.k.a. linear regression)
▪ Activation function: $y_j = \sum_i w_{ji} x_i$
▪ Weight update: $\Delta w_{ji} = \epsilon\, (t_j - y_j)\, x_i$, with $t_j \in \mathbb{R}$ (assumes minimizing squared-error loss)
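As a concrete illustration of the two update rules above, here is a minimal numpy sketch; the function names, array shapes, and the learning rate `eps` are illustrative choices, not from the slides:

```python
import numpy as np

def perceptron_update(w, x, t):
    """One perceptron step: threshold activation, error-driven update."""
    z = w @ x                       # net input: z = sum_i w_i * x_i
    y = 1.0 if z > 0 else 0.0       # threshold activation
    w = w + (t - y) * x             # Delta w_i = (t - y) * x_i,  t in {0, 1}
    return w, y

def linear_associator_update(w, x, t, eps=0.01):
    """One LMS step: linear activation, gradient of squared error."""
    y = w @ x                       # linear activation: y = sum_i w_i * x_i
    w = w + eps * (t - y) * x       # Delta w_i = eps * (t - y) * x_i
    return w, y

# toy usage with a single 3-dimensional input
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
w, y = perceptron_update(w, x, t=1.0)
```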
Cheat Sheet 2
✓ Two layer net (a.k.a. logistic regression)
▪ activation function: $z_j = \sum_i w_{ji} x_i$, $\quad y_j = \dfrac{1}{1 + \exp(-z_j)}$
▪ weight update: $\Delta w_{ji} = \epsilon\, (t_j - y_j)\, y_j (1 - y_j)\, x_i$, with $t_j \in [0, 1]$ (assumes minimizing squared-error loss)
✓ Deep(er) net
▪ activation function same as above
▪ weight update: $\Delta w_{ji} = \epsilon\, \delta_j\, x_i$, where $\delta_j = \begin{cases} (t_j - y_j)\, y_j (1 - y_j) & \text{for an output unit} \\ \left( \sum_k w_{kj}\, \delta_k \right) y_j (1 - y_j) & \text{for a hidden unit} \end{cases}$ (assumes minimizing squared-error loss)
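A minimal sketch of these updates for a single logistic output unit, plus the back-propagated delta for hidden units; it assumes the squared-error loss noted on the slide, and the names (`two_layer_update`, `hidden_delta`, `eps`) are my own:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_update(w, x, t, eps=0.1):
    """Logistic output unit trained on squared error."""
    z = w @ x
    y = logistic(z)
    delta = (t - y) * y * (1.0 - y)   # output-unit delta
    w = w + eps * delta * x           # Delta w_i = eps * delta * x_i
    return w, y

def hidden_delta(W_above, delta_above, y_hidden):
    """delta_j = (sum_k w_kj * delta_k) * y_j * (1 - y_j) for hidden units."""
    return (W_above.T @ delta_above) * y_hidden * (1.0 - y_hidden)
```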
Squared Error Loss
✓ Sensible regardless of output range and output activation function
▪ $E = \dfrac{1}{2} \sum_j (t_j - y_j)^2$, so $\dfrac{\partial E}{\partial y_j} = y_j - t_j$
✓ Derivative of the output activation
▪ with logistic output unit: $y_j = \dfrac{1}{1 + \exp(-z_j)}$, $\quad \dfrac{\partial y_j}{\partial z_j} = y_j (1 - y_j)$
▪ with tanh output unit: $y_j = \tanh(z_j) = \dfrac{2}{1 + \exp(-2 z_j)} - 1$, $\quad \dfrac{\partial y_j}{\partial z_j} = (1 + y_j)(1 - y_j)$
✓ Remember: $\Delta \mathbf{w} = -\epsilon\, \dfrac{\partial E}{\partial y} \cdots$ (the chain rule supplies the remaining factors)
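A quick finite-difference check of the two activation derivatives quoted above; the step size and test point are arbitrary:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6

# logistic: dy/dz should equal y * (1 - y)
y = logistic(z)
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)
assert np.isclose(numeric, y * (1 - y))

# tanh: dy/dz should equal (1 + y) * (1 - y)
y = np.tanh(z)
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)
assert np.isclose(numeric, (1 + y) * (1 - y))
```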
Logistic vs. Tanh
✓ Logistic
▪ output = 0.5 when there is no input evidence (bias = 0)
▪ will trigger activation in the next layer
▪ need large biases to neutralize, so biases are on a different scale than the other weights
▪ does not satisfy the weight-initialization assumption of mean activation = 0
✓ Tanh
▪ output = 0 when there is no input evidence (bias = 0)
▪ won't trigger activation in the next layer
▪ don't need large biases
▪ satisfies the weight-initialization assumption
Cross Entropy Loss
✓ Used when the target output represents a probability distribution
▪ e.g., a single output unit that indicates the classification decision (yes, no) for an input
▪ Output $y \in [0, 1]$ denotes the Bernoulli likelihood of class membership
▪ Target $t$ indicates the true class probability (typically 0 or 1)
▪ Note: a single value represents a probability distribution over 2 alternatives
✓ Cross entropy, $H$, measures the distance (in bits, when base-2 logs are used) from the predicted distribution to the target distribution
✓ $E = H = -t \ln y - (1 - t) \ln(1 - y)$, so $\dfrac{\partial E}{\partial y} = \dfrac{y - t}{y(1 - y)}$ (sketch below)
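A minimal sketch of the binary cross-entropy loss and its gradient with respect to $y$; the clipping constant is my own guard against $\ln 0$, not part of the slide's formula:

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """E = -t ln y - (1 - t) ln(1 - y)."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0) at y = 0 or y = 1
    return -t * np.log(y) - (1.0 - t) * np.log(1.0 - y)

def dE_dy(y, t):
    """dE/dy = (y - t) / (y (1 - y))."""
    return (y - t) / (y * (1.0 - y))
```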
Squared Error Versus Cross Entropy
✓ Squared error (with a logistic output unit)
▪ $\dfrac{\partial E_{\text{sqerr}}}{\partial y} = y - t$, $\quad \dfrac{\partial y}{\partial z} = y(1 - y)$, so $\dfrac{\partial E_{\text{sqerr}}}{\partial z} = (y - t)\, y (1 - y)$
✓ Cross entropy (with a logistic output unit)
▪ $\dfrac{\partial E_{\text{xentropy}}}{\partial y} = \dfrac{y - t}{y(1 - y)}$, $\quad \dfrac{\partial y}{\partial z} = y(1 - y)$, so $\dfrac{\partial E_{\text{xentropy}}}{\partial z} = y - t$
✓ Essentially, cross entropy does not suppress learning when the output is confident (near 0 or 1)
▪ the net devotes its efforts to fitting the target values exactly
▪ e.g., consider the situation where $y = 0.99$ and $t = 1$ (numeric check below)
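Plugging in those numbers makes the contrast concrete: with a logistic output at $y = 0.99$ and target $t = 1$, the squared-error gradient with respect to $z$ is tiny, while the cross-entropy gradient is not.

```python
y, t = 0.99, 1.0

dEsq_dz = (y - t) * y * (1 - y)   # squared error: about -9.9e-5, learning nearly stalls
dEx_dz  = (y - t)                 # cross entropy: -0.01, still pushes y toward 1
print(dEsq_dz, dEx_dz)
```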
Maximum Likelihood Estimation
✓ In statistics, many parameter estimation problems are formulated in terms of maximizing the likelihood of the data
▪ find model parameters that maximize the likelihood of the data under the model
▪ e.g., 10 coin flips producing 8 heads and 2 tails: what is the coin's bias? (worked out below)
✓ Likelihood formulation
▪ $\mathcal{L} = y^{t} (1 - y)^{1 - t}$ for $t \in \{0, 1\}$
▪ $\ell = \ln \mathcal{L} = t \ln y + (1 - t) \ln(1 - y)$
▪ What's the relationship between $\ell$ and $E_{\text{xentropy}}$?
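For the coin example, extend the single-observation likelihood above to $n$ independent flips with $h$ heads ($n = 10$, $h = 8$); maximizing the log likelihood recovers the intuitive estimate:

$$\mathcal{L} = y^{h}(1 - y)^{n - h}, \qquad \ell = h \ln y + (n - h)\ln(1 - y), \qquad \frac{d\ell}{dy} = \frac{h}{y} - \frac{n - h}{1 - y} = 0 \;\;\Rightarrow\;\; \hat{y} = \frac{h}{n} = 0.8$$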
Probabilistic Interpretation of Squared-Error Loss
✓ Consider a network output and target $y, t \in \mathbb{R}$
✓ Suppose that the output is corrupted by Gaussian observation noise
▪ $y = t + \eta$, where $\eta \sim \text{Gaussian}(0, 1)$
✓ We can define the likelihood of the target under this noise model
▪ $P(t \mid y) = \dfrac{1}{\sqrt{2\pi}} \exp\!\left(-\dfrac{1}{2}(t - y)^2\right)$
Probabilistic Interpretation of Squared-Error Loss
✓ For a set of training examples, $\alpha \in \{1, 2, 3, \dots\}$, we can define the data set likelihood
▪ $\mathcal{L} = \prod_\alpha P(t_\alpha \mid y_\alpha)$
▪ $\ell = \ln \mathcal{L} = \sum_\alpha \ln P(t_\alpha \mid y_\alpha)$, where $P(t \mid y) = \dfrac{1}{\sqrt{2\pi}} \exp\!\left(-\dfrac{1}{2}(t - y)^2\right)$
▪ so $\ell = -\dfrac{1}{2} \sum_\alpha (t_\alpha - y_\alpha)^2$ (up to an additive constant)
✓ Squared error can be viewed as (negative log) likelihood under Gaussian observation noise
▪ $E_{\text{sqerr}} = -c\, \ell$
✓ Other noise distributions can motivate alternative losses
▪ e.g., Laplace-distributed noise and $E_{\text{abserr}} = |t - y|$
▪ What is $\dfrac{\partial E_{\text{abserr}}}{\partial y}$?
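A quick numeric check that, per example, the negative log likelihood under the unit-variance Gaussian noise model is the squared error plus a constant (the test values are arbitrary):

```python
import numpy as np

def neg_log_lik(t, y):
    """-ln P(t | y) for unit-variance Gaussian observation noise."""
    return 0.5 * np.log(2 * np.pi) + 0.5 * (t - y) ** 2

t, y = 1.3, 0.8
const = 0.5 * np.log(2 * np.pi)
assert np.isclose(neg_log_lik(t, y) - const, 0.5 * (t - y) ** 2)
```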
Categorical Outputs
✓ We considered the case where the output 𝒚 denotes the probability of
class membership
▪ belonging to class $A$ versus $\bar{A}$
✓ Instead of two possible categories, suppose there are n
▪ e.g., animal, vegetable, mineral
Categorical Outputs
✓ Each input can belong to one category
▪ $y_j$ denotes the probability that the input's category is $j$
✓ To interpret $\mathbf{y}$ as a probability distribution over the alternatives
▪ $\sum_j y_j = 1$ and $0 \le y_j \le 1$
✓ Activation function
▪ $y_j = \dfrac{\exp(z_j)}{\sum_k \exp(z_k)}$
▪ exponentiation ensures nonnegative values; the denominator ensures the outputs sum to 1
✓ Known as softmax, and formerly, the Luce choice rule
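A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick (my addition, it does not change the result):

```python
import numpy as np

def softmax(z):
    """y_j = exp(z_j) / sum_k exp(z_k), computed stably."""
    z = z - np.max(z)   # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y, y.sum())       # nonnegative values that sum to 1
```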
Derivatives For Categorical Outputs
✓ For the softmax output function
▪ $z_j = \sum_i w_{ji} x_i$, $\quad y_j = \dfrac{\exp(z_j)}{\sum_k \exp(z_k)}$
✓ Weight update is the same as for the two-category case!
✓ …when expressed in terms of $\mathbf{y}$
▪ $\Delta w_{ji} = \epsilon\, \delta_j\, x_i$, where $\delta_j = \begin{cases} \dfrac{\partial E}{\partial y_j}\, y_j (1 - y_j) & \text{for an output unit} \\ \left( \sum_k w_{kj}\, \delta_k \right) y_j (1 - y_j) & \text{for a hidden unit} \end{cases}$
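To see why the update matches the two-category case, here is a finite-difference check, assuming the cross-entropy loss $E = -\sum_j t_j \ln y_j$ with a softmax output (the example vectors are arbitrary): the gradient with respect to the net input reduces to $y_j - t_j$, just as with a logistic output and binary cross entropy.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, t):
    """E = -sum_j t_j ln y_j with y = softmax(z)."""
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.2])
t = np.array([0.0, 1.0, 0.0])      # one-hot target
h = 1e-6

grad = np.array([(cross_entropy(z + h * np.eye(3)[j], t) -
                  cross_entropy(z - h * np.eye(3)[j], t)) / (2 * h)
                 for j in range(3)])

assert np.allclose(grad, softmax(z) - t, atol=1e-6)   # dE/dz_j = y_j - t_j
```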
Rectified Linear Unit (ReLU)
✓ Activation function: $y = \max(0, z)$
✓ Derivative: $\dfrac{\partial y}{\partial z} = \begin{cases} 0 & z \le 0 \\ 1 & \text{otherwise} \end{cases}$
✓ Advantages
▪ fast to compute activation and derivatives
▪ no squashing of the back-propagated error signal as long as the unit is activated
▪ sparsity?
✓ Disadvantages
▪ discontinuity in the derivative at $z = 0$
▪ can potentially lead to exploding gradients and activations
▪ may waste units: units that are never activated above threshold won't learn
[plot: ReLU activation $y$ vs. net input $z$]
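A minimal vectorized sketch of ReLU and its derivative; the convention that the derivative is 0 at exactly $z = 0$ follows the slide:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """dy/dz: 0 for z <= 0, 1 otherwise."""
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_grad(z))   # [0. 0. 3.] [0. 0. 1.]
```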
Leaky ReLU
✓ Activation function: $y = \begin{cases} z & z > 0 \\ \alpha z & \text{otherwise} \end{cases}$
✓ Derivative: $\dfrac{\partial y}{\partial z} = \begin{cases} 1 & z > 0 \\ \alpha & \text{otherwise} \end{cases}$
✓ Reduces to standard ReLU if $\alpha = 0$
✓ Trade-off
▪ $\alpha = 0$ leads to inefficient use of resources (underutilized units)
▪ $\alpha = 1$ loses the nonlinearity essential for interesting computation
[plot: leaky ReLU activation $y$ vs. net input $z$]
Softplus
✓ Activation function: $y = \ln(1 + e^{z})$
✓ Derivative: $\dfrac{\partial y}{\partial z} = \dfrac{1}{1 + e^{-z}} = \text{logistic}(z)$
▪ defined everywhere
▪ zero only for $z \to -\infty$
[plot: softplus activation $y$ vs. net input $z$]
Exponential Linear Unit (ELU)
✓ Activation function: $y = \begin{cases} z & z > 0 \\ \alpha\,(e^{z} - 1) & \text{otherwise} \end{cases}$
✓ Derivative: $\dfrac{\partial y}{\partial z} = \begin{cases} 1 & z > 0 \\ \alpha\, e^{z} & \text{otherwise} \end{cases}$
✓ Reduces to standard ReLU if $\alpha = 0$
[two plots: $y$ vs. $z$]
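The leaky ReLU and ELU formulas above are the standard definitions (an assumption on my part, consistent with the "reduces to ReLU" remarks, since the slides show them only as plots). Below is a sketch of the three variants and their derivatives; the default alpha values are illustrative:

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    """y = z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z), np.where(z > 0, 1.0, alpha)

def softplus(z):
    """y = ln(1 + e^z); the derivative is the logistic function."""
    return np.log1p(np.exp(z)), 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    """y = z for z > 0, alpha * (e^z - 1) otherwise."""
    return (np.where(z > 0, z, alpha * (np.exp(z) - 1.0)),
            np.where(z > 0, 1.0, alpha * np.exp(z)))

z = np.linspace(-3.0, 3.0, 7)
for fn in (leaky_relu, softplus, elu):
    y, dy = fn(z)
    print(fn.__name__, y.round(2), dy.round(2))
```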
Radial Basis Functions
✓ Activation function
▪ $y = \exp\!\left(-\lVert \mathbf{x} - \mathbf{w} \rVert^2\right)$
✓ Sparse activation
▪ many units just don't learn
▪ same issue as with ReLUs
✓ Clever schemes to initialize weights
▪ e.g., set $\mathbf{w}$ near a cluster of $\mathbf{x}$'s
[plot: inputs $\mathbf{x}$ scattered around the weight vector $\mathbf{w}$]
Image credits: www.dtreg.com
bio.felk.cvut.cz
playground.tensorflow.org
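A minimal sketch of a radial basis unit, with the kind of data-driven initialization the slide suggests; placing the weight vector at the mean of a cluster of inputs is just one illustrative choice:

```python
import numpy as np

def rbf_activation(x, w):
    """y = exp(-||x - w||^2)."""
    return np.exp(-np.sum((x - w) ** 2))

# illustrative initialization: put the unit's weight vector on a cluster of inputs
# so the unit starts out responsive to real data rather than staying silent
X_cluster = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1]])
w = X_cluster.mean(axis=0)

print(rbf_activation(np.array([1.1, 2.0]), w))   # near 1: input close to the center
print(rbf_activation(np.array([5.0, -3.0]), w))  # near 0: input far from the center
```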
