Logistic Regression
Jia-Bin Huang
Virginia Tech Spring 2019
ECE-5424G / CS-5824
Administrative
• Please start HW 1 early!
• Questions are welcome!
Two principles for estimating parameters
• Maximum Likelihood Estimate (MLE)
Choose 𝜃 that maximizes the probability of the observed data
𝜽MLE = argmax𝜃 𝑃(𝐷𝑎𝑡𝑎|𝜃)
• Maximum a posteriori estimation (MAP)
Choose 𝜃 that is most probable given the prior and the data
𝜽MAP = argmax𝜃 𝑃(𝜃|𝐷𝑎𝑡𝑎) = argmax𝜃 𝑃(𝐷𝑎𝑡𝑎|𝜃)𝑃(𝜃) / 𝑃(𝐷𝑎𝑡𝑎)
Slide credit: Tom Mitchell
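To make the two principles concrete, here is a minimal Python sketch (not part of the original slides) for a Bernoulli parameter with a Beta prior; the flip counts and the prior strength are made-up numbers for illustration:

```python
import numpy as np

# Observed data: 10 coin flips, 7 heads (hypothetical numbers).
data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

# MLE: the theta maximizing P(Data | theta) is the empirical frequency.
theta_mle = data.mean()  # 0.7

# MAP with a Beta(a, b) prior on theta (a = b = 3 encodes a mild prior
# belief that the coin is fair). The prior acts like (a - 1) imaginary
# heads and (b - 1) imaginary tails added to the data.
a, b = 3, 3
theta_map = (data.sum() + a - 1) / (len(data) + a + b - 2)

print(f"MLE: {theta_mle:.3f}")  # 0.700
print(f"MAP: {theta_map:.3f}")  # 0.643, pulled toward the prior mean 0.5
```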
Naïve Bayes classifier
• Want to learn 𝑃(𝑌|𝑋1, ⋯ , 𝑋𝑛)
• But this requires 2^𝑛 parameters...
• How about applying Bayes rule?
• 𝑃(𝑌|𝑋1, ⋯ , 𝑋𝑛) = 𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌)𝑃(𝑌) / 𝑃(𝑋1, ⋯ , 𝑋𝑛) ∝ 𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌)𝑃(𝑌)
• 𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌): Need (2^𝑛 − 1) × 2 parameters
• 𝑃(𝑌): Need 1 parameter
• Apply the conditional independence assumption
• 𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌) = Π𝑗 𝑃(𝑋𝑗|𝑌): Need 𝑛 × 2 parameters
Naïve Bayes classifier
• Bayes rule:
𝑃(𝑌 = 𝑦𝑘|𝑋1, ⋯ , 𝑋𝑛) = 𝑃(𝑌 = 𝑦𝑘)𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌 = 𝑦𝑘) / Σ𝑗 𝑃(𝑌 = 𝑦𝑗)𝑃(𝑋1, ⋯ , 𝑋𝑛|𝑌 = 𝑦𝑗)
• Assume conditional independence among the 𝑋𝑖’s:
𝑃(𝑌 = 𝑦𝑘|𝑋1, ⋯ , 𝑋𝑛) = 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘) / Σ𝑗 𝑃(𝑌 = 𝑦𝑗) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑗)
• Pick the most probable 𝑌
𝑌 ← argmax𝑦𝑘 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘)
Slide credit: Tom Mitchell
Example
• 𝑃(𝑌|𝑋1, 𝑋2) ∝ 𝑃(𝑌)𝑃(𝑋1, 𝑋2|𝑌) = 𝑃(𝑌)𝑃(𝑋1|𝑌)𝑃(𝑋2|𝑌)
(first step: Bayes rule; second step: conditional independence)
• Estimated parameters:
𝑃(𝑌 = 1) = 0.4   𝑃(𝑌 = 0) = 0.6
𝑃(𝑋1 = 1|𝑌 = 1) = 0.2   𝑃(𝑋1 = 0|𝑌 = 1) = 0.8
𝑃(𝑋1 = 1|𝑌 = 0) = 0.7   𝑃(𝑋1 = 0|𝑌 = 0) = 0.3
𝑃(𝑋2 = 1|𝑌 = 1) = 0.3   𝑃(𝑋2 = 0|𝑌 = 1) = 0.7
𝑃(𝑋2 = 1|𝑌 = 0) = 0.9   𝑃(𝑋2 = 0|𝑌 = 0) = 0.1
• Test example: 𝑋1 = 1, 𝑋2 = 0
• 𝑌 = 1: 𝑃(𝑌 = 1)𝑃(𝑋1 = 1|𝑌 = 1)𝑃(𝑋2 = 0|𝑌 = 1) = 0.4 × 0.2 × 0.7 = 0.056
• 𝑌 = 0: 𝑃(𝑌 = 0)𝑃(𝑋1 = 1|𝑌 = 0)𝑃(𝑋2 = 0|𝑌 = 0) = 0.6 × 0.7 × 0.1 = 0.042
• 0.056 > 0.042, so predict 𝑌 = 1
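A few lines of Python (not from the slides) reproduce the computation:

```python
# Parameters from the example above, stored as lookup tables.
p_y = {1: 0.4, 0: 0.6}
p_x1 = {(1, 1): 0.2, (1, 0): 0.7, (0, 1): 0.8, (0, 0): 0.3}  # (x1, y) -> P(X1=x1|Y=y)
p_x2 = {(1, 1): 0.3, (1, 0): 0.9, (0, 1): 0.7, (0, 0): 0.1}  # (x2, y) -> P(X2=x2|Y=y)

x1, x2 = 1, 0  # test example
scores = {y: p_y[y] * p_x1[(x1, y)] * p_x2[(x2, y)] for y in (0, 1)}
print(scores)                       # {0: 0.042, 1: 0.056} (up to float rounding)
print(max(scores, key=scores.get))  # 1
```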
Naïve Bayes algorithm – discrete Xi
• For each value 𝑦𝑘
Estimate 𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘)
For each value 𝑥𝑖𝑗 of each attribute 𝑋𝑖
Estimate 𝜃𝑖𝑗𝑘 = 𝑃(𝑋𝑖 = 𝑥𝑖𝑗|𝑌 = 𝑦𝑘)
• Classify 𝑋test
𝑌 ← argmax𝑦𝑘 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖^test|𝑌 = 𝑦𝑘)
𝑌 ← argmax𝑦𝑘 𝜋𝑘 Π𝑖 𝜃𝑖𝑗𝑘
Slide credit: Tom Mitchell
Estimating parameters: discrete 𝑌, 𝑋𝑖
• Maximum likelihood estimates (MLE)
𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘) = #𝐷{𝑌 = 𝑦𝑘} / |𝐷|
𝜃𝑖𝑗𝑘 = 𝑃(𝑋𝑖 = 𝑥𝑖𝑗|𝑌 = 𝑦𝑘) = #𝐷{𝑋𝑖 = 𝑥𝑖𝑗 ∧ 𝑌 = 𝑦𝑘} / #𝐷{𝑌 = 𝑦𝑘}
Slide credit: Tom Mitchell
• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month
𝑃(𝐹 = 1) =     𝑃(𝐹 = 0) =
𝑃(𝑆 = 1|𝐹 = 1) =     𝑃(𝑆 = 0|𝐹 = 1) =
𝑃(𝑆 = 1|𝐹 = 0) =     𝑃(𝑆 = 0|𝐹 = 0) =
𝑃(𝐷 = 1|𝐹 = 1) =     𝑃(𝐷 = 0|𝐹 = 1) =
𝑃(𝐷 = 1|𝐹 = 0) =     𝑃(𝐷 = 0|𝐹 = 0) =
𝑃(𝐺 = 1|𝐹 = 1) =     𝑃(𝐺 = 0|𝐹 = 1) =
𝑃(𝐺 = 1|𝐹 = 0) =     𝑃(𝐺 = 0|𝐹 = 0) =
𝑃(𝐹|𝑆, 𝐷, 𝐺) ∝ 𝑃(𝐹)𝑃(𝑆|𝐹)𝑃(𝐷|𝐹)𝑃(𝐺|𝐹)
Naïve Bayes: Subtlety #1
• Often the 𝑋𝑖 are not really conditionally independent
• Naïve Bayes often works pretty well anyway
• Often gives the right classification, even when not the right probability
[Domingos & Pazzani, 1996]
• What is the effect on the estimated 𝑃(𝑌|𝑋)?
• What if we have two copies of a feature: 𝑋𝑖 = 𝑋𝑘?
𝑃(𝑌 = 𝑦𝑘|𝑋1, ⋯ , 𝑋𝑛) ∝ 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘)
Slide credit: Tom Mitchell
Naïve Bayes: Subtlety #2
The MLE estimate for 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘) might be zero
(for example, 𝑋𝑖 = birthdate, with value 𝑋𝑖 = Feb_4_1995 never seen in training)
• Why worry about just one parameter out of many?
𝑃(𝑌 = 𝑦𝑘|𝑋1, ⋯ , 𝑋𝑛) ∝ 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘)
(a single zero factor forces the whole product to zero)
• What can we do to address this?
• MAP estimates (adding “imaginary” examples)
Slide credit: Tom Mitchell
Estimating parameters: discrete 𝑌, 𝑋𝑖
• Maximum likelihood estimates (MLE)
𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘) = #𝐷{𝑌 = 𝑦𝑘} / |𝐷|
𝜃𝑖𝑗𝑘 = 𝑃(𝑋𝑖 = 𝑥𝑖𝑗|𝑌 = 𝑦𝑘) = #𝐷{𝑋𝑖 = 𝑥𝑖𝑗 ∧ 𝑌 = 𝑦𝑘} / #𝐷{𝑌 = 𝑦𝑘}
• MAP estimates (Dirichlet priors):
𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘) = (#𝐷{𝑌 = 𝑦𝑘} + (𝛽𝑘 − 1)) / (|𝐷| + Σ𝑚(𝛽𝑚 − 1))
𝜃𝑖𝑗𝑘 = 𝑃(𝑋𝑖 = 𝑥𝑖𝑗|𝑌 = 𝑦𝑘) = (#𝐷{𝑋𝑖 = 𝑥𝑖𝑗 ∧ 𝑌 = 𝑦𝑘} + (𝛽𝑘 − 1)) / (#𝐷{𝑌 = 𝑦𝑘} + Σ𝑚(𝛽𝑚 − 1))
Slide credit: Tom Mitchell
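A minimal sketch of the difference (not from the slides), for binary features with a uniform Dirichlet prior 𝛽 = 2 on every value, i.e. one “imaginary” example per value (Laplace smoothing):

```python
import numpy as np

X = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])  # toy binary data (hypothetical)
y = np.array([1, 1, 0, 1])

beta, n_values = 2, 2  # Dirichlet hyperparameter; each feature has 2 values
for k in (0, 1):
    X_k = X[y == k]
    mle = X_k.mean(axis=0)  # raw frequencies: can be exactly 0 or 1
    map_ = (X_k.sum(axis=0) + (beta - 1)) / (len(X_k) + n_values * (beta - 1))
    print(f"y={k}: MLE P(Xi=1|Y) = {mle}, MAP = {map_}")

# For y=0 there is a single example, so the MLE puts probability 0 on
# X1=1 given Y=0; the MAP estimate keeps it strictly positive (1/3).
```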
What if we have continuous Xi
• Gaussian Naïve Bayes (GNB): assume
𝑃(𝑋𝑖 = 𝑥|𝑌 = 𝑦𝑘) = (1 / (𝜎𝑖𝑘√(2𝜋))) exp(−(𝑥 − 𝜇𝑖𝑘)² / (2𝜎𝑖𝑘²))
• Additional assumption on 𝜎𝑖𝑘:
• Is independent of 𝑌 (𝜎𝑖)
• Is independent of 𝑋𝑖 (𝜎𝑘)
• Is independent of 𝑋𝑖 and 𝑌 (𝜎)
Slide credit: Tom Mitchell
Naïve Bayes algorithm – continuous Xi
• For each value 𝑦𝑘
Estimate 𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘)
For each attribute 𝑋𝑖, estimate the
class-conditional mean 𝜇𝑖𝑘 and variance 𝜎𝑖𝑘²
• Classify 𝑋test
𝑌 ← argmax𝑦𝑘 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖^test|𝑌 = 𝑦𝑘)
𝑌 ← argmax𝑦𝑘 𝜋𝑘 Π𝑖 Normal(𝑋𝑖^test; 𝜇𝑖𝑘, 𝜎𝑖𝑘)
Slide credit: Tom Mitchell
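A compact Gaussian Naïve Bayes sketch following the algorithm above (not from the slides; the function names and toy data are made up). It works in log space, which avoids underflow when multiplying many small densities:

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate pi_k, mu_ik, sigma_ik for each class k and feature i."""
    classes = np.unique(y)
    pi = np.array([np.mean(y == k) for k in classes])           # P(Y = y_k)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])    # mu_ik
    sigma = np.array([X[y == k].std(axis=0) for k in classes])  # sigma_ik
    return classes, pi, mu, sigma

def gnb_predict(x, classes, pi, mu, sigma):
    # log pi_k + sum_i log Normal(x_i; mu_ik, sigma_ik), maximized over k
    log_pdf = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return classes[np.argmax(np.log(pi) + log_pdf.sum(axis=1))]

X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.5], [3.0, 0.7]])  # toy data
y = np.array([0, 0, 1, 1])
print(gnb_predict(np.array([1.1, 2.0]), *gnb_fit(X, y)))  # 0
```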
Things to remember
• Probability basics
• Conditional probability, joint probability, Bayes rule
• Estimating parameters from data
• Maximum likelihood (ML) maximize 𝑃(Data|𝜃)
• Maximum a posteriori estimation (MAP) maximize 𝑃(𝜃|Data)
• Naive Bayes
𝑃(𝑌 = 𝑦𝑘|𝑋1, ⋯ , 𝑋𝑛) ∝ 𝑃(𝑌 = 𝑦𝑘) Π𝑖 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘)
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
• Threshold classifier output ℎ𝜃(𝑥) at 0.5
• If ℎ𝜃(𝑥) ≥ 0.5, predict “𝑦 = 1”
• If ℎ𝜃(𝑥) < 0.5, predict “𝑦 = 0”
[Figure: malignant? 0 (No) / 1 (Yes) vs. tumor size, with a linear fit ℎ𝜃(𝑥) = 𝜃⊤𝑥]
Slide credit: Andrew Ng
Classification: 𝑦 = 1 or 𝑦 = 0
ℎ𝜃(𝑥) = 𝜃⊤𝑥 (from linear regression) can be > 1 or < 0
Logistic regression: 0 ≤ ℎ𝜃(𝑥) ≤ 1
Despite its name, logistic regression is actually a classification algorithm
Slide credit: Andrew Ng
Hypothesis representation
• Want 0 ≤ ℎ𝜃(𝑥) ≤ 1
• ℎ𝜃(𝑥) = 𝑔(𝜃⊤𝑥), where 𝑔(𝑧) = 1 / (1 + 𝑒^(−𝑧))
• 𝑔 is called the sigmoid (or logistic) function
ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
[Figure: the sigmoid 𝑔(𝑧), increasing from 0 to 1 with 𝑔(0) = 0.5]
Slide credit: Andrew Ng
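In code the hypothesis is two lines; a minimal sketch (not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~1.0
print(sigmoid(-10.0))  # ~0.0
```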
Interpretation of hypothesis output
• ℎ𝜃(𝑥) = estimated probability that 𝑦 = 1 on input 𝑥
• Example: if 𝑥 = [𝑥0; 𝑥1] = [1; tumorSize] and ℎ𝜃(𝑥) = 0.7,
tell the patient there is a 70% chance the tumor is malignant
Slide credit: Andrew Ng
Logistic regression
ℎ𝜃(𝑥) = 𝑔(𝜃⊤𝑥), 𝑔(𝑧) = 1 / (1 + 𝑒^(−𝑧))
Suppose we predict “y = 1” if ℎ𝜃(𝑥) ≥ 0.5, i.e., 𝑧 = 𝜃⊤𝑥 ≥ 0
and predict “y = 0” if ℎ𝜃(𝑥) < 0.5, i.e., 𝑧 = 𝜃⊤𝑥 < 0
[Figure: 𝑔(𝑧) crosses 0.5 at 𝑧 = 𝜃⊤𝑥 = 0]
Slide credit: Andrew Ng
Decision boundary
• ℎ𝜃(𝑥) = 𝑔(𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2)
E.g., 𝜃0 = −3, 𝜃1 = 1, 𝜃2 = 1
• Predict “𝑦 = 1” if −3 + 𝑥1 + 𝑥2 ≥ 0
[Figure: the line 𝑥1 + 𝑥2 = 3 separating the two classes in the (tumor size, age) plane]
Slide credit: Andrew Ng
• ℎ𝜃(𝑥) = 𝑔(𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 + 𝜃3𝑥1² + 𝜃4𝑥2²)
E.g., 𝜃0 = −1, 𝜃1 = 0, 𝜃2 = 0, 𝜃3 = 1, 𝜃4 = 1
• Predict “𝑦 = 1” if −1 + 𝑥1² + 𝑥2² ≥ 0
• Higher-order polynomial features give more complex boundaries:
ℎ𝜃(𝑥) = 𝑔(𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 + 𝜃3𝑥1² + 𝜃4𝑥1²𝑥2 + 𝜃5𝑥1²𝑥2² + 𝜃6𝑥1³𝑥2 + ⋯)
Slide credit: Andrew Ng
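A quick check of the circular boundary above (a sketch, not from the slides): with 𝜃 = (−1, 0, 0, 1, 1) over features (1, 𝑥1, 𝑥2, 𝑥1², 𝑥2²), “𝑦 = 1” is predicted exactly outside the unit circle 𝑥1² + 𝑥2² = 1:

```python
import numpy as np

theta = np.array([-1, 0, 0, 1, 1])

def predict(x1, x2):
    features = np.array([1, x1, x2, x1**2, x2**2])
    # g(z) >= 0.5 exactly when z >= 0, so threshold theta^T x at 0.
    return int(theta @ features >= 0)

print(predict(0.5, 0.5))  # 0 (inside the circle)
print(predict(1.0, 1.0))  # 1 (outside the circle)
```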
Where does the form come from?
• Logistic regression hypothesis representation
ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥)) = 1 / (1 + 𝑒^(−(𝜃0+𝜃1𝑥1+𝜃2𝑥2+⋯+𝜃𝑛𝑥𝑛)))
• Consider learning 𝑓: 𝑋 → 𝑌, where
• 𝑋 is a vector of real-valued features (𝑋1, ⋯ , 𝑋𝑛)⊤
• 𝑌 is Boolean
• Assume all 𝑋𝑖 are conditionally independent given 𝑌
• Model 𝑃(𝑋𝑖|𝑌 = 𝑦𝑘) as Gaussian 𝑁(𝜇𝑖𝑘, 𝜎𝑖)
• Model 𝑃(𝑌) as Bernoulli(𝜋)
What is 𝑃(𝑌|𝑋1, 𝑋2, ⋯ , 𝑋𝑛)?
Slide credit: Tom Mitchell
• Applying Bayes rule:
𝑃(𝑌 = 1|𝑋) = 𝑃(𝑌 = 1)𝑃(𝑋|𝑌 = 1) / (𝑃(𝑌 = 1)𝑃(𝑋|𝑌 = 1) + 𝑃(𝑌 = 0)𝑃(𝑋|𝑌 = 0))
• Dividing through by 𝑃(𝑌 = 1)𝑃(𝑋|𝑌 = 1):
= 1 / (1 + 𝑃(𝑌 = 0)𝑃(𝑋|𝑌 = 0) / (𝑃(𝑌 = 1)𝑃(𝑋|𝑌 = 1)))
• Applying exp(ln(⋅)):
= 1 / (1 + exp(ln(𝑃(𝑌 = 0)𝑃(𝑋|𝑌 = 0) / (𝑃(𝑌 = 1)𝑃(𝑋|𝑌 = 1)))))
= 1 / (1 + exp(ln((1 − 𝜋)/𝜋) + Σ𝑖 ln(𝑃(𝑋𝑖|𝑌 = 0)/𝑃(𝑋𝑖|𝑌 = 1))))
• Plugging in the Gaussian 𝑃(𝑥|𝑦𝑘) = (1/(𝜎𝑖√(2𝜋))) 𝑒^(−(𝑥−𝜇𝑖𝑘)²/(2𝜎𝑖²)), the sum becomes
Σ𝑖 [((𝜇𝑖0 − 𝜇𝑖1)/𝜎𝑖²) 𝑋𝑖 + (𝜇𝑖1² − 𝜇𝑖0²)/(2𝜎𝑖²)]
• So the posterior has exactly the logistic form:
𝑃(𝑌 = 1|𝑋1, 𝑋2, ⋯ , 𝑋𝑛) = 1 / (1 + exp(𝜃0 + Σ𝑖 𝜃𝑖𝑋𝑖))
Slide credit: Tom Mitchell
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
Training set with 𝑚 examples:
{(𝑥(1), 𝑦(1)), (𝑥(2), 𝑦(2)), ⋯ , (𝑥(𝑚), 𝑦(𝑚))}
𝑥 = [𝑥0; 𝑥1; ⋯ ; 𝑥𝑛], 𝑥0 = 1, 𝑦 ∈ {0, 1}
ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
How to choose parameters 𝜃?
Slide credit: Andrew Ng
Cost function for Linear Regression
𝐽(𝜃) = (1/2𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖))²
     = (1/𝑚) Σ𝑖=1..𝑚 Cost(ℎ𝜃(𝑥(𝑖)), 𝑦(𝑖))
Cost(ℎ𝜃(𝑥), 𝑦) = (1/2)(ℎ𝜃(𝑥) − 𝑦)²
Slide credit: Andrew Ng
Cost function for Logistic Regression
Cost(ℎ𝜃(𝑥), 𝑦) = −log(ℎ𝜃(𝑥)) if 𝑦 = 1
                 −log(1 − ℎ𝜃(𝑥)) if 𝑦 = 0
[Figure: the cost as a function of ℎ𝜃(𝑥) ∈ (0, 1) for 𝑦 = 1 and 𝑦 = 0; it reaches 0 when the prediction matches 𝑦 and blows up toward the wrong label]
Slide credit: Andrew Ng
Logistic regression cost function
• Cost(ℎ𝜃(𝑥), 𝑦) = −log(ℎ𝜃(𝑥)) if 𝑦 = 1
                   −log(1 − ℎ𝜃(𝑥)) if 𝑦 = 0
• Equivalently, in one line:
Cost(ℎ𝜃(𝑥), 𝑦) = −𝑦 log(ℎ𝜃(𝑥)) − (1 − 𝑦) log(1 − ℎ𝜃(𝑥))
• If 𝑦 = 1: Cost(ℎ𝜃(𝑥), 𝑦) = −log(ℎ𝜃(𝑥))
• If 𝑦 = 0: Cost(ℎ𝜃(𝑥), 𝑦) = −log(1 − ℎ𝜃(𝑥))
Slide credit: Andrew Ng
Logistic regression
𝐽(𝜃) = (1/𝑚) Σ𝑖=1..𝑚 Cost(ℎ𝜃(𝑥(𝑖)), 𝑦(𝑖))
     = −(1/𝑚) Σ𝑖=1..𝑚 [𝑦(𝑖) log(ℎ𝜃(𝑥(𝑖))) + (1 − 𝑦(𝑖)) log(1 − ℎ𝜃(𝑥(𝑖)))]
Prediction: given a new 𝑥, output ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
Learning: fit the parameter 𝜃 by solving min𝜃 𝐽(𝜃)
Slide credit: Andrew Ng
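The cost 𝐽(𝜃) in code; a minimal sketch (not from the slides), with made-up toy data where the first column of 𝑋 is the intercept feature 𝑥0 = 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) sum[y log h + (1 - y) log(1 - h)]."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost(np.zeros(2), X, y))            # log 2 ~ 0.693: h = 0.5 everywhere
print(cost(np.array([-4.0, 2.0]), X, y))  # ~0.18: a better fit costs less
```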
Where does the cost come from?
• Training set with 𝑚 examples: {(𝑥(1), 𝑦(1)), (𝑥(2), 𝑦(2)), ⋯ , (𝑥(𝑚), 𝑦(𝑚))}
• Maximum likelihood estimate for parameter 𝜃
𝜃MLE = argmax𝜃 𝑃𝜃((𝑥(1), 𝑦(1)), ⋯ , (𝑥(𝑚), 𝑦(𝑚))) = argmax𝜃 Π𝑖=1..𝑚 𝑃𝜃(𝑥(𝑖), 𝑦(𝑖))
• Maximum conditional likelihood estimate for parameter 𝜃
Slide credit: Tom Mitchell
• Goal: choose 𝜃 to maximize the conditional likelihood of the training data
• 𝑃𝜃(𝑌 = 1|𝑋 = 𝑥) = ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
• 𝑃𝜃(𝑌 = 0|𝑋 = 𝑥) = 1 − ℎ𝜃(𝑥) = 𝑒^(−𝜃⊤𝑥) / (1 + 𝑒^(−𝜃⊤𝑥))
• Training data 𝐷 = {(𝑥(1), 𝑦(1)), (𝑥(2), 𝑦(2)), ⋯ , (𝑥(𝑚), 𝑦(𝑚))}
• Data likelihood = Π𝑖=1..𝑚 𝑃𝜃(𝑥(𝑖), 𝑦(𝑖))
• Data conditional likelihood = Π𝑖=1..𝑚 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖))
𝜃MCLE = argmax𝜃 Π𝑖=1..𝑚 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖))
Slide credit: Tom Mitchell
Expressing conditional log-likelihood
𝐿(𝜃) = log Π𝑖=1..𝑚 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖)) = Σ𝑖=1..𝑚 log 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖))
     = Σ𝑖=1..𝑚 [𝑦(𝑖) log 𝑃𝜃(𝑌 = 1|𝑥(𝑖)) + (1 − 𝑦(𝑖)) log 𝑃𝜃(𝑌 = 0|𝑥(𝑖))]
     = Σ𝑖=1..𝑚 [𝑦(𝑖) log(ℎ𝜃(𝑥(𝑖))) + (1 − 𝑦(𝑖)) log(1 − ℎ𝜃(𝑥(𝑖)))]
Maximizing 𝐿(𝜃) is exactly minimizing the cost from before:
Cost(ℎ𝜃(𝑥), 𝑦) = −log(ℎ𝜃(𝑥)) if 𝑦 = 1
                 −log(1 − ℎ𝜃(𝑥)) if 𝑦 = 0
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
Gradient descent
𝐽(𝜃) = −(1/𝑚) Σ𝑖=1..𝑚 [𝑦(𝑖) log(ℎ𝜃(𝑥(𝑖))) + (1 − 𝑦(𝑖)) log(1 − ℎ𝜃(𝑥(𝑖)))]
Goal: min𝜃 𝐽(𝜃)
Repeat {
  𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (∂/∂𝜃𝑗) 𝐽(𝜃)
} (simultaneously update all 𝜃𝑗)
(∂/∂𝜃𝑗) 𝐽(𝜃) = (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
Good news: 𝐽(𝜃) is a convex function!
Bad news: no analytical (closed-form) solution
Slide credit: Andrew Ng
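The full loop in Python; a minimal sketch (not from the slides; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # (1/m) sum (h - y) x_j
        theta -= alpha * grad                      # simultaneous update
    return theta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta).round(2))  # probabilities track the 0/1 labels
```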
Gradient descent
𝐽(𝜃) = −(1/𝑚) Σ𝑖=1..𝑚 [𝑦(𝑖) log(ℎ𝜃(𝑥(𝑖))) + (1 − 𝑦(𝑖)) log(1 − ℎ𝜃(𝑥(𝑖)))]
Goal: min𝜃 𝐽(𝜃)
Repeat {
  𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
} (simultaneously update all 𝜃𝑗)
Slide credit: Andrew Ng
Gradient descent for Linear Regression
Repeat {
  𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
} with ℎ𝜃(𝑥) = 𝜃⊤𝑥
Gradient descent for Logistic Regression
Repeat {
  𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
} with ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
The update rule looks identical; only the hypothesis ℎ𝜃 differs.
Slide credit: Andrew Ng
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
How about MAP?
• Maximum conditional likelihood estimate (MCLE)
𝜃MCLE = argmax𝜃 Π𝑖=1..𝑚 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖))
• Maximum conditional a posteriori estimate (MCAP)
𝜃MCAP = argmax𝜃 [Π𝑖=1..𝑚 𝑃𝜃(𝑦(𝑖)|𝑥(𝑖))] 𝑃(𝜃)
Prior 𝑃(𝜃)
• Common choice of 𝑃(𝜃):
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters toward zero
• Corresponds to regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE)
𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
• Maximum conditional a posteriori estimate (MCAP)
𝜃𝑗 ≔ 𝜃𝑗 − 𝛼𝜆𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
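In code, the MCAP update is the MCLE update plus the shrinkage term −𝛼𝜆𝜃𝑗; a minimal sketch (not from the slides; 𝜆 = 0.1 is an arbitrary choice, and in practice the intercept 𝜃0 is often left unregularized, unlike here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha=0.1, lam=0.0, iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta = theta - alpha * lam * theta - alpha * grad  # MCAP step
    return theta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit(X, y))           # lam = 0: weights keep growing (separable data)
print(fit(X, y, lam=0.1))  # lam > 0: weights stay small
```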
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
Binary classification
[Figure: two classes in the (𝑥1, 𝑥2) plane]
Multiclass classification
[Figure: three classes in the (𝑥1, 𝑥2) plane]
One-vs-all (one-vs-rest)
ℎ𝜃(𝑖)(𝑥) = 𝑃(𝑦 = 𝑖|𝑥; 𝜃)  (𝑖 = 1, 2, 3)
Turn the problem into three binary problems, treating Class 1, Class 2, and Class 3 in turn as the positive class.
[Figure: the three-class data in the (𝑥1, 𝑥2) plane, and the three binary problems with the decision boundaries of ℎ𝜃(1)(𝑥), ℎ𝜃(2)(𝑥), ℎ𝜃(3)(𝑥)]
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier ℎ𝜃(𝑖)(𝑥) for each class 𝑖 to predict the probability that 𝑦 = 𝑖
• Given a new input 𝑥, pick the class 𝑖 that maximizes ℎ𝜃(𝑖)(𝑥):
𝑦 ← argmax𝑖 ℎ𝜃(𝑖)(𝑥)
Slide credit: Andrew Ng
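A one-vs-all sketch on top of a binary trainer (not from the slides; the function names and the 1-D toy data are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def one_vs_all_fit(X, y):
    # One binary classifier per class: class i vs. the rest.
    return {i: train_binary(X, (y == i).astype(float)) for i in np.unique(y)}

def one_vs_all_predict(x, thetas):
    # Pick the class whose classifier assigns the highest probability.
    return max(thetas, key=lambda i: sigmoid(thetas[i] @ x))

X = np.array([[1, 0.0], [1, 1.0], [1, 4.0], [1, 5.0], [1, 9.0], [1, 10.0]])
y = np.array([0, 0, 1, 1, 2, 2])
thetas = one_vs_all_fit(X, y)
print([one_vs_all_predict(x, thetas) for x in X])  # [0, 0, 1, 1, 2, 2]
```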
Generative Approach
Ex: Naïve Bayes
Estimate 𝑃(𝑌) and 𝑃(𝑋|𝑌)
Prediction: 𝑦 = argmax𝑦 𝑃(𝑌 = 𝑦)𝑃(𝑋 = 𝑥|𝑌 = 𝑦)
Discriminative Approach
Ex: Logistic regression
Estimate 𝑃(𝑌|𝑋) directly (or a discriminant function, e.g., SVM)
Prediction: 𝑦 = argmax𝑦 𝑃(𝑌 = 𝑦|𝑋 = 𝑥)
Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Things to remember
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
ℎ𝜃(𝑥) = 1 / (1 + 𝑒^(−𝜃⊤𝑥))
Cost(ℎ𝜃(𝑥), 𝑦) = −log(ℎ𝜃(𝑥)) if 𝑦 = 1
                 −log(1 − ℎ𝜃(𝑥)) if 𝑦 = 0
𝜃𝑗 ≔ 𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
𝜃𝑗 ≔ 𝜃𝑗 − 𝛼𝜆𝜃𝑗 − 𝛼 (1/𝑚) Σ𝑖=1..𝑚 (ℎ𝜃(𝑥(𝑖)) − 𝑦(𝑖)) 𝑥𝑗(𝑖)
𝑦 ← argmax𝑖 ℎ𝜃(𝑖)(𝑥)
Coming up…
• Regularization
• Support Vector Machine