3. Two principles for estimating parameters
•Maximum Likelihood Estimate (MLE)
Choose 𝜃 that maximizes probability of observed data
θ_MLE = argmax_θ P(Data | θ)
•Maximum a posteriori estimation (MAP)
Choose 𝜃 that is most probable given prior probability and
data
θ_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) P(θ) / P(Data)
Slide credit: Tom Mitchell
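As a concrete illustration (not from the slides), here is a minimal Python sketch of both principles for a coin-flip (Bernoulli) parameter; the Beta(2, 2) prior is an arbitrary choice for the example.

```python
def mle_theta(heads, tails):
    # theta_MLE = argmax_theta P(Data | theta) = heads / (heads + tails)
    return heads / (heads + tails)

def map_theta(heads, tails, a=2, b=2):
    # theta_MAP = argmax_theta P(Data | theta) P(theta), with an assumed Beta(a, b) prior;
    # the prior behaves like (a - 1) imaginary heads and (b - 1) imaginary tails.
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(mle_theta(3, 0))   # 1.0 -- the MLE can be extreme when data is scarce
print(map_theta(3, 0))   # 0.8 -- the prior pulls the estimate toward 0.5
```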
9. • F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the superbowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month
P(F = 1) =              P(F = 0) =
P(S = 1 | F = 1) =      P(S = 0 | F = 1) =
P(S = 1 | F = 0) =      P(S = 0 | F = 0) =
P(D = 1 | F = 1) =      P(D = 0 | F = 1) =
P(D = 1 | F = 0) =      P(D = 0 | F = 0) =
P(G = 1 | F = 1) =      P(G = 0 | F = 1) =
P(G = 1 | F = 0) =      P(G = 0 | F = 0) =
P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
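Once the table above is filled in, the posterior over F comes from normalizing the product on the last line. A minimal Python sketch, with made-up probability values standing in for the blank entries:

```python
# Hypothetical parameter values (the slide leaves the estimates blank):
p_f1 = 0.2                  # P(F = 1)
p_s1 = {1: 0.5, 0: 0.3}     # P(S = 1 | F = f)
p_d1 = {1: 0.1, 0: 0.7}     # P(D = 1 | F = f)
p_g1 = {1: 0.6, 0: 0.4}     # P(G = 1 | F = f)

def posterior_f1(s, d, g):
    """P(F = 1 | S = s, D = d, G = g) via the Naive Bayes factorization."""
    def joint(f):
        pf = p_f1 if f == 1 else 1 - p_f1
        ps = p_s1[f] if s == 1 else 1 - p_s1[f]
        pd = p_d1[f] if d == 1 else 1 - p_d1[f]
        pg = p_g1[f] if g == 1 else 1 - p_g1[f]
        return pf * ps * pd * pg             # P(F=f) P(S|F=f) P(D|F=f) P(G|F=f)
    return joint(1) / (joint(1) + joint(0))  # normalize over F = 0 and F = 1

print(posterior_f1(s=1, d=0, g=1))
```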
10. Naïve Bayes: Subtlety #1
• Often the 𝑋𝑖 are not really conditionally independent
• Naïve Bayes often works pretty well anyway
• Often the right classification, even when not the right probability
[Domingos & Pazzani, 1996]
• What is the effect on estimated P(Y|X)?
• What if we have two copies: X_i = X_k?
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
Slide credit: Tom Mitchell
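To see the effect of a duplicated feature on the estimated P(Y | X), here is a small numerical sketch (the prior and likelihood values are invented): the copied factor is multiplied in twice, so the posterior becomes over-confident even though the predicted class need not change.

```python
import math

def posterior_y1(prior_y1, likelihoods_y1, likelihoods_y0):
    # likelihoods_y1 / likelihoods_y0: per-feature values P(X_i = x_i | Y = 1) and P(X_i = x_i | Y = 0)
    j1 = prior_y1 * math.prod(likelihoods_y1)
    j0 = (1 - prior_y1) * math.prod(likelihoods_y0)
    return j1 / (j1 + j0)

print(posterior_y1(0.5, [0.8], [0.4]))            # one informative feature: ~0.667
print(posterior_y1(0.5, [0.8, 0.8], [0.4, 0.4]))  # the same feature counted twice: 0.800
```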
11. Naïve Bayes: Subtlety #2
MLE estimate for P(X_i | Y = y_k) might be zero
(for example, X_i = birthdate, and no training example has X_i = Feb_4_1995)
• Why worry about just one parameter out of many?
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
• What can we do to address this?
• MAP estimates (adding “imaginary” examples)
Slide credit: Tom Mitchell
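A minimal sketch of the MAP fix, using Laplace ("add-one") smoothing as the imaginary examples; the counts and number of feature values below are hypothetical:

```python
def smoothed_estimate(count_xy, count_y, num_values, alpha=1):
    # count_xy:   #{training examples with X_i = x and Y = y_k}
    # count_y:    #{training examples with Y = y_k}
    # num_values: number of distinct values X_i can take
    # alpha:      imaginary examples per value (alpha = 1 is Laplace smoothing)
    return (count_xy + alpha) / (count_y + alpha * num_values)

# A birthdate value never seen with this class still gets a small nonzero probability:
print(smoothed_estimate(count_xy=0, count_y=50, num_values=365))  # ~0.0024
```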
13. What if we have continuous Xi
• Gaussian Naïve Bayes (GNB): assume
P(X_i = x | Y = y_k) = 1 / (σ_ik √(2π)) · exp(−(x − μ_ik)² / (2σ_ik²))
• Additional assumption on 𝜎𝑖𝑘:
• Is independent of 𝑌 (𝜎𝑖)
• Is independent of 𝑋𝑖 (𝜎𝑘)
• Is independent of 𝑋i and 𝑌 (𝜎)
Slide credit: Tom Mitchell
14. Naïve Bayes algorithm – continuous Xi
• For each value yk
Estimate 𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘)
For each attribute Xi estimate
Class conditional mean 𝜇𝑖𝑘, variance 𝜎𝑖𝑘
• Classify Xtest
Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
Y ← argmax_{y_k} π_k Π_i Normal(X_i^test; μ_ik, σ_ik)
Slide credit: Tom Mitchell
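A Python sketch of the algorithm above; the data layout (a list of (feature_vector, label) pairs) and helper names are assumptions, not part of the slides:

```python
import math
from collections import defaultdict

def train_gnb(data):
    """data: list of (feature_vector, label) pairs. Returns per-class (pi_k, mu, var)."""
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append(x)
    params = {}
    for y, rows in by_class.items():
        n, d = len(rows), len(rows[0])
        mu = [sum(r[i] for r in rows) / n for i in range(d)]
        # small floor on the variance avoids division by zero for constant features
        var = [max(sum((r[i] - mu[i]) ** 2 for r in rows) / n, 1e-9) for i in range(d)]
        params[y] = (n / len(data), mu, var)   # (pi_k, mu_ik, sigma_ik^2)
    return params

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify_gnb(params, x_test):
    # Y <- argmax_k  pi_k * prod_i Normal(x_i^test; mu_ik, sigma_ik)
    def score(y):
        pi_k, mu, var = params[y]
        return pi_k * math.prod(normal_pdf(x_test[i], mu[i], var[i]) for i in range(len(x_test)))
    return max(params, key=score)
```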
15. Things to remember
• Probability basics
• Conditional probability, joint probability, Bayes rule
• Estimating parameters from data
• Maximum likelihood (ML): maximize P(Data | θ)
• Maximum a posteriori estimation (MAP): maximize P(θ | Data)
• Naive Bayes
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
16. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
17. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
18. • Threshold the classifier output h_θ(x) at 0.5:
• If h_θ(x) ≥ 0.5, predict “y = 1”
• If h_θ(x) < 0.5, predict “y = 0”
[Figure: tumor size (x-axis) vs. malignant, 0 (No) / 1 (Yes), with the linear hypothesis h_θ(x) = θ⊤x]
Slide credit: Andrew Ng
19. Classification: 𝑦 = 1 or 𝑦 = 0
h_θ(x) = θ⊤x (from linear regression) can be > 1 or < 0
Logistic regression: 0 ≤ h_θ(x) ≤ 1
Logistic regression is actually for classification
Slide credit: Andrew Ng
20. Hypothesis representation
• Want 0 ≤ h_θ(x) ≤ 1
• h_θ(x) = g(θ⊤x), where g(z) = 1 / (1 + e^(−z))
• g is called the sigmoid function or logistic function, so
h_θ(x) = 1 / (1 + e^(−θ⊤x))
[Figure: plot of the sigmoid g(z) against z]
Slide credit: Andrew Ng
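A tiny Python sketch of the hypothesis, assuming x carries a leading 1 for the intercept term:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # h_theta(x) = g(theta^T x); x[0] is assumed to be 1 (the intercept term)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0.0))              # 0.5: theta^T x = 0 is the decision boundary
print(h([-1.0, 0.5], [1, 4.0]))  # ~0.73 with hypothetical parameters theta = (-1, 0.5)
```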
21. Interpretation of hypothesis output
• h_θ(x) = estimated probability that y = 1 on input x
• Example: if x = [x_0, x_1]⊤ = [1, tumorSize]⊤ and h_θ(x) = 0.7,
tell the patient that there is a 70% chance of the tumor being malignant
Slide credit: Andrew Ng
25. Where does the form come from?
• Logistic regression hypothesis representation
h_θ(x) = 1 / (1 + e^(−θ⊤x)) = 1 / (1 + e^(−(θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n)))
• Consider learning f: X → Y, where
• X is a vector of real-valued features ⟨X_1, ⋯, X_n⟩⊤
• Y is Boolean
• Assume all X_i are conditionally independent given Y
• Model P(X_i | Y = y_k) as Gaussian N(μ_ik, σ_i)
• Model P(Y) as Bernoulli(π)
What is P(Y | X_1, X_2, ⋯, X_n)?
Slide credit: Tom Mitchell
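A compressed sketch of the standard derivation (worked through in full in Mitchell's notes listed under Further readings): apply Bayes rule with the Naive Bayes factorization, then divide numerator and denominator by the numerator.

```latex
% Bayes rule, then divide through by P(Y=1) \prod_i P(X_i | Y=1):
P(Y=1 \mid X_1,\dots,X_n)
  = \frac{P(Y=1)\,\prod_i P(X_i \mid Y=1)}
         {P(Y=1)\,\prod_i P(X_i \mid Y=1) + P(Y=0)\,\prod_i P(X_i \mid Y=0)}
  = \frac{1}{1 + \exp\!\Big(\ln\tfrac{1-\pi}{\pi}
         + \sum_i \ln\tfrac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}

% With Gaussian N(\mu_{ik}, \sigma_i) class conditionals (variance shared across classes),
% each log-ratio is linear in X_i:
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}}\,X_i
  + \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}}
```

Collecting terms gives P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)), which is exactly the logistic form h_θ(x) = 1 / (1 + e^(−θ⊤x)) with weights determined by π, μ_ik, and σ_i.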
41. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
42. How about MAP?
• Maximum conditional likelihood estimate (MCLE)
• Maximum conditional a posterior estimate (MCAP)
θ_MCLE = argmax_θ Π_{i=1}^{m} P_θ(y^(i) | x^(i))
θ_MCAP = argmax_θ [Π_{i=1}^{m} P_θ(y^(i) | x^(i))] P(θ)
43. Prior 𝑃(𝜃)
• Common choice of 𝑃(𝜃):
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters toward zero
• Corresponds to L2 regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
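A Python sketch of MCAP training as gradient descent with the corresponding L2 penalty; the learning rate, λ, and iteration count are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg_map(xs, ys, lam=0.1, lr=0.1, iters=1000):
    """xs: feature vectors with a leading 1 for the intercept; ys: 0/1 labels."""
    n, d = len(xs), len(xs[0])
    theta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        for j in range(d):
            # the zero-mean Gaussian prior contributes an L2 penalty term lam * theta_j
            # (the intercept theta_0 is conventionally left unregularized)
            penalty = lam * theta[j] if j > 0 else 0.0
            theta[j] -= lr * (grad[j] + penalty)
    return theta
```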
49. One-vs-all
•Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i
•Given a new input x, pick the class i with the highest h_θ^(i)(x):
y ← argmax_i h_θ^(i)(x)
Slide credit: Andrew Ng
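A small Python sketch of one-vs-all prediction; the trained parameter vectors below are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas: dict mapping class label i -> parameter vector theta^(i)
    # x: feature vector with a leading 1 for the intercept term
    scores = {i: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
              for i, theta in thetas.items()}
    return max(scores, key=scores.get)

# Hypothetical trained parameters for three classes:
thetas = {0: [0.2, -1.0], 1: [-0.5, 0.8], 2: [0.1, 0.1]}
print(predict_one_vs_all(thetas, [1, 2.0]))  # class 1 has the largest h_theta^(i)(x)
```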
51. Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic
Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf