3. Two principles for estimating parameters
•Maximum Likelihood Estimate (MLE)
Choose 𝜃 that maximizes probability of observed data
θ_MLE = argmax_θ P(Data | θ)
•Maximum a posteriori estimation (MAP)
Choose 𝜃 that is most probable given prior probability and
data
θ_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) P(θ) / P(Data)
Slide credit: Tom Mitchell
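As a concrete illustration (not from the slides), here is a minimal Python sketch of both principles for a coin-flip (Bernoulli) parameter; the Beta(2, 2) prior is an arbitrary choice for the example.

```python
def mle_theta(heads, tails):
    # theta_MLE = argmax_theta P(Data | theta) = heads / (heads + tails)
    return heads / (heads + tails)

def map_theta(heads, tails, a=2, b=2):
    # theta_MAP = argmax_theta P(Data | theta) P(theta), with an assumed Beta(a, b) prior;
    # the prior behaves like (a - 1) imaginary heads and (b - 1) imaginary tails.
    return (heads + a - 1) / (heads + tails + a + b - 2)

print(mle_theta(3, 0))   # 1.0 -- the MLE can be extreme when data is scarce
print(map_theta(3, 0))   # 0.8 -- the prior pulls the estimate toward 0.5
```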
9. • F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the superbowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month
P(F = 1) =              P(F = 0) =
P(S = 1 | F = 1) =      P(S = 0 | F = 1) =
P(S = 1 | F = 0) =      P(S = 0 | F = 0) =
P(D = 1 | F = 1) =      P(D = 0 | F = 1) =
P(D = 1 | F = 0) =      P(D = 0 | F = 0) =
P(G = 1 | F = 1) =      P(G = 0 | F = 1) =
P(G = 1 | F = 0) =      P(G = 0 | F = 0) =
P(F | S, D, G) ∝ P(F) P(S | F) P(D | F) P(G | F)
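Once the table above is filled in, the posterior over F comes from normalizing the product on the last line. A minimal Python sketch, with made-up probability values standing in for the blank entries:

```python
# Hypothetical parameter values (the slide leaves the estimates blank):
p_f1 = 0.2                  # P(F = 1)
p_s1 = {1: 0.5, 0: 0.3}     # P(S = 1 | F = f)
p_d1 = {1: 0.1, 0: 0.7}     # P(D = 1 | F = f)
p_g1 = {1: 0.6, 0: 0.4}     # P(G = 1 | F = f)

def posterior_f1(s, d, g):
    """P(F = 1 | S = s, D = d, G = g) via the Naive Bayes factorization."""
    def joint(f):
        pf = p_f1 if f == 1 else 1 - p_f1
        ps = p_s1[f] if s == 1 else 1 - p_s1[f]
        pd = p_d1[f] if d == 1 else 1 - p_d1[f]
        pg = p_g1[f] if g == 1 else 1 - p_g1[f]
        return pf * ps * pd * pg             # P(F=f) P(S|F=f) P(D|F=f) P(G|F=f)
    return joint(1) / (joint(1) + joint(0))  # normalize over F = 0 and F = 1

print(posterior_f1(s=1, d=0, g=1))
```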
10. Naïve Bayes: Subtlety #1
• Often the 𝑋𝑖 are not really conditionally independent
• Naïve Bayes often works pretty well anyway
• Often the right classification, even when not the right probability
[Domingos & Pazzani, 1996]
• What is the effect on estimated P(Y|X)?
• What if we have two copies: X_i = X_k?
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
Slide credit: Tom Mitchell
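To see the effect of a duplicated feature on the estimated P(Y | X), here is a small numerical sketch (the prior and likelihood values are invented): the copied factor is multiplied in twice, so the posterior becomes over-confident even though the predicted class need not change.

```python
import math

def posterior_y1(prior_y1, likelihoods_y1, likelihoods_y0):
    # likelihoods_y1 / likelihoods_y0: per-feature values P(X_i = x_i | Y = 1) and P(X_i = x_i | Y = 0)
    j1 = prior_y1 * math.prod(likelihoods_y1)
    j0 = (1 - prior_y1) * math.prod(likelihoods_y0)
    return j1 / (j1 + j0)

print(posterior_y1(0.5, [0.8], [0.4]))            # one informative feature: ~0.667
print(posterior_y1(0.5, [0.8, 0.8], [0.4, 0.4]))  # the same feature counted twice: 0.800
```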
11. Naïve Bayes: Subtlety #2
MLE estimate for P(X_i | Y = y_k) might be zero
(for example, X_i = birthdate, and no training example has X_i = Feb_4_1995)
• Why worry about just one parameter out of many?
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
• What can we do to address this?
• MAP estimates (adding “imaginary” examples)
Slide credit: Tom Mitchell
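A minimal sketch of the MAP fix, using Laplace ("add-one") smoothing as the imaginary examples; the counts and number of feature values below are hypothetical:

```python
def smoothed_estimate(count_xy, count_y, num_values, alpha=1):
    # count_xy:   #{training examples with X_i = x and Y = y_k}
    # count_y:    #{training examples with Y = y_k}
    # num_values: number of distinct values X_i can take
    # alpha:      imaginary examples per value (alpha = 1 is Laplace smoothing)
    return (count_xy + alpha) / (count_y + alpha * num_values)

# A birthdate value never seen with this class still gets a small nonzero probability:
print(smoothed_estimate(count_xy=0, count_y=50, num_values=365))  # ~0.0024
```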
13. What if we have continuous Xi
• Gaussian Naïve Bayes (GNB): assume
P(X_i = x | Y = y_k) = 1 / (σ_ik √(2π)) · exp(−(x − μ_ik)² / (2σ_ik²))
• Additional assumption on 𝜎𝑖𝑘:
• Is independent of 𝑌 (𝜎𝑖)
• Is independent of 𝑋𝑖 (𝜎𝑘)
• Is independent of 𝑋i and 𝑌 (𝜎)
Slide credit: Tom Mitchell
14. Naïve Bayes algorithm – continuous Xi
• For each value yk
Estimate 𝜋𝑘 = 𝑃(𝑌 = 𝑦𝑘)
For each attribute Xi estimate
Class conditional mean 𝜇𝑖𝑘, variance 𝜎𝑖𝑘
• Classify Xtest
Y ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^test | Y = y_k)
Y ← argmax_{y_k} π_k Π_i Normal(X_i^test; μ_ik, σ_ik)
Slide credit: Tom Mitchell
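A Python sketch of the algorithm above; the data layout (a list of (feature_vector, label) pairs) and helper names are assumptions, not part of the slides:

```python
import math
from collections import defaultdict

def train_gnb(data):
    """data: list of (feature_vector, label) pairs. Returns per-class (pi_k, mu, var)."""
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append(x)
    params = {}
    for y, rows in by_class.items():
        n, d = len(rows), len(rows[0])
        mu = [sum(r[i] for r in rows) / n for i in range(d)]
        # small floor on the variance avoids division by zero for constant features
        var = [max(sum((r[i] - mu[i]) ** 2 for r in rows) / n, 1e-9) for i in range(d)]
        params[y] = (n / len(data), mu, var)   # (pi_k, mu_ik, sigma_ik^2)
    return params

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify_gnb(params, x_test):
    # Y <- argmax_k  pi_k * prod_i Normal(x_i^test; mu_ik, sigma_ik)
    def score(y):
        pi_k, mu, var = params[y]
        return pi_k * math.prod(normal_pdf(x_test[i], mu[i], var[i]) for i in range(len(x_test)))
    return max(params, key=score)
```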
15. Things to remember
• Probability basics
• Conditional probability, joint probability, Bayes rule
• Estimating parameters from data
• Maximum likelihood (ML): maximize P(Data | θ)
• Maximum a posteriori estimation (MAP): maximize P(θ | Data)
• Naive Bayes
P(Y = y_k | X_1, ⋯, X_n) ∝ P(Y = y_k) Π_i P(X_i | Y = y_k)
16. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
17. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
18. • Threshold the classifier output h_θ(x) at 0.5:
• If h_θ(x) ≥ 0.5, predict “y = 1”
• If h_θ(x) < 0.5, predict “y = 0”
[Figure: tumor size (x-axis) vs. malignant, 0 (No) / 1 (Yes), with the linear hypothesis h_θ(x) = θ⊤x]
Slide credit: Andrew Ng
19. Classification: 𝑦 = 1 or 𝑦 = 0
h_θ(x) = θ⊤x (from linear regression) can be > 1 or < 0
Logistic regression: 0 ≤ h_θ(x) ≤ 1
Logistic regression is actually for classification
Slide credit: Andrew Ng
20. Hypothesis representation
• Want 0 ≤ h_θ(x) ≤ 1
• h_θ(x) = g(θ⊤x), where g(z) = 1 / (1 + e^(−z))
• g is called the sigmoid function or logistic function, so
h_θ(x) = 1 / (1 + e^(−θ⊤x))
[Figure: plot of the sigmoid g(z) against z]
Slide credit: Andrew Ng
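A tiny Python sketch of the hypothesis, assuming x carries a leading 1 for the intercept term:

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # h_theta(x) = g(theta^T x); x[0] is assumed to be 1 (the intercept term)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0.0))              # 0.5: theta^T x = 0 is the decision boundary
print(h([-1.0, 0.5], [1, 4.0]))  # ~0.73 with hypothetical parameters theta = (-1, 0.5)
```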
21. Interpretation of hypothesis output
• h_θ(x) = estimated probability that y = 1 on input x
• Example: if x = [x_0, x_1]⊤ = [1, tumorSize]⊤ and h_θ(x) = 0.7,
tell the patient that there is a 70% chance of the tumor being malignant
Slide credit: Andrew Ng
25. Where does the form come from?
• Logistic regression hypothesis representation
h_θ(x) = 1 / (1 + e^(−θ⊤x)) = 1 / (1 + e^(−(θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n)))
• Consider learning f: X → Y, where
• X is a vector of real-valued features ⟨X_1, ⋯, X_n⟩⊤
• Y is Boolean
• Assume all X_i are conditionally independent given Y
• Model P(X_i | Y = y_k) as Gaussian N(μ_ik, σ_i)
• Model P(Y) as Bernoulli(π)
What is P(Y | X_1, X_2, ⋯, X_n)?
Slide credit: Tom Mitchell
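A compressed sketch of the standard derivation (worked through in full in Mitchell's notes listed under Further readings): apply Bayes rule with the Naive Bayes factorization, then divide numerator and denominator by the numerator.

```latex
% Bayes rule, then divide through by P(Y=1) \prod_i P(X_i | Y=1):
P(Y=1 \mid X_1,\dots,X_n)
  = \frac{P(Y=1)\,\prod_i P(X_i \mid Y=1)}
         {P(Y=1)\,\prod_i P(X_i \mid Y=1) + P(Y=0)\,\prod_i P(X_i \mid Y=0)}
  = \frac{1}{1 + \exp\!\Big(\ln\tfrac{1-\pi}{\pi}
         + \sum_i \ln\tfrac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}

% With Gaussian N(\mu_{ik}, \sigma_i) class conditionals (variance shared across classes),
% each log-ratio is linear in X_i:
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}}\,X_i
  + \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}}
```

Collecting terms gives P(Y = 1 | X) = 1 / (1 + exp(w_0 + Σ_i w_i X_i)), which is exactly the logistic form h_θ(x) = 1 / (1 + e^(−θ⊤x)) with weights determined by π, μ_ik, and σ_i.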
41. Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
42. How about MAP?
• Maximum conditional likelihood estimate (MCLE)
• Maximum conditional a posterior estimate (MCAP)
θ_MCLE = argmax_θ Π_{i=1}^{m} P_θ(y^(i) | x^(i))
θ_MCAP = argmax_θ [Π_{i=1}^{m} P_θ(y^(i) | x^(i))] P(θ)
43. Prior 𝑃(𝜃)
• Common choice of 𝑃(𝜃):
• Normal distribution, zero mean, identity covariance
• “Pushes” parameters toward zero
• Corresponds to L2 regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
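A Python sketch of MCAP training as gradient descent with the corresponding L2 penalty; the learning rate, λ, and iteration count are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg_map(xs, ys, lam=0.1, lr=0.1, iters=1000):
    """xs: feature vectors with a leading 1 for the intercept; ys: 0/1 labels."""
    n, d = len(xs), len(xs[0])
    theta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        for j in range(d):
            # the zero-mean Gaussian prior contributes an L2 penalty term lam * theta_j
            # (the intercept theta_0 is conventionally left unregularized)
            penalty = lam * theta[j] if j > 0 else 0.0
            theta[j] -= lr * (grad[j] + penalty)
    return theta
```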
49. One-vs-all
•Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i
•Given a new input x, pick the class i with the highest h_θ^(i)(x):
y ← argmax_i h_θ^(i)(x)
Slide credit: Andrew Ng
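A small Python sketch of one-vs-all prediction; the trained parameter vectors below are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    # thetas: dict mapping class label i -> parameter vector theta^(i)
    # x: feature vector with a leading 1 for the intercept term
    scores = {i: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
              for i, theta in thetas.items()}
    return max(scores, key=scores.get)

# Hypothetical trained parameters for three classes:
thetas = {0: [0.2, -1.0], 1: [-0.5, 0.8], 2: [0.1, 0.1]}
print(predict_one_vs_all(thetas, [1, 2.0]))  # class 1 has the largest h_theta^(i)(x)
```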
51. Further readings
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic
Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan
On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf