Machine Learning Specialization
Linear classifiers: Overfitting & regularization
Emily Fox & Carlos Guestrin
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin
Training and evaluating a classifier
Training a classifier = Learning the coefficients

[Diagram: labeled data (x,y), e.g., (Sentence1, ), (Sentence2, ), …, is split into a training set and a validation set; the training set is used to learn the classifier, and the validation set to evaluate it.]

Word       Coefficient
good           1.0
awesome        1.7
bad           -1.0
awful         -3.3
…              …
Classification error: test example

[Diagram: take a test example, e.g., (Sushi was great, ); hide the label and feed "Sushi was great" to the learned classifier to get ŷ. If ŷ matches the hidden label, count a correct prediction; otherwise count a mistake. In the example, "Sushi was great" is classified correctly, and "Food was OK" produces a mistake.]
Classification error & accuracy
•  Error measures the fraction of mistakes
- Best possible value is 0.0
•  Often, we instead measure accuracy
- Fraction of correct predictions
- Best possible value is 1.0

error = (# mistakes) / (total # of data points)
accuracy = (# correct predictions) / (total # of data points)
(so error = 1 - accuracy)
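As a concrete sketch (not from the slides), error and accuracy can be computed from predicted and true labels in a few lines of NumPy; the labels below are made up:

import numpy as np

y_true = np.array([+1, -1, +1, +1, -1])   # hypothetical true labels
y_pred = np.array([+1, -1, -1, +1, +1])   # hypothetical predictions

error = np.mean(y_pred != y_true)      # fraction of mistakes (best: 0.0)
accuracy = np.mean(y_pred == y_true)   # fraction correct (best: 1.0)
print(error, accuracy)                 # 0.4 0.6; note error + accuracy = 1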
Overfitting in regression: review
Flexibility of high-order polynomials

yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi

[Figure: two fits fŵ of price ($) vs. square feet (sq.ft.); a low-order fit follows the trend of the data, while a high-order polynomial wiggles through every point: OVERFIT.]
Overfitting in regression

[Figure: error (residual sum of squares, RSS) vs. model complexity; training error keeps decreasing with complexity while true error eventually increases. Insets: a simple fit of y vs. x and a wiggly, overfit one.]

Overfitting if there exists w* such that:
•  training_error(w*) > training_error(ŵ)
•  true_error(w*) < true_error(ŵ)
Overfitting in classification
Decision boundary example

Word        Coefficient
#awesome        1.0
#awful         -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

[Figure: plane with axes x[1] = #awesome and x[2] = #awful; the line Score(x) = 0 separates the region Score(x) > 0 from the region Score(x) < 0.]
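A minimal sketch of this score and decision rule (function names are mine; the coefficients come from the table above):

def score(n_awesome, n_awful):
    # Score(x) = 1.0 #awesome - 1.5 #awful
    return 1.0 * n_awesome - 1.5 * n_awful

def classify(n_awesome, n_awful):
    # predict +1 when Score(x) > 0, else -1
    return +1 if score(n_awesome, n_awful) > 0 else -1

print(classify(3, 1))  # score = 1.5 > 0  -> +1
print(classify(1, 2))  # score = -2.0 < 0 -> -1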
Learned decision boundary

Feature    Value    Coefficient learned
h0(x)      1
h1(x)      x[1]
h2(x)      x[2]
Quadratic features (in 2d)

Note: we are not including cross terms for simplicity

Feature    Value       Coefficient learned
h0(x)      1
h1(x)      x[1]
h2(x)      x[2]
h3(x)      (x[1])^2
h4(x)      (x[2])^2
Degree 6 features (in 2d)

Note: we are not including cross terms for simplicity

Feature    Value       Coefficient learned
h0(x)      1              21.6
h1(x)      x[1]            5.3
h2(x)      x[2]          -42.7
h3(x)      (x[1])^2      -15.9
h4(x)      (x[2])^2      -48.6
h5(x)      (x[1])^3      -11.0
h6(x)      (x[2])^3       67.0
h7(x)      (x[1])^4        1.5
h8(x)      (x[2])^4       48.0
h9(x)      (x[1])^5        4.4
h10(x)     (x[2])^5      -14.2
h11(x)     (x[1])^6        0.8
h12(x)     (x[2])^6       -8.6

[Figure: the learned degree-6 decision boundary; an island of Score(x) < 0 appears.] Coefficient values are getting large.
Degree 20 features (in 2d)

Note: we are not including cross terms for simplicity

Feature    Value        Coefficient learned
h0(x)      1               8.7
h1(x)      x[1]            5.1
h2(x)      x[2]           78.7
…          …                …
h11(x)     (x[1])^6       -7.5
h12(x)     (x[2])^6     3803
h13(x)     (x[1])^7       21.1
h14(x)     (x[2])^7    -2406
…          …                …
h37(x)     (x[1])^19      -2*10^-6
h38(x)     (x[2])^19      -0.15
h39(x)     (x[1])^20      -2*10^-8
h40(x)     (x[2])^20       0.03

Often, overfitting is associated with very large estimated coefficients ŵ.
Overfitting in classification

[Figure: classification error vs. model complexity; training error decreases while true error eventually increases.]

Overfitting if there exists w* such that:
•  training_error(w*) > training_error(ŵ)
•  true_error(w*) < true_error(ŵ)
Overfitting in classifiers ⇒ overconfident predictions
Logistic regression model

P(y=+1|xi,w) = sigmoid(wTh(xi))

[Figure: the sigmoid curve, rising from 0.0 as wTh(xi) → -∞, through 0.5 at wTh(xi) = 0, toward 1.0 as wTh(xi) → +∞.]
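In code the model is one line; this is a sketch with illustrative names, using the coefficients from the earlier decision-boundary example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(w, h_x):
    # P(y=+1|x,w) = sigmoid(wTh(x))
    return sigmoid(np.dot(w, h_x))

w = np.array([0.0, 1.0, -1.5])    # [w0, w_#awesome, w_#awful]
h_x = np.array([1.0, 2.0, 1.0])   # [constant, #awesome, #awful]
print(prob_positive(w, h_x))      # sigmoid(0.5) ≈ 0.62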
The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ large coefficient values
⇒ ŵTh(xi) is very positive (or very negative)
⇒ sigmoid(ŵTh(xi)) goes to 1 (or to 0)
⇒ the model becomes extremely overconfident in its predictions
Effect of coefficients on logistic regression model

Input x: #awesome = 2, #awful = 1

[Figure: three plots of P(y=+1|x,w) = 1 / (1 + e^(-wTh(x))) against (#awesome - #awful), for three coefficient settings. As the coefficients grow, the sigmoid gets steeper:]

w0 = 0, w#awesome = +1, w#awful = -1
w0 = 0, w#awesome = +2, w#awful = -2
w0 = 0, w#awesome = +6, w#awful = -6
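Working the slide's example numerically (a sketch; in this setting w0 = 0 and the coefficients are symmetric, so P = sigmoid(w · (#awesome - #awful))). Scaling the coefficients leaves the sign of the score, and hence the prediction, unchanged, but drives the probability toward 0 or 1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

diff = 2 - 1   # #awesome - #awful for the input x
for w in (1, 2, 6):
    print(w, sigmoid(w * diff))
# w=1 -> 0.73, w=2 -> 0.88, w=6 -> 0.998: same prediction, far more "confident"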
Learned probabilities

P(y = +1 | x, w) = 1 / (1 + e^(-wTh(x)))

Feature    Value    Coefficient learned
h0(x)      1           0.23
h1(x)      x[1]        1.12
h2(x)      x[2]       -1.07

[Figure: the learned probability surface over the (#awesome, #awful) plane.]
Quadratic features: Learned probabilities

P(y = +1 | x, w) = 1 / (1 + e^(-wTh(x)))

Feature    Value       Coefficient learned
h0(x)      1              1.68
h1(x)      x[1]           1.39
h2(x)      x[2]          -0.58
h3(x)      (x[1])^2      -0.17
h4(x)      (x[2])^2      -0.96
Overfitting ⇒ overconfident predictions

[Figures: learned probabilities for degree-6 and degree-20 features; the uncertainty regions shrink to tiny slivers.]

Tiny uncertainty regions ⇒ overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong! ☹
Overfitting in logistic regression: Another perspective (OPTIONAL)
Linearly-separable data

Data are linearly separable if:
•  There exist coefficients ŵ such that:
- For all positive training data, Score(xi) > 0
- For all negative training data, Score(xi) < 0

In that case, training_error(ŵ) = 0.

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable.
Effect of linear separability on coefficients

[Figure: the same #awesome/#awful plane with a separating line between Score(x) > 0 and Score(x) < 0.]

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5.
Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15.
Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5x10^9.
Effect of linear separability on weights

Consider the training point x with x[1] = 2, x[2] = 1:

•  ŵ1 = 1.0, ŵ2 = -1.5:  P(y=+1|xi, ŵ) = sigmoid(0.5) ≈ 0.62
•  ŵ1 = 10, ŵ2 = -15:  P(y=+1|xi, ŵ) = sigmoid(5) ≈ 0.99
•  ŵ1 = 10^9, ŵ2 = -1.5x10^9:  P(y=+1|xi, ŵ) ≈ 1.0

Maximum likelihood estimation (MLE) prefers the most certain model ⇒
Coefficients go to infinity for linearly-separable data!!!
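A sketch on made-up separable data: scaling any separating direction w upward strictly increases the likelihood, so MLE has no finite optimum:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy separable data: columns are (#awesome, #awful); labels in {-1,+1}
X = np.array([[3.0, 1.0], [2.0, 0.0], [0.0, 1.0], [1.0, 3.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, -1.5])   # a separating direction from the slide

for scale in (1, 10, 100):
    probs = sigmoid(y * (X @ (scale * w)))   # P(correct label) per point
    print(scale, np.prod(probs))             # likelihood keeps increasing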
Overfitting in logistic regression is "twice as bad"

Learning tries to find a decision boundary that separates the data:
•  Overly complex boundary
•  If data are linearly separable, coefficients go to infinity!
(e.g., ŵ1 = 10^9, ŵ2 = -1.5x10^9)
Penalizing large coefficients to mitigate overfitting
[Diagram: the ML pipeline — training data → feature extraction h(x) → ML model → ŵ, with a quality metric feeding the ML algorithm.]

P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x)))
Desired total cost format

Want to balance:
i.  How well the function fits the data (the data likelihood: large # = good fit to training data)
ii.  Magnitude of the coefficients (large # = overfit)

Total quality = measure of fit - measure of magnitude of coefficients

We want to balance these two terms.
Maximum likelihood estimation (MLE): Measure of fit = Data likelihood

•  Choose coefficients w that maximize the likelihood:

ℓ(w) = ∏_{i=1..N} P(yi | xi, w)

•  Typically, we use the log of the likelihood function (simplifies the math and has better convergence properties):

ℓ(w) = ln ∏_{i=1..N} P(yi | xi, w)
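A numerically stable sketch of the log-likelihood, using the identity P(yi|xi,w) = sigmoid(yi · wTh(xi)) for labels in {-1,+1} (names are mine):

import numpy as np

def log_likelihood(w, H, y):
    # H: N x D matrix of features h(xi); y: labels in {-1,+1}
    margins = y * (H @ w)
    # ln sigmoid(m) = -ln(1 + e^(-m)); logaddexp avoids overflow
    return -np.sum(np.logaddexp(0.0, -margins))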
Measure of magnitude of logistic regression coefficients

What summary number is indicative of the size of the coefficients?
-  Sum of squares (L2 norm)
-  Sum of absolute values (L1 norm)
Consider a specific total cost

Total quality = measure of fit - measure of magnitude of coefficients
              = ℓ(w) - ||w||2^2
Consider the resulting objective

What if ŵ is selected to maximize

ℓ(w) - λ||w||2^2

where the tuning parameter λ balances fit and magnitude?

If λ=0: reduces to MLE (no penalty on magnitude)
If λ=∞: ŵ = 0 (only the penalty matters)
If λ in between: balances fit to the data against the magnitude of the coefficients
Selecting ŵ to maximize ℓ(w) - λ||w||2^2 is L2 regularized logistic regression; the tuning parameter λ balances fit and magnitude.

Pick λ using:
•  Validation set (for large datasets)
•  Cross-validation (for smaller datasets)
(see regression course)
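One sketch of the validation-set recipe, using scikit-learn rather than the course's own tools; note that scikit-learn parameterizes the penalty strength as C = 1/λ:

from sklearn.linear_model import LogisticRegression

def pick_lambda(X_train, y_train, X_val, y_val,
                lambdas=(1e-5, 1e-3, 1e-1, 1.0, 10.0)):
    best = None
    for lam in lambdas:
        model = LogisticRegression(penalty='l2', C=1.0 / lam)
        model.fit(X_train, y_train)
        acc = model.score(X_val, y_val)   # validation accuracy
        if best is None or acc > best[1]:
            best = (lam, acc, model)
    return best   # (best λ, its validation accuracy, fitted model)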
Bias-variance tradeoff

Large λ: high bias, low variance
(e.g., ŵ = 0 for λ=∞)

Small λ: low bias, high variance
(e.g., maximum likelihood (MLE) fit of a high-order polynomial for λ=0)

In essence, λ controls model complexity.
Visualizing the effect of regularization on logistic regression
Degree 20 features, λ=0

Feature    Value        Coefficient learned
h0(x)      1               8.7
h1(x)      x[1]            5.1
h2(x)      x[2]           78.7
…          …                …
h11(x)     (x[1])^6       -7.5
h12(x)     (x[2])^6     3803
h13(x)     (x[1])^7       21.1
h14(x)     (x[2])^7    -2406
…          …                …
h37(x)     (x[1])^19      -2*10^-6
h38(x)     (x[2])^19      -0.15
h39(x)     (x[1])^20      -2*10^-8
h40(x)     (x[2])^20       0.03

Coefficients range from -3170 to 3803.
Degree 20 features, effect of regularization penalty λ

Regularization           λ = 0           λ = 0.00001      λ = 0.001       λ = 1           λ = 10
Range of coefficients    -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57   -0.05 to 0.22
Decision boundary        [plots: the boundary grows smoother as λ increases]
Coefficient path

[Figure: coefficients ŵj plotted against λ; all paths shrink toward 0 as λ grows.]
Degree 20 features: regularization reduces "overconfidence"

Regularization           λ = 0           λ = 0.00001      λ = 0.001       λ = 1
Range of coefficients    -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57
Learned probabilities    [plots: the region of uncertain predictions around the boundary widens as λ increases]
Finding the best L2 regularized linear classifier with gradient ascent
[Diagram: the same ML pipeline as before — training data → feature extraction h(x) → ML model → ŵ, with the quality metric feeding the ML algorithm; P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵTh(x))).]
Gradient ascent

Algorithm:
while not converged:
  w(t+1) ← w(t) + η ∇ℓ(w(t))
Gradient of L2 regularized log-likelihood

Total quality = measure of fit - measure of magnitude of coefficients
              = ℓ(w) - λ||w||2^2

Total derivative = ∂ℓ(w)/∂wj - λ ∂||w||2^2/∂wj
Derivative of the (log-)likelihood

∂ℓ(w)/∂wj = Σ_{i=1..N} hj(xi) (1[yi = +1] - P(y = +1 | xi, w))

Reading the formula: a sum over data points, of the feature value hj(xi) times the difference between truth and prediction.

Derivative of the L2 penalty:

∂||w||2^2/∂wj = 2 wj

Total derivative:

∂ℓ(w)/∂wj - λ ∂||w||2^2/∂wj = Σ_{i=1..N} hj(xi) (1[yi = +1] - P(y = +1 | xi, w)) - 2 λ wj
Understanding the contribution of L2 regularization

The term from the L2 penalty in the gradient step is -2λwj. Its impact on wj:
•  wj > 0: -2λwj < 0, so the update decreases wj (toward 0)
•  wj < 0: -2λwj > 0, so the update increases wj (toward 0)

Either way, the penalty term shrinks each coefficient toward zero.
Summary of gradient ascent for logistic regression with L2 regularization

init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D:
    partial[j] = Σ_{i=1..N} hj(xi) (1[yi = +1] - P(y = +1 | xi, w(t)))
    wj(t+1) ← wj(t) + η (partial[j] - 2λ wj(t))
  t ← t + 1
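A direct NumPy transcription of this algorithm, offered as a sketch (variable names are mine, and a fixed iteration count stands in for the convergence test):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2_logistic(H, y, eta=1e-3, lam=1.0, n_iters=500):
    # H: N x (D+1) feature matrix (column of 1s for h0); y in {-1,+1}
    w = np.zeros(H.shape[1])                # init w(1) = 0
    indicator = (y == +1).astype(float)     # 1[yi = +1]
    for _ in range(n_iters):                # "while not converged"
        prob = sigmoid(H @ w)               # P(y=+1 | xi, w(t))
        partial = H.T @ (indicator - prob)  # sum_i hj(xi)(1[yi=+1] - P)
        w = w + eta * (partial - 2.0 * lam * w)   # ascent step per wj
    return w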
Sparse logistic regression with L1 regularization
Recall sparsity (many ŵj = 0) gives efficiency and interpretability

Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ is sparse, computation only depends on the # of non-zeros:

ŷi = sign( Σ_{ŵj ≠ 0} ŵj hj(xi) )

Interpretability:
-  Which features are relevant for prediction?
Sparse logistic regression

Total quality = measure of fit - measure of magnitude of coefficients
              = ℓ(w) - λ||w||1,   where ||w||1 = |w0| + … + |wD|

This is L1 regularized logistic regression. It leads to sparse solutions!
L1 regularized logistic regression

Just like L2 regularization, the solution is governed by a continuous tuning parameter λ, which balances fit and sparsity:

ℓ(w) - λ||w||1

If λ=0: reduces to MLE (no sparsity)
If λ=∞: ŵ = 0
If λ in between: sparse solutions (some ŵj = 0)
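A sketch with scikit-learn, whose 'liblinear' and 'saga' solvers support the L1 penalty (C = 1/λ as before; the data here are synthetic):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)
model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
model.fit(X, y)
print(np.sum(model.coef_ == 0), "of", model.coef_.size,
      "coefficients are exactly 0")   # many ŵj driven to zero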
Regularization path – L2 penalty

[Figure: coefficients ŵj vs. λ; with the L2 penalty, coefficients shrink smoothly toward 0 but rarely become exactly 0.]
Regularization path – L1 penalty

[Figure: coefficients ŵj vs. λ; with the L1 penalty, coefficients hit exactly 0 one by one as λ grows, yielding sparse solutions.]
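A sketch of how such a path can be traced by refitting over a grid of λ values (scikit-learn again, C = 1/λ); swapping penalty='l2' for penalty='l1' with solver='liblinear' reproduces the contrast between the two figures:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
lambdas = np.logspace(-4, 2, 20)
path = np.array([LogisticRegression(penalty='l2', C=1.0 / lam)
                 .fit(X, y).coef_.ravel() for lam in lambdas])
# plotting each column of `path` against λ (log scale) gives the path plot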
Summary of overfitting in logistic regression
What you can do now…
•  Identify when overfitting is happening
•  Relate large learned coefficients to overfitting
•  Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
•  Motivate the form of the L2 regularized logistic regression quality metric
•  Describe what happens to estimated coefficients as the tuning parameter λ is varied
•  Interpret a coefficient path plot
•  Estimate L2 regularized logistic regression coefficients using gradient ascent
•  Describe the use of L1 regularization to obtain sparse logistic regression solutions

©2015-2016 Emily Fox & Carlos Guestrin
