Machine Learning Specialization
Linear classifiers:
Parameter learning
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
1 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization2
Learn a probabilistic classification model
©2015-2016 Emily Fox & Carlos Guestrin
Many classifiers provide a degree of certainty: P(y|x). Extremely useful in practice.
(y = output label, x = input sentence)

Definite: P(y=+1 | x = "The sushi & everything else were awesome!") = 0.99
Not sure: P(y=+1 | x = "The sushi was good, the service was OK") = 0.55
Machine Learning Specialization3
A (linear) classifier
•  Will use training data to learn a weight
or coefficient for each word
©2015-2016 Emily Fox & Carlos Guestrin
Word                      Coefficient      Value
(intercept)               ŵ0               -2.0
good                      ŵ1                1.0
great                     ŵ2                1.5
awesome                   ŵ3                2.7
bad                       ŵ4               -1.0
terrible                  ŵ5               -2.1
awful                     ŵ6               -3.3
restaurant, the, we, …    ŵ7, ŵ8, ŵ9, …     0.0
…                         …                 …
Machine Learning Specialization4
Logistic regression model
©2015-2016 Emily Fox & Carlos Guestrin
[Sigmoid curve: Score(xi) = ŵTh(xi) runs from -∞ to +∞ on the horizontal axis; P(y=+1|x,ŵ) runs from 0.0 to 1.0 on the vertical axis, crossing 0.5 at Score = 0]

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵTh(x)))
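A minimal sketch of this model in NumPy (the helper name predict_prob and the example sentence encoding are illustrative assumptions):

```python
import numpy as np

def predict_prob(w_hat, h_x):
    # P(y=+1 | x, w) = 1 / (1 + exp(-w^T h(x)))
    score = np.dot(w_hat, h_x)
    return 1.0 / (1.0 + np.exp(-score))

# Coefficients from the word table above; h(x) counts each word,
# with a leading 1 for the intercept w0:
w_hat = np.array([-2.0, 1.0, 1.5, 2.7, -1.0, -2.1, -3.3])
h_x = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # "good", "awesome"
print(predict_prob(w_hat, h_x))  # score = 1.7, probability ~0.85
```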
Machine Learning Specialization
Quality metric for logistic regression:
Maximum likelihood estimation
5 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization6 ©2015-2016 Emily Fox & Carlos Guestrin
[Block diagram: Training Data → x → Feature extraction → h(x) → ML model → y; the Quality metric drives the ML algorithm, which outputs the learned coefficients ŵ]

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵTh(x)))
Machine Learning Specialization7
Learning problem
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
0 2 -1
3 3 -1
4 1 +1
1 1 +1
2 4 -1
0 3 -1
0 1 -1
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
Training data:
N observations (xi,yi)
Optimize
quality metric
on training
data
ŵ
Machine Learning Specialization9
Finding best coefficients
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
4 1 +1
1 1 +1
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
3 3 -1
2 4 -1
0 3 -1
0 1 -1
Machine Learning Specialization11
Finding best coefficients
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
4 1 +1
1 1 +1
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
3 3 -1
2 4 -1
0 3 -1
0 1 -1
Pick ŵ that makes P(y=+1|xi,ŵ) ≈ 1.0 for the positive examples
and P(y=+1|xi,ŵ) ≈ 0.0 for the negative examples
Machine Learning Specialization12
Quality metric = Likelihood function
©2015-2016 Emily Fox & Carlos Guestrin
No ŵ achieves perfect predictions (usually)
Negative data points: want P(y=+1|xi,w) ≈ 0.0. Positive data points: want P(y=+1|xi,w) ≈ 1.0.
Likelihood ℓ(w): Measures quality of
fit for model with coefficients w
Machine Learning Specialization13
Find “best” classifier
©2015-2016 Emily Fox & Carlos Guestrin
[Scatter plot: #awesome vs. #awful, with candidate decision boundaries for different coefficient settings]

Maximize likelihood over all possible w0, w1, w2:
ℓ(w0=1, w1=0.5, w2=-1.5) = 10⁻⁴
ℓ(w0=1, w1=1, w2=-1.5) = 10⁻⁵
ℓ(w0=0, w1=1, w2=-1.5) = 10⁻⁶

Best model: highest likelihood ℓ(w)
ŵ = (w0=1, w1=0.5, w2=-1.5)
Machine Learning Specialization
Data likelihood
©2015-2016 Emily Fox & Carlos Guestrin14
Machine Learning Specialization15
Quality metric: probability of data
©2015-2016 Emily Fox & Carlos Guestrin

x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
If model good, should predict: ŷ = +1
Pick w to maximize: P(y=+1 | x[1]=2, x[2]=1, w)

x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
If model good, should predict: ŷ = -1
Pick w to maximize: P(y=-1 | x[1]=0, x[2]=2, w)
Machine Learning Specialization16
Maximizing likelihood
(probability of data)
Data
point
x[1] x[2] y Choose w to maximize
x1,y1
2 1 +1
x2,y2
0 2 -1
x3,y3
3 3 -1
x4,y4
4 1 +1
x5,y5
1 1 +1
x6,y6
2 4 -1
x7,y7
0 3 -1
x8,y8
0 1 -1
x9,y9
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
Must combine into single
measure of quality
Machine Learning Specialization17
Learn logistic regression model with
maximum likelihood estimation (MLE)
Data
point
x[1] x[2] y Choose w to maximize
x1,y1
2 1 +1
x2,y2
0 2 -1
x3,y3
3 3 -1
x4,y4
4 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
P(y=+1|x[1]=2, x[2]=1,w)
P(y=-1|x[1]=0, x[2]=2,w)
P(y=-1|x[1]=3, x[2]=3,w)
P(y=+1|x[1]=4, x[2]=1,w)
ℓ(w) = P(y1|x1,w) · P(y2|x2,w) · P(y3|x3,w) · P(y4|x4,w)

In general, over N data points:

ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)
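As a sketch, this likelihood can be evaluated directly for any candidate w (helper names are assumptions; in practice the log-likelihood is preferred, since a product of N small probabilities underflows):

```python
import numpy as np

def likelihood(w, H, y):
    # l(w) = prod over i of P(yi | xi, w), where H has rows h(xi)
    # and y holds labels in {+1, -1}.
    p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))  # P(y=+1 | xi, w)
    p_yi = np.where(y == +1, p_plus, 1.0 - p_plus)
    return np.prod(p_yi)
```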
Machine Learning Specialization
Finding best linear classifier
with gradient ascent
19 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization20 ©2015-2016 Emily Fox & Carlos Guestrin
[Block diagram: Training Data → x → Feature extraction → h(x) → ML model → y; the Quality metric drives the ML algorithm, which outputs the learned coefficients ŵ]

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵTh(x)))
Machine Learning Specialization22
Find “best” classifier
©2015-2016 Emily Fox & Carlos Guestrin
[Scatter plot: #awesome vs. #awful, with candidate decision boundaries for different coefficient settings]

Maximize likelihood over all possible w0, w1, w2:
ℓ(w0=1, w1=0.5, w2=-1.5) = 10⁻⁴
ℓ(w0=1, w1=1, w2=-1.5) = 10⁻⁵
ℓ(w0=0, w1=1, w2=-1.5) = 10⁻⁶

Best model: highest likelihood ℓ(w)
ŵ = (w0=1, w1=0.5, w2=-1.5)

ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)
Machine Learning Specialization23
Maximizing likelihood
©2015-2016 Emily Fox & Carlos Guestrin
max over w0,w1,w2 of ℓ(w0,w1,w2) = ∏_{i=1}^{N} P(yi | xi, w)

ℓ(w0,w1,w2) is a function of 3 variables; maximize it over all possible w0,w1,w2.

No closed-form solution → use gradient ascent
Machine Learning Specialization
Review of gradient ascent
25 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization27
Finding the max
via hill climbing
©2015-2016 Emily Fox & Carlos Guestrin
Algorithm:
  while not converged:
    w(t+1) ← w(t) + η · dℓ/dw |_(w(t))
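A 1-D sketch of this loop (the convergence test |dℓ/dw| < tol anticipates the next slide; names and defaults are illustrative):

```python
def hill_climb(dl_dw, w, eta=0.1, tol=1e-6, max_iter=100000):
    # Step uphill until the derivative is (near) zero.
    for _ in range(max_iter):
        g = dl_dw(w)
        if abs(g) < tol:
            break
        w = w + eta * g  # w(t+1) <- w(t) + eta * dl/dw at w(t)
    return w

# Example: maximize l(w) = -(w - 3)^2, whose derivative is -2(w - 3).
w_hat = hill_climb(lambda w: -2.0 * (w - 3.0), w=0.0)  # -> ~3.0
```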
Machine Learning Specialization
Convergence criteria
For convex functions, the optimum occurs when the derivative is zero:
dℓ/dw = 0
In practice, stop when the derivative is small enough:
|dℓ/dw| < ε
©2015-2016 Emily Fox & Carlos Guestrin
Algorithm:
  while not converged:
    w(t+1) ← w(t) + η · dℓ/dw |_(w(t))
28
Machine Learning Specialization29
Moving to multiple dimensions:
Gradients
©2015-2016 Emily Fox & Carlos Guestrin
∇ℓ(w) = [ ∂ℓ/∂w0, ∂ℓ/∂w1, …, ∂ℓ/∂wD ]ᵀ  (the vector of partial derivatives)
Machine Learning Specialization30
Contour plots
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization31
Gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
Algorithm:
  while not converged:
    w(t+1) ← w(t) + η ∇ℓ(w(t))
Machine Learning Specialization
Learning algorithm for
logistic regression
33 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization35
Derivative of (log-)likelihood
©2015-2016 Emily Fox & Carlos Guestrin
∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w) )

•  Sum over data points: Σ_{i=1}^{N}
•  Feature value: hj(xi)
•  Difference between truth and prediction: 1[yi = +1] - P(y = +1 | xi, w)
•  Indicator function: 1[yi = +1] = 1 if yi = +1, and 0 otherwise
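A direct transcription of this derivative into NumPy, as a sketch (H[:, j] is assumed to hold feature j for every data point):

```python
import numpy as np

def partial_j(w, H, y, j):
    # d l(w) / d wj = sum over i of hj(xi) * (1[yi=+1] - P(y=+1|xi,w))
    p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))
    indicator = (y == +1).astype(float)  # 1[yi = +1]
    return np.sum(H[:, j] * (indicator - p_plus))
```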
Machine Learning Specialization36
Computing derivative
x[1] x[2] y P(y=+1|xi,w)
Contribution to
derivative for w1
2 1 +1 0.5
0 2 -1 0.02
3 3 -1 0.05
4 1 +1 0.88
©2015-2016 Emily Fox & Carlos Guestrin
Coefficients at iteration t: w0(t) = 0, w1(t) = 1, w2(t) = -2

Total derivative:
∂ℓ(w(t))/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w(t)) )
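The numbers in this table can be reproduced with a few lines of NumPy; a sketch using w(t) = (0, 1, -2) and h(x) = (1, x[1], x[2]):

```python
import numpy as np

H = np.array([[1., 2., 1.],  # rows are h(xi) = (1, #awesome, #awful)
              [1., 0., 2.],
              [1., 3., 3.],
              [1., 4., 1.]])
y = np.array([+1, -1, -1, +1])
w = np.array([0., 1., -2.])  # (w0, w1, w2) at iteration t

p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))   # ~[0.50, 0.02, 0.05, 0.88]
contrib = H[:, 1] * ((y == +1) - p_plus)  # per-point contribution for w1
print(contrib, contrib.sum())             # total derivative ~1.33
```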
Machine Learning Specialization37
Derivative of (log-)likelihood: Interpretation
©2015-2016 Emily Fox & Carlos Guestrin
∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w) )

For a data point with hj(xi) = 1, the contribution to the derivative is:
•  yi = +1 and P(y=+1|xi,w) ≈ 1: contribution ≈ 0 (prediction already correct)
•  yi = +1 and P(y=+1|xi,w) ≈ 0: contribution ≈ +1 (push wj up)
•  yi = -1 and P(y=+1|xi,w) ≈ 1: contribution ≈ -1 (push wj down)
•  yi = -1 and P(y=+1|xi,w) ≈ 0: contribution ≈ 0 (prediction already correct)
Machine Learning Specialization38
Summary of gradient ascent
for logistic regression
©2015-2016 Emily Fox & Carlos Guestrin
init w(1) = 0 (or randomly, or smartly), t = 1
while || ∇ℓ(w(t)) || > ε:
  for j = 0, …, D:
    partial[j] = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w(t)) )
    wj(t+1) ← wj(t) + η · partial[j]
  t ← t + 1
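Vectorized, the whole algorithm fits in a few lines; a sketch (the inner loop over j becomes one matrix product; names and defaults are illustrative):

```python
import numpy as np

def logistic_regression_mle(H, y, eta=0.01, eps=1e-4, max_iter=100000):
    # Gradient ascent on the log-likelihood. H: (N, D+1) matrix with
    # rows h(xi); y: labels in {+1, -1}.
    w = np.zeros(H.shape[1])  # init w(1) = 0
    for _ in range(max_iter):
        p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))
        grad = H.T @ ((y == +1) - p_plus)  # all D+1 partials at once
        if np.linalg.norm(grad) <= eps:    # ||grad l(w)|| <= eps
            break
        w = w + eta * grad
    return w
```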
Machine Learning Specialization
Choosing the step size η
40 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization42
Learning curve:
Plot quality (likelihood) over iterations
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization43
If step size is too small, can take a
long time to converge
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization44
Compare convergence with
different step sizes
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization45
Careful with step sizes that are too large
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization46
Very large step sizes can even
cause divergence or wild oscillations
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization47
Simple rule of thumb for picking step size η
•  Unfortunately, picking the step size requires a lot of trial and error
•  Try several values, exponentially spaced
   - Goal: plot learning curves to
      •  find one η that is too small (smooth but moving too slowly)
      •  find one η that is too large (oscillation or divergence)
•  Try values in between to find the “best” η
•  Advanced tip: can also try a step size that decreases with the
   iteration number, e.g., ηt = η0/t (a common choice; see the sketch below)
©2015-2016 Emily Fox & Carlos Guestrin
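A sketch of that search, reusing the logistic_regression_mle sketch above (the grid of values and the η0/t schedule are illustrative assumptions):

```python
# Try exponentially spaced step sizes; plot each learning curve,
# then search between the "too small" and "too large" values.
for eta in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    w = logistic_regression_mle(H, y, eta=eta)
    # ... record/plot log-likelihood per iteration for this eta ...

# Decaying schedule: at iteration t, use eta_0 / t instead of a constant.
```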
Machine Learning Specialization
Deriving the gradient for
logistic regression
49 ©2015-2016 Emily Fox & Carlos Guestrin
VERY
OPTIONAL
Machine Learning Specialization51
Log-likelihood function
•  Goal: choose coefficients w maximizing likelihood:
•  Math simplified by using log-likelihood – taking (natural) log:
©2015-2016 Emily Fox & Carlos Guestrin
ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)

ℓℓ(w) = ln ∏_{i=1}^{N} P(yi | xi, w)
Machine Learning Specialization52
The log trick, often used in ML…
•  Products become sums: ln(a · b) = ln(a) + ln(b)
•  Doesn’t change maximum, since ln is monotonically increasing!
   - If ŵ maximizes f(w): f(ŵ) ≥ f(w) for all w
   - Then ŵ maximizes ln(f(w)): ln(f(ŵ)) ≥ ln(f(w)) for all w
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Expressing the log-likelihood
54 ©2015-2016 Emily Fox & Carlos Guestrin
VERY
OPTIONAL
Machine Learning Specialization55
Using log to turn products into sums
•  The log of the product of likelihoods becomes the sum of the logs:
©2015-2016 Emily Fox & Carlos Guestrin
ℓℓ(w) = ln ∏_{i=1}^{N} P(yi | xi, w) = Σ_{i=1}^{N} ln P(yi | xi, w)
Machine Learning Specialization56
Rewriting log-likelihood
•  For simpler math, we’ll rewrite likelihood with indicators:
©2015-2016 Emily Fox & Carlos Guestrin
ℓℓ(w) = Σ_{i=1}^{N} ln P(yi | xi, w)
      = Σ_{i=1}^{N} [ 1[yi = +1] ln P(y = +1 | xi, w) + 1[yi = -1] ln P(y = -1 | xi, w) ]
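A sketch of this indicator form in NumPy (helper names assumed; note log(p) can underflow to -inf at saturated predictions, which a numerically safer form avoids):

```python
import numpy as np

def log_likelihood(w, H, y):
    # ll(w) = sum over i of
    #   1[yi=+1] ln P(y=+1|xi,w) + 1[yi=-1] ln P(y=-1|xi,w)
    p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))
    return np.sum(np.where(y == +1, np.log(p_plus), np.log(1.0 - p_plus)))
```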
Machine Learning Specialization
Deriving probability that y=-1 given x
58 ©2015-2016 Emily Fox & Carlos Guestrin
VERY
OPTIONAL
Machine Learning Specialization59
Logistic regression model: P(y=-1|x,w)
•  Probability model predicts y=+1:
   P(y=+1 | x, w) = 1 / (1 + e^(-wTh(x)))
•  Probability model predicts y=-1:
   P(y=-1 | x, w) = 1 - P(y=+1 | x, w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Rewriting the log-likelihood
61 ©2015-2016 Emily Fox & Carlos Guestrin
VERY
OPTIONAL
Machine Learning Specialization62
Plugging in logistic function for 1 data point
©2015-2016 Emily Fox & Carlos Guestrin
P(y = +1 | x, w) = 1 / (1 + e^(-wTh(x)))

P(y = -1 | x, w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

ℓℓi(w) = 1[yi = +1] ln P(y = +1 | xi, w) + 1[yi = -1] ln P(y = -1 | xi, w)
Machine Learning Specialization
Deriving gradient of log-likelihood
64 ©2015-2016 Emily Fox & Carlos Guestrin
VERY
OPTIONAL
Machine Learning Specialization65
Gradient for 1 data point
©2015-2016 Emily Fox & Carlos Guestrin
ℓℓi(w) = -(1 - 1[yi = +1]) wTh(xi) - ln( 1 + e^(-wTh(xi)) )
Machine Learning Specialization66
Finally, gradient for all data points
•  Gradient for one data point:
   ∂ℓℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w) )
•  Adding over data points:
   ∂ℓ(w)/∂wj = Σ_{i=1}^{N} hj(xi) ( 1[yi = +1] - P(y = +1 | xi, w) )
©2015-2016 Emily Fox & Carlos Guestrin
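One way to gain confidence in this result: compare the derived gradient against central finite differences of the log-likelihood. A sketch (assumes the log_likelihood sketch above and some H, y, w):

```python
import numpy as np

def grad_ll(w, H, y):
    # The derived gradient, all partial derivatives at once.
    p_plus = 1.0 / (1.0 + np.exp(-(H @ w)))
    return H.T @ ((y == +1) - p_plus)

def numeric_grad(f, w, h=1e-6):
    # Central differences: (f(w + h*ej) - f(w - h*ej)) / (2h).
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g

# numeric_grad(lambda v: log_likelihood(v, H, y), w) ~= grad_ll(w, H, y)
```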
Machine Learning Specialization
Summary of logistic
regression classifier
68 ©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization70 ©2015-2016 Emily Fox & Carlos Guestrin
[Block diagram: Training Data → x → Feature extraction → h(x) → ML model → y; the Quality metric drives the ML algorithm, which outputs the learned coefficients ŵ]

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵTh(x)))
Machine Learning Specialization72
What you can do now…
•  Measure quality of a classifier using
the likelihood function
•  Interpret the likelihood function as
the probability of the observed data
•  Learn a logistic regression model with
gradient ascent
•  (Optional) Derive the gradient
ascent update rule for logistic
regression
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization73
Simplest link function: sign(z)
©2015-2016 Emily Fox & Carlos Guestrin
sign(z) = +1 if z ≥ 0, -1 otherwise

[Step plot: z runs from -∞ to +∞ on the horizontal axis; sign(z) jumps from -1 to +1 at z = 0]

But, sign(z) only outputs -1 or +1,
no probabilities in between
Machine Learning Specialization74
Finding best coefficients
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
4 1 +1
1 1 +1
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
3 3 -1
2 4 -1
0 3 -1
0 1 -1
[Sigmoid link: Score(xi) = ŵTh(xi) ∈ (-∞, +∞) is mapped to P(y=+1|xi,ŵ) ∈ (0.0, 1.0)]
Machine Learning Specialization75
Quality metric: probability of data
©2015-2016 Emily Fox & Carlos Guestrin

x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
If model good, should predict ŷ = +1
Choose w to make P(y=+1|x[1]=2, x[2]=1, w) large:
increase probability of y=+1 when the observed label is +1

x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
If model good, should predict ŷ = -1
Choose w to make P(y=-1|x[1]=0, x[2]=2, w) large:
increase probability of y=-1 when the observed label is -1

P(y=+1 | x, ŵ) = 1 / (1 + e^(-ŵTh(x)))
Machine Learning Specialization76
Maximizing likelihood
(probability of data)
Data
point
x[1] x[2] y
Choose w to
maximize
x1,y1
2 1 +1
x2,y2
0 2 -1
x3,y3
3 3 -1
x4,y4
4 1 +1
x5,y5
1 1 +1
x6,y6
2 4 -1
x7,y7
0 3 -1
x8,y8
0 1 -1
x9,y9
2 1 +1
©2015-2016 Emily Fox & Carlos Guestrin
Must combine into single
measure of quality
Machine Learning Specialization77
Learn logistic regression model with
maximum likelihood estimation (MLE)
•  Choose coefficients w that maximize likelihood:
   ℓ(w) = ∏_{i=1}^{N} P(yi | xi, w)
•  No closed-form solution → use gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin