Intro to Classification: Logistic Regression & SVM

Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
About me
  Graduate student at Carnegie Mellon University
  Statistical machine learning
    Topic models
    Sparse network learning
    Optimization
  Domains of interest
    Social media analysis
    Systems biology
    Genetics
    Sentiment analysis
    Text processing




Machine learning
  Getting computers to “learn from experience”
  Learn : to be able to predict “unseen” examples


  Many applications
    Search
    Machine translation
    Speech recognition
    Vision : identify cars, people, sky, apples
    Robot control


  Introductions :
    http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
    http://videolectures.net/mlss2010_lawrence_mlfcs/
Classification

  Is this the digit “9” ?

  Will this patient survive ?




  Will this user click on my ad ?



Predict the next coin toss
  Task : predict the next toss
  Data : THTTTTHHTHTHTTT

  Model 1 : Coin is tossed with probability p (of being tails)
  Model 2 : Toss depends on wind condition W, starting pose S, torque T

  Parameters : p (Model 1) ; W, S, T (Model 2)
Predict the next coin toss
  Data : THTTTTHHTHTHTTT

  Learning :
    Model 1 : p = 2/3
    Model 2 : W = 12.2, S = 1, T = 0.23
Predict the next coin toss
  “I predict the next toss to be T”

  Inference :
    Model 1 : p = 2/3
    Model 2 : W = 12.2, S = 1, T = 0.23
Inference
  Parameter : p=2/3


  Predicted next 9 tosses           ….H H H T T T T T T
  Observed next 9 tosses             ….T T T T T T H H H
  Accuracy = 2/9 

  Predicted next 9 tosses           ….T T T T T T T T T
  Observed next 9 tosses             ….T T T T T T H H H
  Accuracy = 6/9 

  Inference rule :
    if p > 0.5, always predict T,
    if p < 0.5, always predict H.
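To make learning and inference for Model 1 concrete, here is a minimal Python sketch (my own illustration, not code from the talk) that estimates p from the observed tosses and applies the inference rule above.

# Illustration only: maximum-likelihood estimate of p for Model 1,
# followed by the inference rule "predict T whenever p > 0.5".
tosses = "THTTTTHHTHTHTTT"

p = tosses.count("T") / len(tosses)   # 10 tails out of 15 tosses -> p = 2/3
prediction = "T" if p > 0.5 else "H"

print(p, prediction)                  # 0.666..., 'T'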
The anatomy of classification
     1.  What is the data (features X, label y) ? ★★★
     2.  What is the model ? Model parameterization (w)
     3.  Inference : Given X, w, predict the label.
     4.  Learning    : Given (X,y) pairs, learn the “best” w
            Define “best” – maximize an objective function



  Train time : (X, Y) pairs  →  Learning   →  w
  Test time  : (X, ?), w     →  Inference  →  predicted Y
Logistic Regression




Predict speaker success
  X = Number of hours spent in preparation
  Y = Was the speaker “good”?




  Prediction : Y = I ( X > h )
    where I(a) = 1 if a == TRUE, 0 if a == FALSE
Predict speaker success
                     Y = I ( X > h)

  Learning h is difficult.
  Not robust : the prediction flips abruptly at the threshold h.




     P(Y = 1 | w, X) = 1 / (1 + e^-(wX + w0))

  Compare with the hard threshold : Y = I ( X > 10 )
Logistic (sigmoidal) function




Extend to d dimensions

     P(Y = 1 | w, X) = 1 / (1 + e^-(w1 X1 + w2 X2 + ... + wd Xd + w0))

     P(Y = 1 | w, X) = 1 / (1 + e^-(w . X + w0))
Logistic regression
  Model parameter : w

     P(Y = 1 | w, X) = 1 / (1 + e^-(wX + w0))

  Example : Given X = 0.9 , w = 1.2 (and w0 = 0)
    => wX = 1.08, P(Y=1|X=0.9) = 0.7465 ~ 0.75
    Toss a coin with p = 3/4

  Example : Given X = -1.1 , w = 1.2 (and w0 = 0)
    => wX = -1.32, P(Y=1|X=-1.1) = 0.2108 ~ 0.2
    Toss a coin with p = 1/5
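As a quick check of the arithmetic above, here is a minimal Python sketch (mine, not from the slides) that evaluates the logistic function for the two examples, assuming w0 = 0 as the slide does.

import math

def p_y1(x, w, w0=0.0):
    """P(Y = 1 | w, X) for 1-dimensional logistic regression."""
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))

print(p_y1(0.9, 1.2))    # 0.7465...  ~ 3/4
print(p_y1(-1.1, 1.2))   # 0.2108...  ~ 1/5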
Another view of logistic regression
  Log odds : ln [ p/(1-p) ] = wX + w0

  p / (1-p) = e^(wX + w0)

  p = (1-p) e^(wX + w0)

  p (1 + e^(wX + w0)) = e^(wX + w0)

  p = e^(wX + w0) / (1 + e^(wX + w0)) = 1 / (1 + e^-(wX + w0))

  Logistic regression is a “linear regression” between the log-
  odds of an event and the features (X)
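A one-line numeric check of this view (my illustration, not from the talk): the log-odds of the model's prediction should come back as wX + w0.

import math

x, w = 0.9, 1.2                          # the earlier example, with w0 = 0
p = 1.0 / (1.0 + math.exp(-w * x))       # 0.7465...
print(math.log(p / (1.0 - p)))           # 1.08 == w * x : the log-odds are linear in X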
The anatomy of classification
1.  What is the data (features X, label y) ?                ✔
2.  What is the model ? Model parameterization (w)          ✔
3.  Inference : Given X, w, predict the label.              ✔
4.  Learning : Given (X,y) pairs, learn the “best” w
      Define “best” – maximize an objective function




Learning : Finding the best w
   Data : (X1, Y1), (X2, Y2), ..., (Xn, Yn)

   Expressing the conditional log-likelihood :

      l(w) = Σi ln P(yi | xi, w)

   If yi == 1, maximize P(yi=1 | xi, w)
   If yi == 0, maximize P(yi=0 | xi, w)

   Learning : maximize the log-likelihood l(w)
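Written out in code, the conditional log-likelihood for the 1-dimensional model looks like the minimal sketch below (my illustration; the helper name and the w0 default are assumptions, not from the slides).

import math

def log_likelihood(data, w, w0=0.0):
    """l(w) = sum_i ln P(y_i | x_i, w) for (x, y) pairs with y in {0, 1}."""
    ll = 0.0
    for x, y in data:
        p1 = 1.0 / (1.0 + math.exp(-(w * x + w0)))   # P(y = 1 | x, w)
        ll += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return ll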
Learning : Example

   Data : (5, 0), (11, 1), (25, 1)

      l(w) = ln P(y = 0 | x = 5, w) + ln P(y = 1 | x = 11, w) + ln P(y = 1 | x = 25, w)

   P(Y=1|X,w) is a logistic function, and P(y=1|x) + P(y=0|x) = 1, so

      l(w) = ln( 1 − 1/(1 + e^-(5w + w0)) ) + ln( 1/(1 + e^-(11w + w0)) ) + ln( 1/(1 + e^-(25w + w0)) )
Optimization : Pick the “best” w




1.    Weka
2.    Matlab : w = mnrfit(X,Y)
3.    R : w <- glm(Y~X, family=binomial(link="logit"))
4.    IRLS : http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
5.    Implement your own
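For option 5, a minimal "implement your own" sketch (my illustration, with an arbitrary learning rate and iteration count) is gradient ascent on the log-likelihood, here run on the three data points from the previous example.

import math

data = [(5, 0), (11, 1), (25, 1)]         # the data from the learning example
w, w0, rate = 0.0, 0.0, 0.01

for _ in range(5000):
    grad_w, grad_w0 = 0.0, 0.0
    for x, y in data:
        p1 = 1.0 / (1.0 + math.exp(-(w * x + w0)))   # P(y = 1 | x, w)
        grad_w += (y - p1) * x                        # d l(w) / d w
        grad_w0 += (y - p1)                           # d l(w) / d w0
    w += rate * grad_w
    w0 += rate * grad_w0

# The data is linearly separable, so w keeps growing slowly (see the
# over-fitting slides); the boundary x = -w0/w settles between x=5 and x=11.
print(w, w0)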
Decision surface is linear

  [Figure : two classes (Y=0 and Y=1) separated by a linear boundary; the
   mis-classified points are marked as errors]
Decision surface is linear




http://www.cs.technion.ac.il/~rani/LocBoost/
So far..
  Logistic regression is a binary classifier (multinomial
   version exists)
  P(Y=1|X,w) is a logistic function
  Inference : Compute P(Y=1|X,w), and do “rounding”.
  Parameters are learnt by maximizing the log-likelihood of the data.
  Decision surface is linear (kernelized version exists)




Improvements in the model
  Prevent over-fitting          →  Regularization
  Maximize accuracy directly    →  SVMs
  Non-linear decision surface   →  Kernel Trick
  Multi-label data




Occam’s razor


The simplest explanation is most likely the correct
one




New and improved learning
      “Best” w == maximize log-likelihood
      Maximum Likelihood Estimate (MLE)

                    Small concern … over-fitting

  If the data is linearly separable, the MLE pushes w → ∞
L2 regularization

     || w ||2^2 = Σi wi^2

     max_w  l(w) − λ || w ||2^2

  Prevents over-fitting
  “Pushes” parameters towards zero
  Equivalent to a prior on the parameters
      Normal distribution (0 mean, unit covariance)

             λ : tuning parameter (e.g. 0.1)
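In code, the L2-regularized objective is a one-line change to the log-likelihood from before (my sketch; leaving the intercept w0 unpenalized is my assumption, the slide does not say).

import math

def penalized_objective(data, w, w0, lam=0.1):
    """l(w) - lambda * ||w||_2^2 for 1-dimensional logistic regression."""
    ll = 0.0
    for x, y in data:
        p1 = 1.0 / (1.0 + math.exp(-(w * x + w0)))
        ll += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return ll - lam * (w ** 2)        # the intercept w0 is left unpenalized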
Patient Diagnosis
  Y = disease
  X = [age, weight, BP, blood sugar, MRI, genetic tests …]


  Don’t want all “features” to be relevant.


  Weight vector w should be “mostly zeros”.




L1 regularization

     || w ||1 = Σi | wi |

     max_w  l(w) − λ || w ||1

  Prevents over-fitting
  “Pushes” parameters to zero
  Equivalent to a prior on the parameters
      Laplace distribution

      As λ increases, more weights become exactly zero (irrelevant features)
L1 v/s L2 example
  MLE estimate         : [ 11 0.8 ]

  L2 estimate          : [ 10 0.6 ]      shrinkage

  L1 estimate          : [ 10.2 0 ]     sparsity

  Mini-conclusion :
    L2 optimization is fast, L1 tends to be slower. If you have the
     computational resources, try both (at the same time) !
    ALWAYS run logistic regression with at least some
     regularization.
    Corollary : ALWAYS run logistic regression on features that
     have been standardized (zero mean, unit variance)
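The two "ALWAYS" recommendations above can be combined in a few lines; the sketch below uses scikit-learn (my choice of library, not one listed in the talk) on the toy data from the learning example earlier.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[5.0], [11.0], [25.0]])        # toy data from the learning example
y = np.array([0, 1, 1])

Xs = StandardScaler().fit_transform(X)       # zero mean, unit variance

l2 = LogisticRegression(penalty="l2", C=1.0).fit(Xs, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(Xs, y)
print(l2.coef_, l1.coef_)                    # compare the learnt weights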
So far …
  Logistic regression
    Model
    Inference
    Learning via maximum likelihood
    L1 and L2 regularization




  Next …. SVMs !



Why did we use probability again?
  Aim : Maximize “accuracy”


  Logistic regression : an indirect method that maximizes the
  likelihood of the data.

  A more direct approach is to optimize accuracy itself.


        Support Vector Machines (SVMs)


Maximize the margin

  [Figure (© Carlos Guestrin 2005-2007) : a separating hyperplane placed to
   maximize the margin between the two classes]
Geometry review
  Separating line : 2x1 + x2 − 2 = 0   (Y = 1 where d > 0, Y = −1 where d < 0)

  For a point on the line :
    (0.5, 1) : d = 2*0.5 + 1 − 2 = 0

  Signed “distance” to the line from (x10, x20) :
    d = 2*x10 + x20 − 2
Geometry review
  Separating line : 2x1 + x2 − 2 = 0

  (1, 2.5) : d = 2*1 + 2.5 − 2 = 2.5 > 0
             y (w.x + b) = 1 * 2.5 = 2.5 > γ
Geometry review
  Separating line : 2x1 + x2 − 2 = 0

  (0.5, 0.5) : d = 2*0.5 + 0.5 − 2 = −0.5 < 0
               y (w.x + b) = y * d = −1 * −0.5 = 0.5
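The three geometry checks above fit in a few lines of Python (my illustration): compute the signed distance d = w.x + b and the quantity y*d for each example point.

import numpy as np

w, b = np.array([2.0, 1.0]), -2.0              # the line 2*x1 + x2 - 2 = 0

for x, y in [((0.5, 1.0), 1), ((1.0, 2.5), 1), ((0.5, 0.5), -1)]:
    d = np.dot(w, np.array(x)) + b             # signed "distance" to the line
    print(x, d, y * d)                         # y*d > 0 means correctly classified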
Support Vector Machines
Normalized margin – canonical hyperplanes

  [Figure (© Carlos Guestrin 2005-2007) : canonical hyperplanes w.x + b = ±1,
   with points x+ and x− on the two margins]

  Support vectors are the points touching the margins.
Slack variables

  (Notation : w.x = Σj w(j) x(j))

  SVMs are made robust by adding “slack variables” ξi that allow the
   training error to be non-zero.
  One slack variable for each data point ; ξi == 0 for correctly
   classified points.

     Maximize the margin :   max  γ − C Σi ξi
Slack variables

     max  γ − C Σi ξi

  Need to tune C :
    high C == minimize mis-classifications
    low C == maximize the margin
SVM summary
  Model :     w.x + b > 0       if y = +1
               w.x + b < 0       if y = -1

  Inference : ŷ = sign(w.x+b)


  Learning : Maximize { (margin) - C ( slack-variables) }




                  Next … Kernel SVMs


The kernel trick
  Why a linear separator ? What if the data is not linearly separable ?

                                          The kernel trick
                                          allows you to use
                                          SVMs with non-
                                          linear separators.

                                          Different kernels
                                          1.  Polynomial
                                          2.  Gaussian
                                          3.  Exponential


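As a concrete (and entirely illustrative) sketch of the kernel trick in practice, the snippet below fits kernel SVMs with scikit-learn, a library choice of mine rather than the talk's, on a tiny XOR-style data set that no single line can separate.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])                        # XOR-like labels: not linearly separable

poly = SVC(kernel="poly", degree=2, C=10.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)   # gamma is roughly 1 / bandwidth^2

print(poly.predict(X), rbf.predict(X))            # predictions on the training points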
Logistic regression vs. Linear SVM

  [Figure : decision boundaries of logistic regression and a linear SVM
   on the demo data]

  Error ~ 40% in both cases
Kernel SVM with polynomial kernel
of degree 2

  Polynomial kernels of degree 2 or 4 do very well, but degree 3 or 5
  do very badly.

  The Gaussian kernel has a tuning parameter (the bandwidth);
  performance depends on picking the right bandwidth.

        Error = 7%
SVMs summary
  Maximize the margin between positive and negative
   examples.
  Kernel trick is widely implemented, allowing non-linear
   decision surface.
  Not probabilistic 




  Software :
    SVM-light http://svmlight.joachims.org/,
    LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
      Weka, Matlab, R



Demo


       http://www.cs.technion.ac.il/~rani/LocBoost




Which to use ?
  Linear SVMs and logistic regression behave very similarly in
   most cases.
  Kernelized SVMs usually work better than linear SVMs.
  Kernelized logistic regression is possible, but implementations
   are not easily available.




Recommendations
1.  First, try logistic regression. Easy, fast, stable. No “tuning”
    parameters.
2.  Alternatively, you can first try linear SVMs, but you need
    to tune “C”
3.  If results are “good enough”, stop.
4.  Else try SVMs with Gaussian kernels.
      Need to tune bandwidth and C – using validation data (see the
       sketch after this slide).

If you have more time/computational resources, try random
     forests as well.


** Recommendations are opinions of the presenter, and not known facts.

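The tuning step in recommendation 4 is usually automated with a validation-based grid search; here is a minimal sketch using scikit-learn (my choice of library and of toy data, not the presenter's).

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy data
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # non-linear labels

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)       # pick (C, bandwidth) by validation accuracy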
In conclusion …


    Logistic Regression
    Support Vector Machines


       Other classification approaches …

    Random forests / decision trees
    Naïve Bayes
    Nearest Neighbors
    Boosting (Adaboost)
Thank you
Questions?




Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
Is this athlete doing drugs ?
  X = Blood-test-to-detect-drugs
  Y = Doped athlete ?


  Two types of errors :
    Athlete is doped, we predict “NO” : false negative
    Athlete is NOT doped, we predict “YES” : false positive


  Penalize false positives more than false negatives




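One common way to encode "penalize false positives more than false negatives" is to weight the classes asymmetrically; the sketch below does this with scikit-learn's class_weight option (the library, the toy data, and the 5:1 weighting are all my illustrative assumptions).

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.35], [0.4], [0.7], [0.8], [0.9]])   # toy blood-test scores
y = np.array([0, 0, 0, 1, 1, 1])                            # 1 = doped athlete

# Up-weighting class 0 makes the classifier reluctant to predict "doped",
# i.e. false positives are penalized more heavily than false negatives.
clf = LogisticRegression(class_weight={0: 5.0, 1: 1.0}).fit(X, y)
print(clf.predict(np.array([[0.6]])))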
Outline
  What is classification ?
     Parameters, data, inference, learning
     Predicting coin tosses (0-dimensional X)
  Logistic Regression
     Predicting “speaker success” (1-dimensional X)
     Formulation, optimization
     Decision surface is linear
     Interpreting coefficients
     Hypothesis testing
     Evaluating the performance of the model
     Why is it called “regression” : log-odds
     L2 regularization
     Patient survival (d-dimensional X)
     L1 regularization
  Support Vector Machines
     Linear SVMs + formulation
     What are “support vectors”
     The kernel trick
  Demo : logistic regression v/s SVMs v/s kernel tricks
Overfitting : a more serious problem




2x+y-2 = 0               w = [2 1 -2]
4x+2y-4 = 0              w = [4 2 -4]
400x+200y-400 = 0        w = [400 200 -400]

Same decision boundary, but the larger weights make the predicted
probabilities arbitrarily (over-)confident.

 Absolutely need L2 regularization
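A quick numeric illustration of this slide's point (my sketch): the three weight vectors define the same boundary, yet scaling w pushes P(Y=1|x) towards 0 or 1 for every point off the boundary.

import math

def p_y1(wvec, x1, x2):
    w1, w2, w0 = wvec
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + w0)))

for wvec in ([2, 1, -2], [4, 2, -4], [400, 200, -400]):
    print(wvec, p_y1(wvec, 1.0, 0.5))   # ~0.62, ~0.73, ~1.0 for the same point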
