# Intro to Classification: Logistic Regression & SVM

A gentle introduction to 2 classification techniques, as presented by Kriti Puniyani to the NYC Predictive Analytics group (April 14, 2011). To download the file please go here: http://www.meetup.com/NYC-Predictive-Analytics/files/

Published in: Education

### Intro to Classification: Logistic Regression & SVM

1. Kriti Puniyani, Carnegie Mellon University, kriti@cmu.edu
2. About me (slides dated 4/15/11)
    - Graduate student at Carnegie Mellon University
    - Statistical machine learning: topic models, sparse network learning, optimization
    - Domains of interest: social media analysis, systems biology, genetics, sentiment analysis, text processing
3. Machine learning
    - Getting computers to “learn with experience”
    - Learn: to be able to predict “unseen” things.
    - Many applications: search, machine translation, speech recognition, vision (identify cars, people, sky, apples), robot control
    - Introductions: http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf and http://videolectures.net/mlss2010_lawrence_mlfcs/
4. Classification
    - Is this the digit “9”?
    - Will this patient survive?
    - Will this user click on my ad?
5. Predict the next coin toss
    - Task: predict the next toss. Data: THTTTTHHTHTHTTT
    - Model 1: coin is tossed with probability p (of being tails). Parameter: p
    - Model 2: toss depends on wind condition W, starting pose S, torque T. Parameters: W, S, T
6. Predict the next coin toss: learning
    - Data: THTTTTHHTHTHTTT
    - Model 1: p = 2/3
    - Model 2: W = 12.2, S = 1, T = 0.23
7. Predict the next coin toss: inference
    - Model 1: p = 2/3
    - Model 2: W = 12.2, S = 1, T = 0.23
    - “I predict the next toss to be T”
8. Inference
    - Parameter: p = 2/3
    - Predicted next 9 tosses: H H H T T T T T T; observed next 9 tosses: T T T T T T H H H; accuracy = 3/9
    - Predicted next 9 tosses: T T T T T T T T T; observed next 9 tosses: T T T T T T H H H; accuracy = 6/9
    - Inference rule: if p > 0.5, always predict T; if p < 0.5, always predict H.
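
The comparison on slide 8 can be checked with a few lines of Python. This is only a sketch: the helper name `accuracy` is an illustrative choice, and the toss sequences are copied from the slide as transcribed.

```python
def accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted toss equals the observed toss."""
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

observed = "TTTTTTHHH"
first_prediction = "HHHTTTTTT"   # first predicted sequence on the slide
always_tails = "T" * 9           # inference rule: p > 0.5, so always predict T

acc_first = accuracy(first_prediction, observed)
acc_rule = accuracy(always_tails, observed)
print(acc_first, acc_rule)
```

Always predicting the more probable outcome scores higher here than a prediction that mixes H and T, which is the point of the inference rule.
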
9. The anatomy of classification
    1. What is the data (features X, label y)?
    2. What is the model? Model parameterization (w)
    3. Inference: given X and w, predict the label.
    4. Learning: given (X, y) pairs, learn the “best” w. Define “best”: maximize an objective function.

    Diagram: at train time, (X, Y) pairs go into Learning, which produces w; at test time, (X, ?) goes into Inference, which produces the predicted Y.
10. Logistic Regression
11. Predict speaker success
    - X = number of hours spent in preparation
    - Y = was the speaker “good”?
    - Prediction: $Y = I(X > h)$, where $I(a) = 1$ if $a$ is TRUE and $I(a) = 0$ if $a$ is FALSE
12. Predict speaker success: $Y = I(X > h)$
    - Learning is difficult.
    - Not robust.
13. $P(Y \mid w, X) = \dfrac{1}{1 + e^{-(wX + w_0)}}$, alongside the threshold rule $Y = I(X > 10)$
14. Logistic (sigmoidal) function
15. Extend to d dimensions
    - $P(Y \mid w, X) = \dfrac{1}{1 + e^{-(w_1 X_1 + w_2 X_2 + \dots + w_d X_d + w_0)}}$
    - $P(Y \mid w, X) = \dfrac{1}{1 + e^{-(w \cdot X + w_0)}}$
16. Logistic regression
    - Model parameter: w
    - $P(Y = 1 \mid w, X) = \dfrac{1}{1 + e^{-(wX + w_0)}}$
    - Example: given X = 0.9 and w = 1.2, wX = 1.08 and P(Y=1|X=0.9) = 0.7465 ≈ 0.75. Toss a coin with p = 3/4.
    - Example: given X = -1.1 and w = 1.2, wX = -1.32 and P(Y=1|X=-1.1) = 0.2108 ≈ 0.2. Toss a coin with p = 1/5.
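
The two worked examples on slide 16 can be reproduced with a short Python sketch. The function names are illustrative, and the intercept `w0` is assumed to be 0 because the slide's arithmetic uses only wX.

```python
import math

def logistic(z: float) -> float:
    """Logistic (sigmoid) function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x: float, w: float, w0: float = 0.0) -> float:
    """P(Y = 1 | x) for a single feature; w0 = 0 is an assumption (see above)."""
    return logistic(w * x + w0)

p_a = predict_proba(0.9, 1.2)    # wX = 1.08
p_b = predict_proba(-1.1, 1.2)   # wX = -1.32
print(round(p_a, 4), round(p_b, 4))
```
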
17. Another view of logistic regression
    - Log odds: $\ln \dfrac{p}{1-p} = wX$
    - $\dfrac{p}{1-p} = e^{wX}$
    - $p = (1-p)\, e^{wX}$
    - $p\,(1 + e^{wX}) = e^{wX}$
    - $p = \dfrac{e^{wX}}{1 + e^{wX}} = \dfrac{1}{e^{-wX} + 1}$
    - Logistic regression is a “linear regression” between the log-odds of an event and the features (X).
18. The anatomy of classification
    1. What is the data (features X, label y)? ✔
    2. What is the model? Model parameterization (w) ✔
    3. Inference: given X and w, predict the label. ✔
    4. Learning: given (X, y) pairs, learn the “best” w. Define “best”: maximize an objective function.
19. Learning: finding the best w
    - Data: $(X_1, Y_1), \dots, (X_n, Y_n)$
    - If $y_i = 1$, maximize $P(y_i = 1 \mid x_i, w)$
    - If $y_i = 0$, maximize $P(y_i = 0 \mid x_i, w)$
    - Maximize the conditional log-likelihood
20. Learning: example (© Carlos Guestrin 2005-2009)
    - Data: (5, 0), (11, 1), (25, 1)
    - $l(w) = \ln P(y=0 \mid x=5, w) + \ln P(y=1 \mid x=11, w) + \ln P(y=1 \mid x=25, w)$
    - $P(Y=1 \mid X, w)$ is a logistic function, and $P(y=1 \mid x) + P(y=0 \mid x) = 1$, so
    - $l(w) = \ln\left(1 - \dfrac{1}{1 + e^{-(5w + w_0)}}\right) + \ln \dfrac{1}{1 + e^{-(11w + w_0)}} + \ln \dfrac{1}{1 + e^{-(25w + w_0)}}$
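
To make the log-likelihood on slide 20 concrete, the sketch below evaluates l(w) for the slide's three data points at two parameter settings. The settings (w = 0, w0 = 0) and (w = 1, w0 = -8) are hypothetical values chosen for illustration, not values from the talk.

```python
import math

def log_sigmoid(z: float) -> float:
    """ln(1 / (1 + e^(-z))), computed in a numerically stable way."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(data, w: float, w0: float) -> float:
    """Sum of ln P(y_i | x_i, w) over (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        z = w * x + w0
        # ln P(y=1|x) = log_sigmoid(z); ln P(y=0|x) = ln(1 - sigmoid(z)) = log_sigmoid(-z)
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

data = [(5, 0), (11, 1), (25, 1)]                 # data from the slide
ll_flat = log_likelihood(data, w=0.0, w0=0.0)     # every probability is 0.5
ll_fit = log_likelihood(data, w=1.0, w0=-8.0)     # hypothetical boundary at x = 8
print(ll_flat, ll_fit)
```

The second setting puts the boundary between x = 5 and x = 11, so it assigns higher probability to all three labels and gets a larger (less negative) log-likelihood.
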
21. Optimization: pick the “best” w
    1. Weka
    2. Matlab: `w = mnrfit(X,Y)`
    3. R: `w <- glm(Y~X, family=binomial(link="logit"))`
    4. IRLS: http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
    5. Implement your own
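
For the “implement your own” option on slide 21, a minimal sketch might look like the following. It uses plain batch gradient ascent on the log-likelihood rather than IRLS, handles a single feature, and runs on hypothetical toy data; the learning rate and step count are arbitrary assumptions.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Batch gradient ascent on the average log-likelihood (one feature plus intercept)."""
    w, w0 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: sum_i (y_i - p_i) * x_i for w, sum_i (y_i - p_i) for w0.
        errors = [y - sigmoid(w * x + w0) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs)) / n
        w0 += lr * sum(errors) / n
    return w, w0

# Hypothetical toy data; the classes overlap, so the maximum-likelihood w is finite.
xs = [-2.0, -1.0, -0.5, -0.2, 0.0, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 1, 0, 1, 1, 1]
w, w0 = fit_logistic(xs, ys)
print(w, w0, sigmoid(w * 2.0 + w0))
```

In practice the library routines listed on the slide are the better choice; this sketch only shows what they are optimizing.
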
22. Decision surface is linear (figure: a linear boundary separating the Y=0 and Y=1 regions, with errors marked)
23. Decision surface is linear: http://www.cs.technion.ac.il/~rani/LocBoost/
24. So far..
    - Logistic regression is a binary classifier (a multinomial version exists).
    - P(Y=1|X,w) is a logistic function.
    - Inference: compute P(Y=1|X,w), and do “rounding”.
    - Parameters are learnt by maximizing the log-likelihood of the data.
    - Decision surface is linear (a kernelized version exists).
25. Improvements in the model
    - Prevent over-fitting: regularization
    - Maximize accuracy directly: SVMs
    - Non-linear decision surface: kernel trick
    - Multi-label data
26. Occam’s razor: the simplest explanation is most likely the correct one.
27. New and improved learning
    - “Best” w == maximize log-likelihood: the Maximum Log-likelihood Estimate (MLE)
    - Small concern … over-fitting
    - If data is linearly separable, w …
28. L2 regularization
    - $\|w\|_2^2 = \sum_i w_i^2$
    - $\max_w \; l(w) - \lambda \|w\|_2^2$
    - Prevents over-fitting
    - “Pushes” parameters towards zero
    - Equivalent to a prior on the parameters: Normal distribution (0 mean, unit covariance)
    - λ: tuning parameter (0.1)
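
The effect of the L2 penalty on slide 28 can be seen on a tiny linearly separable data set. The sketch below is an illustration under assumptions: the data, learning rate, and step counts are hypothetical, there is a single weight and no intercept, and λ = 0.1 is taken from the slide.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(xs, ys, lam, lr=0.1, steps=2000):
    """Gradient ascent on l(w) - lam * w^2 for a single weight w."""
    w = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
        grad -= 2.0 * lam * w          # derivative of the penalty -lam * w^2
        w += lr * grad
    return w

# Hypothetical separable data: every negative x has label 0, every positive x has label 1.
xs = [-1.0, 1.0, 2.0]
ys = [0, 1, 1]

w_mle = fit(xs, ys, lam=0.0)                      # unregularized
w_mle_longer = fit(xs, ys, lam=0.0, steps=20000)  # keeps growing with more steps
w_l2 = fit(xs, ys, lam=0.1)                       # settles at a finite value
print(w_mle, w_mle_longer, w_l2)
```

Without the penalty the weight keeps increasing the longer the optimizer runs, because a steeper sigmoid always fits separable data a little better; with the penalty it converges.
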
29. Patient diagnosis
    - Y = disease
    - X = [age, weight, BP, blood sugar, MRI, genetic tests …]
    - Don’t want all “features” to be relevant.
    - Weight vector w should be “mostly zeros”.
30. L1 regularization
    - $\|w\|_1 = \sum_i |w_i|$
    - $\max_w \; l(w) - \lambda \|w\|_1$
    - Prevents over-fitting
    - “Pushes” parameters to zero
    - Equivalent to a prior on the parameters: Laplace distribution
    - As λ increases, more weights become zero (irrelevant features).
31. L1 v/s L2 example
    - MLE estimate: [11 0.8]
    - L2 estimate: [10 0.6] (shrinkage)
    - L1 estimate: [10.2 0] (sparsity)
    - Mini-conclusion: L2 optimization is fast, L1 tends to be slower. If you have the computational resources, try both (at the same time)!
    - ALWAYS run logistic regression with at least some regularization.
    - Corollary: ALWAYS run logistic regression on features that have been standardized (zero mean, unit variance).
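
The standardization step recommended on slide 31 is simple to write down. The sketch below rescales one feature column to zero mean and unit variance; the function name and the sample values are hypothetical, and it uses the population variance (dividing by n), which is one of two common conventions.

```python
import math

def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n   # population variance
    std = math.sqrt(var)
    if std == 0.0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

ages = [25.0, 40.0, 55.0, 70.0]   # hypothetical raw feature values
z = standardize(ages)
print(z)
```

Standardizing matters with regularization because the penalty treats every weight alike, so features measured on very different scales would otherwise be penalized unevenly.
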
32. So far …
    - Logistic regression: model, inference, learning via maximum likelihood, L1 and L2 regularization
    - Next …. SVMs!
33. Why did we use probability again?
    - Aim: maximize “accuracy”
    - Logistic regression: an indirect method that maximizes the likelihood of the data.
    - A much more direct approach is to maximize accuracy directly: Support Vector Machines (SVMs)
34. Maximize the margin (© 2005-2007 Carlos Guestrin)
35. Geometry review (line $2x_1 + x_2 - 2 = 0$, with Y = 1 on one side and Y = -1 on the other)
    - For a point on the line, (0.5, 1): d = 2*0.5 + 1 - 2 = 0
    - Signed “distance” to the line from $(x_{10}, x_{20})$: $d = 2x_{10} + x_{20} - 2$
36. Geometry review (same line)
    - (1, 2.5): d = 2*1 + 2.5 - 2 = 2.5 > 0
    - y(w.x+b) = 1*2.5 = 2.5 > γ
37. Geometry review (same line)
    - (0.5, 0.5): d = 2*0.5 + 0.5 - 2 = -0.5 < 0
    - y(w.x+b) = y*d = -1 * -0.5 = 0.5
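
The arithmetic in the geometry review (slides 35 to 37) can be reproduced as follows. The function names are illustrative. Note that the slides' signed “distance” is the unnormalized score w.x + b; dividing by the length of w, as `euclidean_distance` does, gives the true geometric distance.

```python
import math

W = (2.0, 1.0)   # coefficients of x1 and x2 in 2*x1 + x2 - 2 = 0
B = -2.0

def signed_score(x):
    """Unnormalized signed 'distance' d = w.x + b used on the slides."""
    return W[0] * x[0] + W[1] * x[1] + B

def functional_margin(x, y):
    """y * (w.x + b): positive when the point is on the side matching its label."""
    return y * signed_score(x)

def euclidean_distance(x):
    """Signed geometric distance to the line: (w.x + b) / ||w||."""
    return signed_score(x) / math.hypot(*W)

print(signed_score((0.5, 1.0)))              # point on the line
print(functional_margin((1.0, 2.5), +1))     # slide 36
print(functional_margin((0.5, 0.5), -1))     # slide 37
print(euclidean_distance((1.0, 2.5)))
```
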
38. Support Vector Machines (© 2005-2007 Carlos Guestrin)
    - Normalized margin: canonical hyperplanes
    - Support vectors are the points touching the margins (labelled x+ and x- in the figure).
39. Slack variables (© 2005-2007 Carlos Guestrin)
    - $w \cdot x = \sum_j w^{(j)} x^{(j)}$
    - SVMs are made robust by adding “slack variables” that allow training error to be non-zero.
    - One for each data point. Slack variable == 0 for correctly classified points.
    - Maximize the margin: $\max \; \gamma - C \sum_i \xi_i$
40. Slack variables
    - $\max \; \gamma - C \sum_i \xi_i$
    - Need to tune C: high C == minimize mis-classifications, low C == maximize margin
41. SVM summary
    - Model: w.x + b > 0 if y = +1, w.x + b < 0 if y = -1
    - Inference: ŷ = sign(w.x+b)
    - Learning: maximize { (margin) - C (slack-variables) }
    - Next … Kernel SVMs
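
The inference rule and the slack variables from the SVM summary can be sketched for a fixed linear separator. Two assumptions are made here that the slides do not spell out: slack is computed with the standard hinge form max(0, 1 - y(w.x + b)) for canonical hyperplanes, and the weights, points, and C are hypothetical. Under this form, slack is zero only for points on the correct side of the margin, so a correctly classified point inside the margin still gets a small positive slack.

```python
def decision_value(w, b, x):
    """w.x + b for vectors given as tuples."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict(w, b, x):
    """Inference: sign(w.x + b); a value of exactly 0 is mapped to -1 here."""
    return 1 if decision_value(w, b, x) > 0 else -1

def slack(w, b, x, y):
    """Hinge-style slack: 0 when y * (w.x + b) >= 1, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

w, b = (2.0, 1.0), -2.0                      # the line from the geometry review
points = [((1.0, 2.5), 1), ((0.5, 0.5), -1), ((1.0, 1.0), -1)]
slacks = [slack(w, b, x, y) for x, y in points]
C = 10.0                                     # hypothetical trade-off parameter
penalty = C * sum(slacks)
print(slacks, penalty, [predict(w, b, x) for x, _ in points])
```

A larger C makes the same slack values cost more, which is why high C pushes the learner toward fewer misclassifications and low C toward a wider margin.
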
42. The kernel trick
    - Why a linear separator? What if the data looks like below? (figure)
    - The kernel trick allows you to use SVMs with non-linear separators.
    - Different kernels: 1. polynomial, 2. Gaussian, 3. exponential
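
A small sketch can show what the kernel trick on slide 42 buys. The code below defines a polynomial and a Gaussian kernel and checks that the degree-2 polynomial kernel equals an ordinary dot product in an explicit quadratic feature space. The homogeneous form (x.z)^d, the bandwidth parametrization, and the sample vectors are assumptions for illustration; libraries often use slightly different conventions.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel (x.z)^degree."""
    return dot(x, z) ** degree

def gaussian_kernel(x, z, bandwidth=1.0):
    """exp(-||x - z||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def phi_degree2(x):
    """Explicit 2-D feature map whose dot product equals the degree-2 kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(x, z)                # computed in the original space
k_explicit = dot(phi_degree2(x), phi_degree2(z))    # computed in the expanded space
print(k_implicit, k_explicit, gaussian_kernel(x, z))
```

The kernel gives the same number as the explicit mapping without ever building the expanded features, which is what lets a linear method such as an SVM draw a non-linear boundary.
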
43. Logistic regression and linear SVM (two panels): error ~ 40% in both cases
44. Kernel SVM with polynomial kernel of degree 2: error = 7%
    - Polynomial kernels of degree 2 and 4 do very well, but degrees 3 and 5 do very badly.
    - The Gaussian kernel has a tuning parameter (bandwidth). Performance depends on picking the right bandwidth.
45. SVMs summary
    - Maximize the margin between positive and negative examples.
    - The kernel trick is widely implemented, allowing a non-linear decision surface.
    - Not probabilistic
    - Software:
        - SVM-light: http://svmlight.joachims.org/
        - LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
        - Weka, Matlab, R
46. Demo: http://www.cs.technion.ac.il/~rani/LocBoost
47. Which to use?
    - Linear SVMs and logistic regression perform very similarly in most cases.
    - Kernelized SVMs work better than linear SVMs (mostly).
    - Kernelized logistic regression is possible, but implementations are not easily available.
48. Recommendations
    1. First, try logistic regression. Easy, fast, stable. No “tuning” parameters.
    2. Equivalently, you can first try linear SVMs, but you need to tune “C”.
    3. If results are “good enough”, stop.
    4. Else try SVMs with Gaussian kernels. Need to tune bandwidth and C by using validation data.

    If you have more time/computational resources, try random forests as well.

    \*\* Recommendations are opinions of the presenter, and not known facts.
49. In conclusion …
    - Logistic Regression
    - Support Vector Machines
    - Other classification approaches …: random forests / decision trees, Naïve Bayes, nearest neighbors, boosting (Adaboost)
50. Thank you. Questions?
51. Kriti Puniyani, Carnegie Mellon University, kriti@cmu.edu
52. Is this athlete doing drugs?
    - X = blood test to detect drugs
    - Y = doped athlete?
    - Two types of errors:
        - Athlete is doped, we predict “NO”: false negative
        - Athlete is NOT doped, we predict “YES”: false positive
    - Penalize false positives more than false negatives
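
The two error types on slide 52, and the idea of penalizing them differently, can be written as a short sketch. The labels, predictions, and cost values below are hypothetical; the only thing taken from the slide is that a false positive should cost more than a false negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (false positives, false negatives) for labels in {0, 1}, where 1 = doped."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

def weighted_cost(y_true, y_pred, fp_cost=5.0, fn_cost=1.0):
    """Total cost when a false positive is penalized more than a false negative."""
    fp, fn = confusion_counts(y_true, y_pred)
    return fp_cost * fp + fn_cost * fn

y_true = [1, 0, 0, 1, 0, 1]   # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1]   # hypothetical classifier output
fp, fn = confusion_counts(y_true, y_pred)
cost = weighted_cost(y_true, y_pred)
print(fp, fn, cost)
```

Plain accuracy would treat the two mistakes here as equally bad; the weighted cost makes the wrongly accused athlete count five times as much as the missed one.
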
53. Outline
    - What is classification?
        - Parameters, data, inference, learning
        - Predicting coin tosses (0-dimensional X)
    - Logistic Regression
        - Predicting “speaker success” (1-dimensional X)
        - Formulation, optimization
        - Decision surface is linear
        - Interpreting coefficients
        - Hypothesis testing
        - Evaluating the performance of the model
        - Why is it called “regression”: log-odds
        - L2 regularization
        - Patient survival (d-dimensional X)
        - L1 regularization
    - Support Vector Machines
        - Linear SVMs + formulation
        - What are “support vectors”
        - The kernel trick
    - Demo: logistic regression v/s SVMs v/s kernel tricks
54. Overfitting: a more serious problem
    - 2x+y-2 = 0, w = [2 1 -2]
    - 4x+2y-4 = 0, w = [4 2 -4]
    - 400x+200y-400 = 0, w = [400 200 -400]
    - Absolutely need L2 regularization
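
Slide 54's three weight vectors all describe the same line, which a short sketch makes visible: scaling w leaves the decision boundary unchanged but pushes the predicted probability toward 1. The test point (1, 2.5) is borrowed from the geometry review; the function names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_positive(w, point):
    """P(Y = 1) at a 2-D point for w = [w_x, w_y, w_0]."""
    x, y = point
    return sigmoid(w[0] * x + w[1] * y + w[2])

point = (1.0, 2.5)   # on the positive side of 2x + y - 2 = 0
weights = ([2, 1, -2], [4, 2, -4], [400, 200, -400])
probs = [prob_positive(w, point) for w in weights]
print(probs)
```

Because larger weights always raise the likelihood of correctly classified points, maximum likelihood alone has no reason to stop growing w on separable data, which is the case for L2 regularization.
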