
ESL 4.4.3-4.5: Logistic Regression (contd.) and Separating Hyperplane

Presentation material for a reading club on The Elements of Statistical Learning by Hastie et al.

The sections covered include:
- Properties of logistic regression by analogy with least squares fitting
- Differences between logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, which forms the basis of the SVM

-------------------------------------------------------------------------

Presentation material (entirely in English) for a reading-club session on The Elements of Statistical Learning (Hastie et al.) in our lab.

My assigned sections cover:
- Properties of logistic regression viewed by analogy with least squares fitting
- Comparison of logistic regression and linear discriminant analysis
- Rosenblatt's perceptron algorithm
- Derivation of the optimal separating hyperplane, which forms the basis of the SVM



  1. ESL 4.4.3-4.5: Logistic Regression (contd.) & Separating Hyperplane. June 8, 2015. Talk by Shinichi TAMURA, Mathematical Informatics Lab @ NAIST.
  2-5. Today's topics: [ ] Logistic regression (contd.): [ ] On the analogy with Least Squares Fitting, [ ] Logistic regression vs. LDA; [ ] Separating Hyperplane: [ ] Rosenblatt's Perceptron, [ ] Optimal Hyperplane.
  6-7. On the analogy with Least Squares Fitting / [Review] Fitting the LR model: parameters are fitted by ML estimation, using the Newton-Raphson algorithm: $\beta^{new} \leftarrow \arg\min_\beta (z - X\beta)^T W (z - X\beta)$, i.e. $\beta^{new} = (X^T W X)^{-1} X^T W z$. It looks like least squares fitting: $\beta \leftarrow \arg\min_\beta (y - X\beta)^T (y - X\beta)$, i.e. $\beta = (X^T X)^{-1} X^T y$.
  8-10. On the analogy with Least Squares Fitting / Self-consistency: $\beta$ depends on $W$ and $z$, while $W$ and $z$ depend on $\beta$: $\beta^{new} \leftarrow \arg\min_\beta (z - X\beta)^T W (z - X\beta)$, $\beta^{new} = (X^T W X)^{-1} X^T W z$, where $z_i = x_i^T\hat\beta + \frac{y_i - \hat p_i}{\hat p_i(1 - \hat p_i)}$ and $w_i = \hat p_i(1 - \hat p_i)$. This is a "self-consistent" equation and needs an iterative method to solve. (A short code sketch follows below.)
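To make the iteration concrete, here is a minimal NumPy sketch (not from the slides) of the Newton-Raphson / IRLS step written in exactly the weighted-least-squares form above; the toy data, tolerance, and iteration cap are arbitrary choices of mine.

```python
# Minimal IRLS sketch for two-class logistic regression (X includes an intercept column).
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
        W = p * (1.0 - p)                            # diagonal of W (stored as a vector)
        z = X @ beta + (y - p) / W                   # adjusted response
        # beta_new = (X^T W X)^{-1} X^T W z, i.e. a weighted least-squares fit to z
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data (made up): one feature plus an intercept.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(irls_logistic(X, y))      # should land near the generating values [0.5, 2.0]
```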
  11-12. On the analogy with Least Squares Fitting / Meaning of Weighted RSS (1): the RSS, $\sum_{i=1}^N (y_i - \hat p_i)^2$, is used to check the goodness of fit in least squares fitting. How about the weighted RSS, $\sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}$, in logistic regression?
  13-15. On the analogy with Least Squares Fitting / Meaning of Weighted RSS (2): the weighted RSS can be interpreted as Pearson's chi-squared statistic: $\chi^2 = \sum_{i=1}^N \left[\frac{(y_i - \hat p_i)^2}{\hat p_i} + \frac{(y_i - \hat p_i)^2}{1 - \hat p_i}\right] = \sum_{i=1}^N \frac{(1 - \hat p_i + \hat p_i)(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)} = \sum_{i=1}^N \frac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}$.
  16-22. On the analogy with Least Squares Fitting / Meaning of Weighted RSS (3): ...or as a quadratic approximation of the deviance. The first sum below is the maximized log-likelihood of the model and the second is the log-likelihood of the saturated (full) model, which achieves a perfect fit; the approximation uses the Taylor expansion $a\log\frac{a}{x} = (a-x) + \frac{(a-x)^2}{2x} - \frac{(a-x)^3}{6x^2} + \cdots$: $D = -2\left\{\sum_{i=1}^N [y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)] - \sum_{i=1}^N [y_i\log y_i + (1-y_i)\log(1-y_i)]\right\} = 2\sum_{i=1}^N\left[y_i\log\frac{y_i}{\hat p_i} + (1-y_i)\log\frac{1-y_i}{1-\hat p_i}\right] \approx 2\sum_{i=1}^N\left[(y_i-\hat p_i) + \frac{(y_i-\hat p_i)^2}{2\hat p_i} + \{(1-y_i)-(1-\hat p_i)\} + \frac{\{(1-y_i)-(1-\hat p_i)\}^2}{2(1-\hat p_i)}\right] = \sum_{i=1}^N\left[\frac{(y_i-\hat p_i)^2}{\hat p_i} + \frac{(y_i-\hat p_i)^2}{1-\hat p_i}\right] = \sum_{i=1}^N\frac{(y_i-\hat p_i)^2}{\hat p_i(1-\hat p_i)}$. (A numerical comparison of the weighted RSS, $\chi^2$, and deviance follows below.)
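The three quantities (weighted RSS, Pearson's chi-squared, and the deviance) can be checked numerically. This continuation is hypothetical and reuses X, y, and irls_logistic from the IRLS sketch above, so it is not standalone:

```python
# Compare weighted RSS, Pearson chi-squared, and deviance for a fitted logistic model.
# Assumes X, y, and irls_logistic() from the previous sketch are already in scope.
import numpy as np

beta_hat = irls_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))

weighted_rss = np.sum((y - p_hat) ** 2 / (p_hat * (1.0 - p_hat)))
pearson_chi2 = np.sum((y - p_hat) ** 2 / p_hat + (y - p_hat) ** 2 / (1.0 - p_hat))
# Deviance; for y in {0, 1} the saturated-model term y*log(y) + (1-y)*log(1-y) is 0.
eps = 1e-12
deviance = -2.0 * np.sum(y * np.log(p_hat + eps) + (1.0 - y) * np.log(1.0 - p_hat + eps))

# Weighted RSS and chi-squared agree exactly; the deviance agrees only approximately.
print(weighted_rss, pearson_chi2, deviance)
```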
  23-24. On the analogy with Least Squares Fitting / Asymptotic distribution of $\hat\beta$: the distribution of $\hat\beta$ converges to $N(\beta, (X^TWX)^{-1})$. (See hand-out for the details.) Sketch: $y_i \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Bern}(\Pr(x_i;\beta))$, so $E[y] = p$ and $\mathrm{var}[y] = W$; therefore $E[\hat\beta] = E[(X^TWX)^{-1}X^TWz] = (X^TWX)^{-1}X^TW\,E[X\beta + W^{-1}(y - p)] = (X^TWX)^{-1}X^TWX\beta = \beta$, and $\mathrm{var}[\hat\beta] = (X^TWX)^{-1}X^TW\,\mathrm{var}[X\beta + W^{-1}(y - p)]\,WX(X^TWX)^{-1} = (X^TWX)^{-1}X^TW(W^{-1}WW^{-1})WX(X^TWX)^{-1} = (X^TWX)^{-1}$. (A short sketch of the resulting standard errors follows below.)
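Continuing the same hypothetical fit, the asymptotic covariance $(X^TWX)^{-1}$ yields the usual standard errors and Wald z-statistics (again reusing X, beta_hat, and p_hat from the sketches above):

```python
# Asymptotic covariance (X^T W X)^{-1} of beta_hat, and per-coefficient Wald z-statistics.
# Assumes X, beta_hat, p_hat from the previous sketches are already in scope.
import numpy as np

W = p_hat * (1.0 - p_hat)
cov_beta = np.linalg.inv(X.T @ (W[:, None] * X))   # (X^T W X)^{-1}
se = np.sqrt(np.diag(cov_beta))
z = beta_hat / se                                  # Wald z-statistic per coefficient
print("beta:", beta_hat, "SE:", se, "z:", z)
```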
  25-27. On the analogy with Least Squares Fitting / Tests of models for LR: once a model is obtained, the Wald test (based on the difference of the parameter estimate from zero) or Rao's score test (based on the gradient of the log-likelihood) can be used to decide which term to drop or add; neither requires re-running IRLS. Figure from: "Statistics 111: Introduction to Theoretical Statistics" lecture notes by Kevin Andrew Rader, Harvard College GSAS, http://isites.harvard.edu/icb/icb.do?keyword=k101665&pageid=icb.page651024
  28-30. On the analogy with Least Squares Fitting / L1-regularized LR (1): just as in the lasso, an L1 regularizer is effective for LR. The objective function becomes $\max_{\beta_0,\beta}\left\{\sum_{i=1}^N \log \Pr(g_i \mid x_i; \beta_0,\beta) - \lambda\|\beta\|_1\right\} = \max_{\beta_0,\beta}\left\{\sum_{i=1}^N\left[y_i(\beta_0 + \beta^Tx_i) - \log(1 + e^{\beta_0 + \beta^Tx_i})\right] - \lambda\sum_{j=1}^p |\beta_j|\right\}$. The resulting algorithm can be called an "iteratively reweighted lasso" algorithm.
  31-34. On the analogy with Least Squares Fitting / L1-regularized LR (2): setting the gradient to zero, we get the same score equation as in the lasso: $\frac{\partial}{\partial\beta_j}\left\{\sum_{i=1}^N\left[y_i(\beta_0 + \beta^Tx_i) - \log(1 + e^{\beta_0+\beta^Tx_i})\right] - \lambda\sum_{j=1}^p|\beta_j|\right\} = 0 \;\Rightarrow\; \sum_{i=1}^N x_{ij}\left(y_i - \frac{e^{\beta_0+\beta^Tx_i}}{1 + e^{\beta_0+\beta^Tx_i}}\right) - \lambda\cdot\mathrm{sign}(\beta_j) = 0 \;\Rightarrow\; x_j^T(y - p) = \lambda\cdot\mathrm{sign}(\beta_j)$ (where $\beta_j \neq 0$). Compare the score equation of the lasso: $x_j^T(y - X\beta) = \lambda\cdot\mathrm{sign}(\beta_j)$.
  35-36. On the analogy with Least Squares Fitting / L1-regularized LR (3): since the objective function is concave, the solution can be obtained with standard convex-optimization techniques. However, the coefficient profiles are not piecewise linear, so the full regularization path is hard to compute; predictor-corrector methods for convex optimization or coordinate-descent algorithms work in some situations. (A short code sketch follows below.)
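The score equation above is exactly the lasso stationarity condition with $y - p$ in place of the least-squares residual. Below is a minimal proximal-gradient (ISTA-style) sketch for the L1-penalized objective; it is a generic solver written for illustration, not the predictor-corrector or coordinate-descent algorithms mentioned on the slides, and the step size and lambda are arbitrary choices:

```python
# Minimal proximal-gradient sketch for L1-regularized logistic regression
# (the intercept beta0 is left unpenalized). Step size and lambda are arbitrary.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(X, y, lam=5.0, step=1e-3, n_iter=5000):
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        grad_beta = -X.T @ (y - p_hat)        # gradient of the negative log-likelihood
        grad_beta0 = -np.sum(y - p_hat)
        beta = soft_threshold(beta - step * grad_beta, step * lam)   # prox of lambda*||.||_1
        beta0 = beta0 - step * grad_beta0
    return beta0, beta

# Toy data (made up): 5 features, only the first two truly active.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
true_beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
beta0, beta = l1_logistic(X, y)
print(beta0, np.round(beta, 3))   # the three inactive coefficients should shrink toward 0
```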
  37. On the analogy with Least Squares Fitting / Summary: LR is analogous to least squares fitting, $\beta^{new} = (X^TWX)^{-1}X^TWz \leftrightarrow \beta = (X^TX)^{-1}X^Ty$, and: LR requires an iterative algorithm because of the self-consistency; the weighted RSS can be seen as a chi-squared statistic or as the deviance; the distribution of $\hat\beta$ converges to $N(\beta, (X^TWX)^{-1})$; Rao's score test or the Wald test is useful for model selection; L1-regularized LR is analogous to the lasso except for the non-linearity.
  38-39. Today's topics: [ ] Logistic regression (contd.): [x] On the analogy with Least Squares Fitting, [ ] Logistic regression vs. LDA; [ ] Separating Hyperplane: [ ] Rosenblatt's Perceptron, [ ] Optimal Hyperplane.
  40. Logistic regression vs. LDA / What is the difference? LDA and logistic regression are very similar methods; let us study their characteristics through the differences in their formal aspects.
  41-44. Logistic regression vs. LDA / Form of the log-odds: LDA: $\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \log\frac{\pi_k}{\pi_K} - \frac12(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K) + x^T\Sigma^{-1}(\mu_k-\mu_K) = \alpha_{k0} + \alpha_k^Tx$. Logistic regression: $\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^Tx$. Both have the same (linear) form.
  45-48. Logistic regression vs. LDA / Criteria of estimation: LDA maximizes the full (joint) log-likelihood, $\max\sum_{i=1}^N \log\Pr(G=g_i, X=x_i) = \max\sum_{i=1}^N\left[\log\Pr(G=g_i\mid X=x_i) + \log\Pr(X=x_i)\right]$, where the second term is the marginal likelihood; logistic regression maximizes only the conditional log-likelihood, $\max\sum_{i=1}^N \log\Pr(G=g_i\mid X=x_i)$.
  49-52. Logistic regression vs. LDA / Form of Pr(X): LDA: $\Pr(X) = \sum_{k=1}^K \pi_k\,\phi(X;\mu_k,\Sigma)$, a Gaussian mixture that involves the parameters; logistic regression: Pr(X) is left arbitrary.
  53. Logistic regression vs. LDA / Effects of the difference (1): how do these formal differences affect the character of the algorithms?
  54-56. Logistic regression vs. LDA / Effects of the difference (2): the assumption of Gaussian, homoscedastic class densities is a strong constraint, which leads to lower variance. In addition, LDA has the advantage that it can make use of unlabelled observations, i.e. semi-supervised learning is possible. On the other hand, LDA can be affected by outliers.
  57-60. Logistic regression vs. LDA / Effects of the difference (3): with linearly separable data, the LDA coefficients are well defined, but training errors may occur; the LR coefficients can diverge to infinity, but a true separating hyperplane can be found. (Do not think too much about training error; what matters is generalization error.)
  61-63. Logistic regression vs. LDA / Effects of the difference (4): the assumptions behind LDA rarely hold in practice. Nevertheless, it is known empirically that the two models give quite similar results, even when LDA is used inappropriately, say with qualitative variables. In the end, if the Gaussian assumption looks plausible, use LDA; otherwise, use logistic regression. (A small simulation comparing the two follows below.)
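As an informal illustration of the "quite similar results" claim, a small simulation (my own, not from the slides) fitting both models with scikit-learn on made-up two-class Gaussian data with a shared covariance, i.e. the setting where LDA's model is correct:

```python
# Compare LDA and (nearly unpenalized) logistic regression on simulated Gaussian data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
cov = [[1.0, 0.3], [0.3, 1.0]]                       # shared covariance (hypothetical)
X0 = rng.multivariate_normal([0.0, 0.0], cov, n)     # class 0
X1 = rng.multivariate_normal([1.5, 1.0], cov, n)     # class 1
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

lda = LinearDiscriminantAnalysis().fit(X, y)
lr = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)   # large C ~ no regularization

# The two linear decision rules should be close, and agree on almost all points.
print("LDA coef:", lda.coef_[0], "intercept:", lda.intercept_[0])
print("LR  coef:", lr.coef_[0],  "intercept:", lr.intercept_[0])
print("agreement on training data:", np.mean(lda.predict(X) == lr.predict(X)))
```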
  64-66. Today's topics: [x] Logistic regression (contd.): [x] On the analogy with Least Squares Fitting, [x] Logistic regression vs. LDA; [ ] Separating Hyperplane: [ ] Rosenblatt's Perceptron, [ ] Optimal Hyperplane.
  67-68. Separating Hyperplane: Overview / Another way of classification: both LDA and LR classify through probabilities estimated with regression-type models. Classification can also be done in a more explicit way: by modelling the decision boundary directly.
  69. Separating Hyperplane: Overview / Properties of vector algebra: let $L$ be the affine set defined by $\beta_0 + \beta^Tx = 0$. The signed distance from a point $x$ to $L$ is $d_\pm(x, L) = \frac{1}{\|\beta\|}(\beta^Tx + \beta_0)$, and $\beta_0 + \beta^Tx > 0 \Leftrightarrow x$ is above $L$; $\beta_0 + \beta^Tx = 0 \Leftrightarrow x$ is on $L$; $\beta_0 + \beta^Tx < 0 \Leftrightarrow x$ is below $L$. (A tiny code sketch follows below.)
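A tiny sketch of the signed-distance formula, with a hypothetical hyperplane and points:

```python
# Signed distance d±(x, L) = (beta.x + beta0) / ||beta|| for the hyperplane beta0 + beta.x = 0.
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance from point x to the hyperplane {x : beta0 + beta.x = 0}."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

beta, beta0 = np.array([3.0, 4.0]), -5.0          # hyperplane 3*x1 + 4*x2 - 5 = 0 (made up)
for x in [np.array([2.0, 1.0]), np.array([0.0, 0.0])]:
    d = signed_distance(x, beta, beta0)
    side = "above" if d > 0 else ("on" if d == 0 else "below")
    print(x, "->", round(d, 3), side)             # [2, 1] is above L, the origin is below L
```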
  70-71. Today's topics: [x] Logistic regression (contd.): [x] On the analogy with Least Squares Fitting, [x] Logistic regression vs. LDA; up next: Separating Hyperplane: Rosenblatt's Perceptron and Optimal Hyperplane.
  72-76. Rosenblatt's Perceptron / Learning criterion: the basic criterion of Rosenblatt's perceptron learning algorithm is to reduce ($M$ being the set of misclassified points) $D(\beta, \beta_0) = \sum_{i\in M}|x_i^T\beta + \beta_0| \propto \sum_{i\in M}|d_\pm(x_i, L)| = -\sum_{i\in M} y_i(x_i^T\beta + \beta_0)$. If $y_i = +1$ is misclassified as $-1$, then $x_i^T\beta + \beta_0$ is negative; if $y_i = -1$ is misclassified as $+1$, it is positive; in either case $-y_i(x_i^T\beta + \beta_0) > 0$, so every misclassified point contributes positively to $D$.
  77-80. Rosenblatt's Perceptron / Learning algorithm (1): instead of reducing $D$ by batch learning, a "stochastic" gradient-descent algorithm is adopted: the coefficients are updated for each misclassified observation, as in online learning. (Observations that are classified correctly do not affect the parameters, so the algorithm is robust to outliers.) Thus the coefficients are updated based not on $D$ but on the single-observation $D_i(\beta, \beta_0) = -y_i(x_i^T\beta + \beta_0)$.
  81-85. Rosenblatt's Perceptron / Learning algorithm (2): the algorithm proceeds as follows: 1. take one observation $x_i$ and classify it; 2. if the classification was wrong, update the coefficients using $\frac{\partial D_i(\beta,\beta_0)}{\partial\beta} = -y_ix_i$ and $\frac{\partial D_i(\beta,\beta_0)}{\partial\beta_0} = -y_i$, i.e. $\beta \leftarrow \beta + \rho\, y_i x_i$, $\beta_0 \leftarrow \beta_0 + \rho\, y_i$, where the learning rate $\rho$ can be set to 1 without loss of generality.
  86-87. Rosenblatt's Perceptron / Learning algorithm (3): updating the parameters may cause misclassification of other, previously correctly-classified observations; therefore, although each update reduces its own $D_i$, it can increase the total $D$. (A code sketch of the full procedure follows below.)
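A minimal sketch of the whole perceptron procedure as described above; the toy data, learning rate, and epoch cap are my own choices:

```python
# Rosenblatt's perceptron: stochastic updates on misclassified points only.
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=100):
    """X: (N, p) inputs, y: labels in {-1, +1}. Returns (beta, beta0)."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:      # misclassified (or on the boundary)
                beta += rho * yi * xi              # beta  <- beta  + rho * y_i * x_i
                beta0 += rho * yi                  # beta0 <- beta0 + rho * y_i
                errors += 1
        if errors == 0:                            # a full pass with no mistakes: done
            break
    return beta, beta0

# Toy linearly separable data (hypothetical).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
beta, beta0 = perceptron(X, y)
print("beta =", beta, "beta0 =", beta0,
      "training errors:", int(np.sum(y * (X @ beta + beta0) <= 0)))
```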
  88-91. Rosenblatt's Perceptron / Convergence theorem: if the data are linearly separable, perceptron learning terminates in a finite number of steps; otherwise it never terminates. However, in practice it is difficult to tell whether the data are not linearly separable (so the algorithm will never converge) or linearly separable but simply slow to converge. In addition, the solution is not unique: it depends on the initial values and on the order of the data.
  92-93. Today's topics: [x] Logistic regression (contd.): [x] On the analogy with Least Squares Fitting, [x] Logistic regression vs. LDA; [ ] Separating Hyperplane: [x] Rosenblatt's Perceptron, [ ] Optimal Hyperplane.
  94-95. Optimal Hyperplane / Derivation of the KKT conditions (1): this section could be hard for some of the audience, so to make the story a bit clearer, let us first study optimization in general. The theme: duality and the KKT conditions for an optimization problem.
  96. Optimal Hyperplane / Derivation of the KKT conditions (2): suppose we have an optimization problem: minimize $f(x)$ subject to $g_i(x) \le 0$, and let the feasible region be $C = \{x \mid g_i(x) \le 0\}$.
  97-98. Optimal Hyperplane / Derivation of the KKT conditions (3): in optimization, relaxation is a technique often used to make a problem easier. Lagrangian relaxation, shown below, is one example: minimize $L(x, y) = f(x) + \sum_i y_ig_i(x)$ subject to $y \ge 0$.
  99-101. Optimal Hyperplane / Derivation of the KKT conditions (4): for $L(x, y)$ the following holds: $\min_{x\in C} f(x) = \min_x \sup_{y\ge 0} L(x, y) \ge \max_{y\ge 0}\inf_x L(x, y)$, where attaining the inner supremum requires $y_i$ or $g_i(x)$ to be zero for every $i$ (this condition is called "complementary slackness"). According to the inequality, maximizing $\inf_x L(x, y)$ gives a lower bound for the original problem.
  102-105. Optimal Hyperplane / Derivation of the KKT conditions (5): therefore we have the following maximization problem: maximize $L(x, y)$ subject to $\frac{\partial}{\partial x}L(x, y) = 0$ (the condition to achieve $\inf_x L(x, y)$) and $y \ge 0$. This is called the "Wolfe dual problem", and strong duality says the solutions of the primal and dual problems coincide.
  106-111. Optimal Hyperplane / Derivation of the KKT conditions (6): an optimal solution must therefore satisfy all the conditions gathered so far, which together are called the "KKT conditions": $g_i(x) \le 0$ (primal constraint); $\frac{\partial}{\partial x}L(x, y) = 0$ (stationarity); $y \ge 0$ (dual constraint); $y_ig_i(x) = 0$ (complementary slackness). (A tiny worked example follows below.)
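As a sanity check (not from the slides), here is a one-variable problem worked through the KKT conditions: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$.

```latex
\[
L(x, y) = x^2 + y(1 - x), \qquad y \ge 0.
\]
\[
\text{Stationarity: } \frac{\partial L}{\partial x} = 2x - y = 0
\quad\Rightarrow\quad y = 2x.
\]
\[
\text{Complementary slackness: } y(1 - x) = 0.
\]
\[
\text{If } y = 0 \text{ then } x = 0, \text{ which violates } 1 - x \le 0;
\text{ hence } x = 1,\; y = 2, \text{ and the constrained minimum is } f(1) = 1.
\]
```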
  112. Optimal Hyperplane / KKT for the optimal hyperplane (1): we have learned the KKT conditions; now let us get back to the original problem: finding the optimal separating hyperplane.
  113-114. Optimal Hyperplane / KKT for the optimal hyperplane (2): the original fitting criterion of the optimal hyperplane is a generalization of the perceptron: $\max_{\beta,\beta_0} M$ subject to $\|\beta\| = 1$ and $y_i(x_i^T\beta + \beta_0) \ge M$ $(i = 1,\dots,N)$. (The criterion of maximizing the margin can be supported theoretically without distributional assumptions.)
  115-117. Optimal Hyperplane / KKT for the optimal hyperplane (3): this is a kind of mini-max problem and hard to solve directly, so convert it into an easier one: $\min_{\beta,\beta_0} \frac12\|\beta\|^2$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$ $(i = 1,\dots,N)$. (See the hand-out for the detailed transformation.) This is a quadratic programming problem. (A small numerical sketch follows below.)
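A minimal numerical sketch (not from the slides) of this primal quadratic program, solved with SciPy's general-purpose SLSQP solver on made-up separable data; real SVM solvers use specialized QP algorithms, so this is only for illustration:

```python
# Primal QP: minimize (1/2)||beta||^2  s.t.  y_i (x_i . beta + beta0) >= 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]

def objective(w):                      # w = (beta_1, beta_2, beta0)
    return 0.5 * np.dot(w[:2], w[:2])

constraints = [{"type": "ineq",
                "fun": lambda w, xi=xi, yi=yi: yi * (xi @ w[:2] + w[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
beta, beta0 = res.x[:2], res.x[2]
margin = 1.0 / np.linalg.norm(beta)    # after the rescaling, the margin M equals 1/||beta||
print("beta =", beta, "beta0 =", beta0, "margin =", margin)
```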
  118. Optimal Hyperplane / KKT for the optimal hyperplane (4): to make use of the KKT conditions, turn the objective into a Lagrangian function: $L_P = \frac12\|\beta\|^2 - \sum_{i=1}^N \alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right]$.
  119-120. Optimal Hyperplane / KKT for the optimal hyperplane (5): thus the KKT conditions are: $y_i(x_i^T\beta + \beta_0) \ge 1$ $(i = 1,\dots,N)$; $\beta = \sum_{i=1}^N \alpha_iy_ix_i$; $0 = \sum_{i=1}^N \alpha_iy_i$; $\alpha_i \ge 0$ $(i = 1,\dots,N)$; $\alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right] = 0$ $(i = 1,\dots,N)$. The solution is obtained by solving these.
  121-125. Optimal Hyperplane / Support points (1): the KKT conditions tell us that $\alpha_i > 0 \Leftrightarrow y_i(x_i^T\beta + \beta_0) = 1 \Leftrightarrow x_i$ is on the edge of the slab, and $\alpha_i = 0 \Leftrightarrow y_i(x_i^T\beta + \beta_0) > 1 \Leftrightarrow x_i$ is off the edge of the slab. The points on the edge of the slab are called "support points" (or "support vectors").
  126. Optimal Hyperplane / Support points (2): $\beta$ can be written as a linear combination of the support points: $\beta = \sum_{i=1}^N \alpha_iy_ix_i = \sum_{i\in S}\alpha_iy_ix_i$, where $S$ is the index set of the support points.
  127-128. Optimal Hyperplane / Support points (3): $\beta_0$ can be obtained after $\beta$: for $i \in S$, $y_i(x_i^T\beta + \beta_0) = 1$, so $\beta_0 = 1/y_i - \beta^Tx_i = y_i - \sum_{j\in S}\alpha_jy_jx_j^Tx_i$ (note $1/y_i = y_i$ since $y_i \in \{-1,+1\}$), and in practice $\beta_0 = \frac{1}{|S|}\sum_{i\in S}\left(y_i - \sum_{j\in S}\alpha_jy_jx_j^Tx_i\right)$, averaging over the support points to avoid numerical error.
  129-130. Optimal Hyperplane / Support points (4): all the coefficients are defined only through the support points, $\beta = \sum_{i\in S}\alpha_iy_ix_i$, $\beta_0 = \frac{1}{|S|}\sum_{i\in S}\left(y_i - \sum_{j\in S}\alpha_jy_jx_j^Tx_i\right)$, so the solution is robust to outliers. However, do not forget that which points become support points is determined using all of the data. (A sketch of the dual problem and the support-point expansion follows below.)
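A minimal sketch (not from the slides) of the Wolfe dual of the optimal-hyperplane problem, solved with SciPy's SLSQP, then recovering beta and beta0 from the support points exactly as above; the toy data reuses the made-up separable clusters from the primal sketch:

```python
# Wolfe dual: maximize sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i.x_j
#             s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]
N = len(y)
K = (X @ X.T) * np.outer(y, y)                      # Gram matrix weighted by the labels

def neg_dual(alpha):                                # minimize the negative dual objective
    return 0.5 * alpha @ K @ alpha - alpha.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]     # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                          # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP", bounds=bounds, constraints=cons)
alpha = res.x
S = np.where(alpha > 1e-6)[0]                       # indices of the support points
beta = (alpha[S] * y[S]) @ X[S]                     # beta = sum_{i in S} alpha_i y_i x_i
beta0 = np.mean(y[S] - X[S] @ beta)                 # averaged over the support points
print("support points:", S, "beta =", beta, "beta0 =", beta0)
```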
  131-133. Today's topics: [x] Logistic regression (contd.): [x] On the analogy with Least Squares Fitting, [x] Logistic regression vs. LDA; [x] Separating Hyperplane: [x] Rosenblatt's Perceptron, [x] Optimal Hyperplane. All topics covered.
  134. Summary (LDA / Logistic Regression / Perceptron / Optimal Hyperplane):
      - With linearly separable data: LDA: training error may occur; LR: true separator found, but coefficients may be infinite; Perceptron: true separator found, but not unique; Optimal hyperplane: best separator found.
      - With non-linearly separable data: LDA: works well; LR: works well; Perceptron: algorithm never stops; Optimal hyperplane: not feasible.
      - With outliers: LDA: not robust; LR: robust; Perceptron: robust; Optimal hyperplane: robust.
