Additive model and boosting tree

This is the fourth slide deck for the machine learning workshop at Hulu. It first summarizes machine learning methods and then introduces boosting trees. Boosting trees are recommended when the number of features is not too large (<1000).

Published in: Technology, Education
1. Machine Learning Workshop
guodong@hulu.com
• Machine learning introduction
• Logistic regression
• Feature selection
• Additive Model and Boosting Tree
See more machine learning posts: http://dongguo.me
2. Machine learning problem
• Goal of a machine learning problem
  – Based on observed samples, find a prediction function (mapping the input variable space to the response value space) that has prediction ability on unseen samples
• Minimize risk
  $R_{exp}(f) = E_P[L(Y, f(X))] = \int L(y, f(x)) \, P(x, y) \, dx \, dy$
  $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
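As a minimal sketch of the empirical risk formula above, the following averages a loss over observed samples. The toy samples, the squared loss, and the predictor f(x) = 2x are assumptions made up for illustration; they are not from the slides.

```python
def empirical_risk(loss, predict, samples):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i)) over (x, y) samples."""
    return sum(loss(y, predict(x)) for x, y in samples) / len(samples)


def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


# Toy (x, y) pairs and a hypothetical predictor f(x) = 2x.
samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.0)]
risk = empirical_risk(squared_loss, lambda x: 2.0 * x, samples)
print(risk)
```

The expected risk cannot be computed this way because P(x, y) is unknown; the empirical risk is its sample-based stand-in.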
3. Components of a machine learning ‘algorithm’
• ML = Representation + Strategy + Optimization
  – Representation: turn the function optimization problem into a parameter optimization problem by choosing a family (hypothesis space) for the prediction function
  – Strategy: define a loss function to evaluate the error between the predicted value and the response value
  – Optimization: search for an optimal prediction function by minimizing the loss
4. Representation
• Determine the hypothesis space of the prediction function by choosing a ‘model’
  – E.g. linear model, multi-level linear model, trees, Bayesian network, additive model and so on
  – Need to balance expressive ability against generalization ability
• Choose the model with the following factors considered
  – About the learning problem
    • Difficulty of the learning problem
    • Which models have been used successfully on similar learning problems
  – About the data
    • Number of samples that can be observed; number of features; interactions between features; outliers in the data
  – Specific requirements
    • Interpretability, computational/storage cost
5. Strategy
• Distinguish good classifiers from bad ones in the hypothesis space by defining a loss function
  $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \text{regularization}$
• Typical loss functions
  – For classification: 0-1 loss, logarithmic loss, binomial deviance loss, exponential loss, hinge loss
  – For regression: quadratic loss, absolute loss, Huber loss
6. Logarithmic loss function
• Loss function
  $L(Y, P(Y \mid X)) = -\log P(Y \mid X)$
  – Binomial logarithmic loss function
  $L(Y, P(Y \mid X)) = -y \log P(y = 1 \mid X) - (1 - y) \log P(y = 0 \mid X)$
• Minimizing the logarithmic loss = maximum likelihood estimation
7. 3 typical loss functions for classification
• Binomial deviance loss function
  $L(y, f(x)) = \log[1 + \exp(-y f(x))]$
• Exponential loss function
  $L(y, f(x)) = \exp(-y f(x))$
• Hinge loss function
  $L(y, f(x)) = [1 - y f(x)]_+$
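The three losses above are all functions of the margin y·f(x) with y in {−1, +1}. A small sketch, with function names chosen here for illustration, makes it easy to compare how each one penalizes a negative margin:

```python
import math


def binomial_deviance(y, f):
    """log(1 + exp(-y * f)), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * f))


def exponential_loss(y, f):
    """exp(-y * f): grows exponentially for badly misclassified samples."""
    return math.exp(-y * f)


def hinge_loss(y, f):
    """[1 - y * f]_+ : zero once the margin is at least 1."""
    return max(0.0, 1.0 - y * f)


for f in (-2.0, 0.0, 2.0):
    print(f, binomial_deviance(1, f), exponential_loss(1, f), hinge_loss(1, f))
```

For a large negative margin the exponential loss is far larger than the other two, which is one reason it is more sensitive to noisy labels.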
8. Loss functions for classification
[Figure from “Elements of Statistical Learning”]
9. Loss functions for regression
[Figure from “Elements of Statistical Learning”]
10. Optimization
• Nothing to share this time
11. Components of typical algorithms
“Model” | Representation | Strategy | Optimization
Polynomial regression | Polynomial function | Squared loss usually | Has closed-form solution
Linear regression | Linear model of variables | Squared loss usually | Has closed-form solution
LR | Linear function + logit link | Logarithmic loss | Gradient descent, Newton method
ANN | Multi-level linear function + logit link | Squared loss usually | Gradient descent
SVM | Linear function | Hinge loss | Quadratic programming (SMO)
HMM | Bayes network | Logarithmic loss | EM
Adaboost | Additive model | Exponential loss | Stagewise + optimize base learner
12. Boosting Tree
• Additive model and forward stagewise algorithm
• Boosting tree
• Adaboost
• Gradient boosting tree
13. Additive model
• Linear combination of base predictors
  $f(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)$
• Determine f(x)
  $\min_{\beta_m, r_m} \sum_{i=1}^{N} L\left(y_i, \sum_{m=1}^{M} \beta_m b(x_i; r_m)\right)$
  – This is difficult to solve for a general loss function and base learner
14. Forward Stagewise Additive Modeling
• Idea: solve the problem approximately by learning the base functions one by one
  (1) $f_0(x) = 0$
  (2) For $m = 1, 2, \ldots, M$:
      (a) $(\beta_m, r_m) = \arg\min_{\beta, r} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta b(x_i; r))$
      (b) $f_m(x) = f_{m-1}(x) + \beta_m b(x; r_m)$
  (3) $f(x) = f_M(x) = \sum_{m=1}^{M} \beta_m b(x; r_m)$
15. Boosting tree
• Boosting tree = forward stagewise additive modeling with a decision tree as the base learner
  $f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$
  $\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m))$
• Different loss functions give different implementations of boosting tree
• Can be used for both regression and classification
16. Boosting tree for regression
• When the quadratic loss function is chosen
  Input: training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x_i \in R^n$, $y_i \in R$
  Output: boosting tree for regression $f_M(x)$
  1. Init $f_0(x) = 0$
  2. For $m = 1$ to $M$:
     (a) Compute residuals $r_{mi} = y_i - f_{m-1}(x_i)$, $i = 1, 2, \ldots, N$
     (b) Learn a regression tree $T(x; \Theta_m)$ by fitting $r_{mi}$
     (c) Update $f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$
  3. Get the final regression boosting tree
     $f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$
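The residual-fitting loop above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the workshop's implementation: the base learner is a depth-1 regression tree (a stump) on a single numeric feature, the names `fit_stump`, `boost_regression` and `predict` are made up here, and the data is a toy set with at least two distinct x values.

```python
def fit_stump(xs, targets):
    """Least-squares stump on one feature: returns (threshold, left_value, right_value)."""
    best = None
    for s in sorted(set(xs))[:-1]:  # samples with x <= s go left
        left = [t for x, t in zip(xs, targets) if x <= s]
        right = [t for x, t in zip(xs, targets) if x > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - c1) ** 2 for t in left) + sum((t - c2) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, s, c1, c2)
    return best[1:]


def boost_regression(xs, ys, M):
    """Steps 1-3: start from f_0 = 0 and repeatedly fit a stump to the residuals."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(M):
        residuals = [y - p for y, p in zip(ys, preds)]            # step (a)
        s, c1, c2 = fit_stump(xs, residuals)                      # step (b)
        stumps.append((s, c1, c2))
        preds = [p + (c1 if x <= s else c2) for x, p in zip(xs, preds)]  # step (c)
    return stumps


def predict(stumps, x):
    """f_M(x) = sum of the M stump outputs."""
    return sum(c1 if x <= s else c2 for s, c1, c2 in stumps)


# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]
model = boost_regression(xs, ys, 6)
print([round(predict(model, x), 2) for x in xs])
```

Because each stump is the least-squares fit to the current residuals, the training squared error cannot increase from one round to the next.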
17. Boosting tree for classification
• When the exponential loss function is chosen
  – Adaboost + classification tree
  $L(y, f(x)) = \exp(-y f(x))$
• When the binomial deviance loss function is chosen
  – LogitBoost + classification tree
  $L(y, f(x)) = \log[1 + \exp(-y f(x))]$
18. Adaboost review
Input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \{-1, +1\}$; number of iterations $M$
1. Init the weights of the training samples
   $W_1 = (w_{11}, \ldots, w_{1i}, \ldots, w_{1N})$, $w_{1i} = \frac{1}{N}$, $i = 1, 2, \ldots, N$
2. For $m = 1$ to $M$:
   1) Fit a base learner using the dataset with weights $W_m$: $G_m(x): \chi \to \{-1, +1\}$
   2) Calculate the classification error on the training dataset: $e_m = \sum_{i=1}^{N} w_{mi} I(G_m(x_i) \neq y_i)$
   3) Calculate the coefficient of $G_m(x)$ from the classification error: $a_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$
   4) Update the weight of each training sample
      $W_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,i}, \ldots, w_{m+1,N})$, $w_{m+1,i} \leftarrow w_{mi} \exp(-a_m y_i G_m(x_i))$, then normalize so that the weights sum to 1
3. Get the final classifier
   $G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{m=1}^{M} a_m G_m(x)\right)$
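The Adaboost steps above can be sketched as follows. To keep the example short, it assumes a threshold classifier on one numeric feature as the base learner (the experiments in this deck use CART instead), renormalizes the weights after each update, and uses made-up toy data; `fit_weighted_stump`, `adaboost` and `classify` are illustrative names.

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Best G(x) = sign if x <= s else -sign under weights w; returns (error, s, sign)."""
    best = None
    for s in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x <= s else -sign) != y)
            if best is None or err < best[0]:
                best = (err, s, sign)
    return best


def adaboost(xs, ys, M):
    n = len(xs)
    w = [1.0 / n] * n                              # step 1: uniform weights
    ensemble = []
    for _ in range(M):
        err, s, sign = fit_weighted_stump(xs, ys, w)   # steps 2.1 and 2.2
        err = min(max(err, 1e-12), 1 - 1e-12)          # guard the log against 0 and 1
        a = 0.5 * math.log((1 - err) / err)            # step 2.3
        ensemble.append((a, s, sign))
        w = [wi * math.exp(-a * y * (sign if x <= s else -sign))
             for x, y, wi in zip(xs, ys, w)]           # step 2.4
        total = sum(w)
        w = [wi / total for wi in w]                   # renormalize to sum to 1
    return ensemble


def classify(ensemble, x):
    """G(x) = sign(sum_m a_m * G_m(x))."""
    f = sum(a * (sign if x <= s else -sign) for a, s, sign in ensemble)
    return 1 if f >= 0 else -1


# Toy data, made up for illustration: no single threshold separates it.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, 3)
print([classify(ensemble, x) for x in xs])
```

No single stump classifies this data correctly, but the weighted vote of three stumps does, which is the point of reweighting the misclassified samples.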
19. Adaboost: forward stagewise additive modeling with exponential loss
• Exponential loss function
  $L(y, f(x)) = \exp[-y f(x)]$
• Forward stagewise additive modeling
  $f_m(x) = f_{m-1}(x) + a_m G_m(x)$
  Solve for $a_m$ and $G_m(x)$:
  $(a_m, G_m(x)) = \arg\min_{a, G} \sum_{i=1}^{N} \exp[-y_i (f_{m-1}(x_i) + a G(x_i))]$
  $= \arg\min_{a, G} \sum_{i=1}^{N} w_{mi} \exp[-y_i a G(x_i)]$, where $w_{mi} = \exp[-y_i f_{m-1}(x_i)]$
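20. Adaboost: forward stagewise additive modeling with exponential loss (2)
• Continued
  $\sum_{i=1}^{N} w_{mi} \exp[-y_i a G(x_i)] = \sum_{y_i = G(x_i)} w_{mi} e^{-a} + \sum_{y_i \neq G(x_i)} w_{mi} e^{a}$
  $= \sum_{y_i = G(x_i)} w_{mi} e^{-a} + \sum_{y_i \neq G(x_i)} w_{mi} e^{-a} + \sum_{y_i \neq G(x_i)} w_{mi} (e^{a} - e^{-a})$
  $= (e^{a} - e^{-a}) \sum_{i=1}^{N} w_{mi} I(y_i \neq G(x_i)) + e^{-a} \sum_{i=1}^{N} w_{mi}$
• Solve for $G_m(x)$: for any $a > 0$, we have
  $G_m^*(x) = \arg\min_{G} \sum_{i=1}^{N} w_{mi} I(y_i \neq G(x_i))$
• Solve for $a_m$: setting the derivative with respect to $a$ to zero gives
  $a_m^* = \frac{1}{2} \log \frac{1 - e_m}{e_m}$, where $e_m = \frac{\sum_{i=1}^{N} w_{mi} I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_{mi}}$
  which equals $\sum_{i=1}^{N} w_{mi} I(y_i \neq G_m(x_i))$ when the weights are normalized to sum to 1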
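A quick numerical check of the closed form for a_m: for a fixed classifier, the weighted exponential loss depends on a only through which samples are classified correctly. The weights and the correct/incorrect pattern below are made up for illustration, and a simple grid search stands in for the analytic minimization.

```python
import math


def weighted_exp_loss(a, weights, correct):
    """sum_i w_i * exp(-a) for correctly classified samples, w_i * exp(a) otherwise."""
    return sum(w * math.exp(-a if ok else a) for w, ok in zip(weights, correct))


weights = [0.1, 0.2, 0.3, 0.4]         # made-up weights that sum to 1
correct = [True, False, True, True]     # made-up: only the second sample is misclassified

e_m = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
a_star = 0.5 * math.log((1 - e_m) / e_m)   # closed form from the derivation

grid = [k / 1000 for k in range(1, 3000)]  # brute-force search over a in (0, 3)
a_grid = min(grid, key=lambda a: weighted_exp_loss(a, weights, correct))
print(a_star, a_grid)
```

The grid minimizer agrees with the closed form up to the grid spacing.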
21. Adaboost: forward stagewise additive modeling with exponential loss (3)
• Weight update for each sample
  $w_{m+1,i} = \exp[-y_i f_m(x_i)]$
  $f_m(x) = f_{m-1}(x) + a_m G_m(x)$
  $\Rightarrow w_{m+1,i} = w_{m,i} \exp(-y_i a_m G_m(x_i))$
22. CART review
• Select the split variable according to the Gini index
  $Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$, $D_1 = \{(x, y) \in D \mid A(x) = a\}$, $D_2 = D - D_1$
  $Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$
• Can be used for both regression and classification
• First grow the tree as large as possible, then prune it via validation
• Parameters
  – Height; stop-split condition
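The two Gini formulas above translate directly into code. In this sketch a dataset is assumed to be a list of (features, label) pairs with features stored in a dict; that representation and the function names are choices made here for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(p) = 1 - sum_k p_k^2 for the class proportions in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(rows, feature, value):
    """Gini(D, A): size-weighted Gini of D1 = {A(x) == value} and D2 = D - D1."""
    d1 = [label for features, label in rows if features[feature] == value]
    d2 = [label for features, label in rows if features[feature] != value]
    n = len(rows)
    return sum(len(d) / n * gini(d) for d in (d1, d2) if d)


# Toy rows, made up for illustration.
rows = [
    ({"color": "red"}, "yes"),
    ({"color": "red"}, "yes"),
    ({"color": "blue"}, "no"),
    ({"color": "green"}, "yes"),
]
print(gini([label for _, label in rows]))
print(gini_split(rows, "color", "blue"))
```

CART picks the (feature, value) pair with the smallest Gini(D, A); here splitting on color == "blue" separates the classes perfectly.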
23. Experiment
• Goal: evaluate the performance of boosting tree
• Algorithms
  – Logistic regression
  – CART
  – Boosting tree (Adaboost + CART)
• Hulu internal datasets
  – Ad intelligence
24. Experiment (2)
• Task: predict whether the recall is high or low (binary classification)
• Dataset: Ad intelligence
  – 718 samples; 93 features
  – 5-fold cross validation
• AUC with logistic regression: 0.89
• Parameters for boosting tree
  – Tree height, number of base learners, and stop-split conditions
25. Experiment (3)
• Test AUC with boosting tree: 0.96
  – 0.79 for a single CART (height 6)
[Figure: AUC on test dataset (5-fold cross validation). y-axis: AUC; x-axis: number of base learners (1 to 50); one curve per tree height H = 2, 3, 4, 5, 6]
26. Gradient boosting
• Allows optimization of an arbitrary differentiable loss function
• Uses the gradient descent idea to approximate the residual
  Pseudo-residual: $-\left[\frac{\partial L(y, f(x))}{\partial f(x)}\right]_{f(x) = f_{m-1}(x)}$
  – With the quadratic loss function, the pseudo-residual is the ordinary residual $y - f_{m-1}(x)$
  $L(y, f(x)) = \frac{1}{2} (y - f(x))^2$
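A small sketch can confirm the claim about the quadratic loss. It approximates the negative gradient by a central difference, which is an illustrative shortcut (real implementations use the analytic gradient), and also shows the absolute loss, whose pseudo-residual is only the sign of the residual.

```python
def squared_loss(y, f):
    return 0.5 * (y - f) ** 2


def absolute_loss(y, f):
    return abs(y - f)


def pseudo_residual(loss, y, f, eps=1e-6):
    """Negative derivative of loss(y, f) with respect to f, by central difference."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y, f_prev = 3.0, 1.0  # made-up target and current prediction f_{m-1}(x)
print(pseudo_residual(squared_loss, y, f_prev))   # close to y - f_prev
print(pseudo_residual(absolute_loss, y, f_prev))  # close to sign(y - f_prev)
```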
27. Gradient boosting: pseudo code
Input: training set $\{(x_i, y_i)\}_{i=1}^{n}$; a differentiable loss function $L(y, F(x))$; number of iterations $M$
1. Initialize the model with a constant value:
   $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$
2. For $m = 1$ to $M$:
   1) Compute the pseudo-residuals:
      $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \ldots, n$
   2) Fit a base learner $h_m(x)$ to the pseudo-residuals (train using the dataset $\{(x_i, r_{im})\}_{i=1}^{n}$)
   3) Compute the multiplier $\gamma_m$ by solving the following optimization problem:
      $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$
   4) Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
3. Output $F_M(x)$
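The pseudo code above can be sketched in plain Python under several simplifying assumptions that are not in the slides: the base learner is a regression stump on one numeric feature, both argmin steps use a ternary search (valid for losses convex in the prediction), and the multiplier is searched on [0, gamma_max] with gamma_max = 10 chosen arbitrarily. All function names are illustrative.

```python
def fit_stump(xs, targets):
    """Least-squares stump: returns (threshold, left_value, right_value)."""
    best = None
    for s in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= s]
        right = [t for x, t in zip(xs, targets) if x > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - c1) ** 2 for t in left) + sum((t - c2) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, s, c1, c2)
    return best[1:]


def line_search(objective, lo, hi, iters=100):
    """Ternary search for the minimizer of a convex 1-D objective on [lo, hi]."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def gradient_boost(xs, ys, loss, neg_gradient, M, gamma_max=10.0):
    # Step 1: best constant model.
    f0 = line_search(lambda r: sum(loss(y, r) for y in ys), min(ys), max(ys))
    preds, stages = [f0] * len(xs), []
    for _ in range(M):
        residuals = [neg_gradient(y, p) for y, p in zip(ys, preds)]   # step 2.1
        s, c1, c2 = fit_stump(xs, residuals)                          # step 2.2
        h = [c1 if x <= s else c2 for x in xs]
        gamma = line_search(                                          # step 2.3
            lambda g: sum(loss(y, p + g * hv) for y, p, hv in zip(ys, preds, h)),
            0.0, gamma_max)
        stages.append((gamma, s, c1, c2))
        preds = [p + gamma * hv for p, hv in zip(preds, h)]           # step 2.4
    return f0, stages


def predict(f0, stages, x):
    return f0 + sum(g * (c1 if x <= s else c2) for g, s, c1, c2 in stages)


# Toy data (made up) with the quadratic loss and its analytic negative gradient.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]
f0, stages = gradient_boost(
    xs, ys,
    loss=lambda y, f: 0.5 * (y - f) ** 2,
    neg_gradient=lambda y, f: y - f,
    M=5)
print(f0, [round(predict(f0, stages, x), 2) for x in xs])
```

Swapping in another convex loss only requires passing a different `loss` and `neg_gradient` pair; the fitting loop itself does not change.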
28. Gradient tree boosting
• Use a decision tree as the base learner
  $h_m(x) = \sum_{j=1}^{J} b_{jm} I(x \in R_{jm})$
• Stagewise learning, choosing $\gamma_m$ with a line search
  $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$, $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$
• Friedman proposes choosing a separate optimal value $\gamma_{jm}$ for each of the tree’s regions
  $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$, $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$
29. Parameter choices and tricks
• Parameter choices
  – Terminal nodes J: [4, 8] is recommended
  – Iterations M: selected by evaluation on test/validation data
• Tricks for improvement
  – Shrinkage: $F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x)$, $0 < \nu \leq 1$
  – Stochastic gradient boosting
30. Boosting Tree Summary
• Forward stagewise additive model with trees
• Pros
  – Performance is usually good
  – Applies to both regression and classification
  – No need to transform/normalize the data
  – Few parameters, and easy to tune
• Tips
  – Try other loss functions besides exponential loss, especially when the data is noisy
  – Bump is usually good
31. Resources
• Implementation/Tools
  – MART (Multiple Additive Regression Trees)
  – Will share my implementation later
• More on boosting tree
  – “Elements of Statistical Learning”
  – 《统计学习方法》 (Statistical Learning Methods)
  – Parallelization: “Scaling up machine learning”