Feature selection

Introduces typical methods used for feature selection, including filter, wrapper, and subset selection approaches.



1. Machine learning workshop
   guodong@hulu.com
   Machine learning introduction | Logistic regression | Feature selection | Boosting, tree boosting
   See more machine learning posts: http://dongguo.me
2. Outline
   • Introduction
   • Typical feature selection methods
   • Feature selection in logistic regression
   • Tips and conclusion
3. What is / why feature selection
   • A procedure in machine learning to find a subset of features that produces a 'better' model for a given dataset
     – Avoid overfitting and achieve better generalization ability
     – Reduce the storage requirement and training time
     – Interpretability
4. When feature selection is important
   • Noisy data
   • Lots of low-frequency features
   • Use of multi-type features
   • Too many features compared to samples
   • Complex model
   • Samples in the real scenario are inhomogeneous with the training & test samples
5. When No.(samples)/No.(features) is large
   • Feature selection with Gini indexing
   • Algorithm: logistic regression
   • Training samples: 640K; test samples: 49K
   • Features: watch behavior of audiences; show level (11,327 features)
   [Chart: AUC vs. ratio of features used (all down to 10%) for L1-LR and L2-LR; AUC roughly 0.80 to 0.83]
6. When No.(samples) equals No.(features)
   • L1 logistic regression
   • Training samples: 50K; test samples: 49K
   • Features: watch behavior of audiences; video level (49,166 features)
   [Chart: how AUC changes with the number of features selected (all down to 10%); AUC roughly 0.728 to 0.736]
7. Typical methods for feature selection
   • Categories

                Single feature evaluation               Subset selection
     filter     MI, IG, KL-D, GI, CHI                   Category distance, …
     wrapper    Ranking accuracy using single feature   For LR (SFO, Grafting)

   • Single feature evaluation
     – Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi square statistic
   • Subset selection method
     – Sequential forward selection
     – Sequential backward selection
8. Single feature evaluation
   • Measure quality of features by all kinds of metrics
     – Frequency based
     – Dependence of feature and label (co-occurrence)
       • mutual information, Chi square statistic
     – Information theory
       • KL divergence, information gain
     – Gini indexing
9. Frequency based
   • Remove features according to the frequency of the features, or of the instances containing the feature
   • Typical scenario
     – Text mining
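A minimal sketch of this idea (the corpus and the min_df threshold are illustrative, not from the deck):

    from collections import Counter

    def frequency_filter(docs, min_df=2):
        """Keep terms that appear in at least min_df documents."""
        df = Counter()
        for doc in docs:
            df.update(set(doc))                 # count each term once per document
        return {t for t, n in df.items() if n >= min_df}

    docs = [["hulu", "show", "watch"], ["show", "watch"], ["rare_term"]]
    print(frequency_filter(docs, min_df=2))     # {'show', 'watch'}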
10. Mutual information
    • Measure the dependence of two random variables
    • Definition
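The definition image did not survive the transcript; the standard definition is I(X;Y) = ∑_x ∑_y P(x,y) ln [P(x,y) / (P(x)P(y))]. A minimal sketch for discrete arrays (the function name is illustrative):

    import numpy as np

    def mutual_information(x, y):
        """I(X;Y) in nats for two discrete arrays."""
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                pxy = np.mean((x == xv) & (y == yv))
                px, py = np.mean(x == xv), np.mean(y == yv)
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    x = np.array([1, 1, 0, 0]); y = np.array([1, 1, 0, 0])
    print(mutual_information(x, y))   # ln 2 ≈ 0.693: fully dependent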
11. Chi Square Statistic
    • Measure the dependence of two variables
      – A: number of times feature t and category c co-occur
      – B: number of times t occurs without c
      – C: number of times c occurs without t
      – D: number of times neither c nor t occurs
      – N: total number of instances
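The formula image is missing here; the standard contingency-table form used in the cited ICML'97 study is χ²(t, c) = N·(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). A direct transcription:

    def chi_square(A, B, C, D):
        """Chi-square statistic for feature t and category c.
        A: t and c co-occur; B: t without c; C: c without t; D: neither."""
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0

    print(chi_square(A=40, B=10, C=10, D=40))   # 36.0: t and c strongly dependent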
12. Entropy
    • Characterize the (im)purity of a collection of examples
      Entropy(S) = −∑_i P_i ln P_i
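A direct transcription of the formula (natural log, matching the slide):

    import numpy as np

    def entropy(labels):
        """Entropy(S) = -sum_i P_i ln P_i over the class proportions P_i."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log(p)))

    print(entropy([1, 1, 0, 0]))   # ln 2 ≈ 0.693: maximally impure
    print(entropy([1, 1, 1, 1]))   # 0.0: pure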
13. Information Gain
    • Reduction in entropy caused by partitioning the examples according to the attribute
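The usual form consistent with this description is IG(S, A) = Entropy(S) − ∑_v (|S_v|/|S|)·Entropy(S_v), where S_v holds the examples with attribute value v. A minimal sketch:

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log(p)))

    def information_gain(feature, labels):
        """Entropy(S) minus the weighted entropy of each partition S_v."""
        feature, labels = np.asarray(feature), np.asarray(labels)
        remainder = sum(np.mean(feature == v) * entropy(labels[feature == v])
                        for v in np.unique(feature))
        return entropy(labels) - remainder

    print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))   # ln 2: perfect split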
14. KL divergence
    • Measure the difference between two probability distributions
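Again the definition image is missing; the standard form is D_KL(P‖Q) = ∑_i P_i ln(P_i / Q_i). A minimal sketch:

    import numpy as np

    def kl_divergence(p, q):
        """D_KL(P || Q); asymmetric and >= 0, zero iff P == Q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                    # 0 * ln 0 is taken as 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ≈ 0.368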
15. Gini indexing
    • Calculate the conditional probability of f given the class label
    • Normalize across all classes
    • Calculate the Gini coefficient
    • For the two-category case
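The formula images are not in the transcript, so the following is one plausible reading of the listed steps (binary feature f, illustrative names), not necessarily the deck's exact definition:

    import numpy as np

    def gini_index(feature, labels):
        """Normalize P(f|c) across classes, then score purity as the sum of
        squared normalized shares (1/K = useless, 1 = fires in one class).
        Assumes the feature fires at least once."""
        feature, labels = np.asarray(feature), np.asarray(labels)
        classes = np.unique(labels)
        p = np.array([np.mean(feature[labels == c] == 1) for c in classes])
        q = p / p.sum()                  # normalize across all classes
        return float(np.sum(q ** 2))

    # Feature fires only in class 1 -> 1.0; fires uniformly -> 0.5
    print(gini_index([1, 1, 0, 0], [1, 1, 0, 0]))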
16. Comparison in text categorization (1)
    • A comparative study on feature selection in text categorization (ICML'97)
17. Comparison in text categorization (2)
    • Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
18. Shortcomings of single feature evaluation
    • Relevance between features is ignored
      – Features could be redundant
      – A feature that is completely useless by itself can provide a significant performance improvement when taken with others
      – Two features that are useless by themselves can be useful together
19. Shortcomings of single feature evaluation (2)
    • A feature that is completely useless by itself can provide a significant performance improvement when taken with others
20. Shortcomings of single feature evaluation (3)
    • Two features that are useless by themselves can be useful together
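The classic illustration (an assumed example; the slide's figure is not in the transcript) is XOR, where each feature alone tells nothing about the label but the pair determines it exactly:

    import numpy as np

    # y = x1 XOR x2
    x1 = np.array([0, 0, 1, 1])
    x2 = np.array([0, 1, 0, 1])
    y = x1 ^ x2

    print(np.mean(y[x1 == 0]), np.mean(y[x1 == 1]))   # 0.5 0.5: x1 alone is useless
    print(np.mean(y[x2 == 0]), np.mean(y[x2 == 1]))   # 0.5 0.5: x2 alone is useless
    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, y[(x1 == a) & (x2 == b)])         # each pair pins y down exactly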
21. Subset selection methods
    • Select subsets of features that together have good predictive power, as opposed to ranking features individually
    • Typically proceed by adding new features to, or removing features from, an existing set
      – Sequential forward selection (sketched below)
      – Sequential backward selection
    • Evaluation
      – Category distance measurement
      – Classification error
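A minimal sketch of sequential forward selection (evaluate is an assumed scoring callback, e.g. cross-validated accuracy of a model trained on the candidate subset):

    def sequential_forward_selection(all_features, evaluate, k):
        """Greedily add the feature that most improves evaluate(subset)."""
        selected = []
        while len(selected) < k:
            best = max((f for f in all_features if f not in selected),
                       key=lambda f: evaluate(selected + [f]))
            selected.append(best)
        return selected

Sequential backward selection is the mirror image: start from all features and greedily drop the one whose removal hurts evaluate() least.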
22. Category distance measurement
    • Select the feature subset with large category distance
23. Wrapper methods for logistic regression
    • Forward feature selection
      – Naïve method
        • needs to build a number of models quadratic in the number of features
      – Grafting
      – Single feature optimization
24. SFO (Singh et al., 2009)
    • Only optimize the coefficient of the new feature
    • Only need to iterate over instances that contain the new feature
    • Then fully relearn one new model with the selected feature included
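A sketch of the idea under stated assumptions (a binary feature present in the listed instances; the 1-D Newton update is my illustration, not code from the paper):

    import numpy as np

    def sfo_coefficient(scores, y, iters=20):
        """Fit only the new feature's coefficient b, holding the existing
        model's scores (w.x) fixed, over instances containing the feature."""
        b = 0.0
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(scores + b)))
            grad = np.sum(y - p)             # d loglik / db
            hess = -np.sum(p * (1.0 - p))    # d2 loglik / db2
            b -= grad / hess                 # Newton step
        return b

    scores = np.array([-0.2, 0.1, 0.0])      # current w.x where the feature fires
    y = np.array([1, 1, 0])
    print(sfo_coefficient(scores, y))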
25. Grafting (Perkins 2003)
    • Use the loss function's gradient with respect to the new feature to decide whether to add the feature
    • At each step, the feature with the largest gradient is added
    • The model is fully relearned after each feature is added
      – Need only build D models overall
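A sketch of the grafting test (for logistic regression the gradient of the negative log-likelihood at w_j = 0 is ∑_i x_ij (p_i − y_i); the feature is worth adding only if its magnitude exceeds the L1 penalty; names are illustrative):

    import numpy as np

    def grafting_score(x_j, y, p):
        """|d NLL / dw_j| at w_j = 0, given current predictions p."""
        return abs(np.dot(x_j, p - y))

    x_j = np.array([1.0, 0.0, 1.0])
    y = np.array([1, 0, 1])
    p = np.array([0.5, 0.5, 0.5])            # current model's predictions
    print(grafting_score(x_j, y, p))         # 1.0: strong signal for this feature

    # Pick the best candidate, add it only if its score beats the penalty:
    # j = max(candidates, key=lambda j: grafting_score(X[:, j], y, p))
    # if grafting_score(X[:, j], y, p) > lam: add feature j, relearn the model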
26. Experimentation
    • Percent improvement of log-likelihood on the test set
    • Both SFO and Grafting are easily parallelized
27. Summarization
    • Categories

                 Single feature evaluation               Subset selection
      filter     MI, IG, KL-D, GI, CHI                   Category distance, …
      wrapper    Ranking accuracy using single feature   For LR (SFO, Grafting)

    • Filter + single feature evaluation
      – Less time consuming, usually works well
    • Wrapper + subset selection
      – Higher accuracy, but prone to overfitting
28. Tips about feature selection
    • Remove features that could not occur in the real scenario
    • If a feature contributes nothing, the fewer features the better
    • Use L1 regularization for logistic regression (see the sketch below)
    • Use the random subspace method
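For the L1 tip, a one-line setup with scikit-learn (an assumed library choice; the deck does not name tooling):

    from sklearn.linear_model import LogisticRegression

    # The L1 penalty drives irrelevant coefficients to exactly zero,
    # doing embedded feature selection while fitting (smaller C = sparser)
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    # model.fit(X_train, y_train)
    # kept = model.coef_.ravel() != 0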
29. References
    • Feature selection for Classification (IDA'97)
    • An Introduction to Variable and Feature Selection (JMLR'03)
    • Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
    • A comparative study on feature selection in text categorization (ICML'97)
    • Scaling Up Machine Learning
