Feature selection

Introduces typical methods used for feature selection, including filter, wrapper, and subset selection methods.

Notes for slides:
  • Why samples of different categories can be separated. Separated well -> smaller classification error. Different features make different contributions.
  • Noisy data. Lots of low-frequency features: using ad-id as a feature easily overfits. Multi-type features. Too many features compared to samples: feature number > sample number; feature combinations. Complex model: ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series-related tasks.
  • Key points: “how to measure the quality of features” and “whether and how to use the underlying algorithms”. 1. The optimal feature set could only be selected by exhaustive search; 2. In all existing feature selection methods, the feature set is generated by adding features to, or removing features from, the set obtained in the previous step.
  • Decision tree
  • Is not a true distance metric, because it is not symmetric. Cannot be negative (Gibbs' inequality). Used in topic models.
  • Features could be redundant: videoId, contentId.
  • With 1,000 features and an average cost of 1 second to build one model, naïve forward selection (quadratic in the number of features, roughly 500,000 models) would take about 1 week.
  • Both
  • Feature selection

    1. Machine learning workshop guodong@hulu.com Machine learning introduction Logistic regression Feature selection Boosting, tree boosting See more machine learning posts: http://dongguo.me
    2. Outline • Introduction • Typical feature selection methods • Feature selection in logistic regression • Tips and conclusion
    3. What's/why feature selection • A procedure in machine learning to find a subset of features that produces a 'better' model for the given dataset – Avoid overfitting and achieve better generalization ability – Reduce the storage requirement and training time – Interpretability
    4. When feature selection is important • Noisy data • Lots of low-frequency features • Use of multi-type features • Too many features compared to the number of samples • Complex model • Samples in the real scenario are inhomogeneous with the training & test samples
    5. When No.(samples)/No.(features) is large • Feature selection with Gini indexing • Algorithm: logistic regression • Training samples: 640k; test samples: 49K • Feature: watch behavior of audiences; show level (11,327 features) • [Chart: AUC of L1-LR and L2-LR, roughly 0.80 to 0.83, vs. ratio of features used, from all down to 10%]
    6. When No.(samples) equals No.(features) • L1 logistic regression • Training samples: 50k; test samples: 49K • Feature: watch behavior of audiences; video level (49,166 features) • [Chart: how AUC changes with the number of features selected, roughly 0.728 to 0.736, from all features down to 10%]
    7. Typical methods for feature selection • Categories: filter + single feature evaluation: MI, IG, KL-D, GI, CHI; filter + subset selection: category distance, …; wrapper + single feature evaluation: ranking accuracy using a single feature; wrapper + subset selection: for LR (SFO, Grafting) • Single feature evaluation – Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic • Subset selection methods – Sequential forward selection – Sequential backward selection
    8. Single feature evaluation • Measure the quality of features by all kinds of metrics – Frequency based – Dependence of feature and label (co-occurrence) • mutual information, Chi-square statistic – Information theory • KL divergence, information gain – Gini indexing
    9. Frequency based • Remove features according to the frequency of the feature or the number of instances containing the feature • Typical scenario – Text mining
    10. Mutual information • Measures the dependence of two random variables • Definition (see the sketch after the slide list)
    11. Chi-square statistic • Measures the dependence of two variables (see the sketch after the slide list) – A: number of times feature t and category c co-occur – B: number of times t occurs without c – C: number of times c occurs without t – D: number of times neither c nor t occurs – N: total number of instances
    12. Entropy • Characterizes the (im)purity of a collection of examples: Entropy(S) = -Σ_i P_i ln P_i (a code sketch, together with information gain, follows the slide list)
    13. Information gain • Reduction in entropy caused by partitioning the examples according to the attribute
    14. KL divergence • Measures the difference between two probability distributions (see the sketch after the slide list)
    15. Gini indexing • Calculate the conditional probability of f given the class label • Normalize across all classes • Calculate the Gini coefficient • For the two-category case (a sketch of these steps follows the slide list)
    16. Comparison in text categorization (1) • A comparative study on feature selection in text categorization (ICML'97)
    17. Comparison in text categorization (2) • Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03)
    18. Shortcomings of single feature evaluation • Relevance between features is ignored – Features could be redundant – A feature that is completely useless by itself can provide a significant performance improvement when taken with others – Two features that are useless by themselves can be useful together
    19. Shortcomings of single feature evaluation (2) • A feature that is completely useless by itself can provide a significant performance improvement when taken with others
    20. Shortcomings of single feature evaluation (3) • Two features that are useless by themselves can be useful together
    21. Subset selection methods • Select subsets of features that together have good predictive power, as opposed to ranking features individually • Always proceed by adding new features to the existing set or removing features from the existing set (see the forward-selection sketch after the slide list) – Sequential forward selection – Sequential backward selection • Evaluation – Category distance measurement – Classification error
    22. Category distance measurement • Select the feature subset with a large category distance
    23. Wrapper methods for logistic regression • Forward feature selection – Naïve method • needs to build a number of models quadratic in the number of features – Grafting – Single feature optimization
    24. SFO (Singh et al., 2009) • Only optimizes the coefficient of the new feature • Only needs to iterate over instances that contain the new feature • Also fully relearns a new model with the selected feature included
    25. Grafting (Perkins 2003) • Uses the loss function's gradient with respect to the new feature to decide whether to add the feature • At each step, the feature with the largest gradient is added (see the sketch after the slide list) • The model is fully relearned after each feature is added – Need only build D models overall
    26. Experimentation • Percent improvement of log-likelihood on the test set • Both SFO and Grafting are easily parallelized
    27. Summary • Categories: filter + single feature evaluation: MI, IG, KL-D, GI, CHI; filter + subset selection: category distance, …; wrapper + single feature evaluation: ranking accuracy using a single feature; wrapper + subset selection: for LR (SFO, Grafting) • Filter + single feature evaluation – Less time-consuming, usually works well • Wrapper + subset selection – Higher accuracy, but prone to overfitting
    28. Tips about feature selection • Remove features that could not occur in the real scenario • If a feature makes no contribution, the fewer features the better • Use L1 regularization for logistic regression (see the sketch after the slide list) • Use the random subspace method
    29. References • Feature selection for Classification (IDA'97) • An Introduction to Variable and Feature Selection (JMLR'03) • Feature selection for text classification based on Gini Coefficient of Inequality (JMLR'03) • A comparative study on feature selection in text categorization (ICML'97) • Scaling Up Machine Learning
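
The mutual information definition referenced on slide 10 is the standard one: I(X;Y) = Σ_{x,y} p(x,y) ln[ p(x,y) / (p(x) p(y)) ]. A minimal sketch of estimating it from paired samples of a discrete feature and a discrete label; the toy data and function name are illustrative, not from the deck:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) from paired samples of two discrete variables."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy, p_x, p_y = c / n, px[x] / n, py[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Toy example: a binary feature against a binary class label.
feature = [1, 1, 1, 0, 0, 0, 1, 0]
label   = [1, 1, 0, 0, 0, 1, 1, 0]
print(mutual_information(feature, label))
```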
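
Slide 11 defines the counts A, B, C, D and N; the usual text-categorization form of the statistic built from them (as used in the ICML'97 study cited on slide 16) is χ²(t, c) = N(AD - CB)² / ((A + C)(B + D)(A + B)(C + D)). A sketch assuming those counts are already tallied:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for feature t and category c.
    A: t and c co-occur; B: t without c; C: c without t; D: neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

# Example: a feature that appears mostly inside the target category.
print(chi_square(A=40, B=10, C=20, D=130))
```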
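
For slides 12 and 13, a sketch of entropy and of the information gain from partitioning a labeled sample set on one discrete attribute (slide 12 uses the natural logarithm; any fixed base gives the same feature ranking):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i) over the class proportions P_i."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in entropy caused by partitioning the examples by the attribute."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    gain = entropy(labels)
    for subset in partitions.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain

feature = ["a", "a", "b", "b", "b", "a"]
label   = [1, 1, 0, 0, 1, 1]
print(information_gain(feature, label))
```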
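
Slide 14, together with the note above that KL divergence is asymmetric and never negative (Gibbs' inequality), can be illustrated directly; the two discrete distributions here are hypothetical:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i); asymmetric and always >= 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```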
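
The Gini-indexing steps on slide 15 are only named, not spelled out, so the sketch below is one plausible reading under explicit assumptions: take P(f | c) for every class, normalize those values across classes, and compute the Gini coefficient of inequality of the result, so that a feature concentrated in few classes scores high. The deck's exact normalization may differ:

```python
def gini_index(p_f_given_c):
    """Gini coefficient of inequality over a feature's class-conditional
    probabilities: higher means the feature concentrates in fewer classes."""
    total = sum(p_f_given_c)
    if total == 0:
        return 0.0
    w = [p / total for p in p_f_given_c]        # normalize across classes
    k = len(w)
    pairwise = sum(abs(a - b) for a in w for b in w)
    return pairwise / (2 * k)                   # mean of the weights is 1/k

# Two-category case: P(f|c1) = 0.30, P(f|c2) = 0.05 -> skewed toward class 1.
print(gini_index([0.30, 0.05]))
```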
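
For the subset-selection and wrapper slides (21 and 23), a sketch of naïve sequential forward selection. `build_and_score` is a placeholder for the wrapped learner and metric (for example, training a logistic regression on the chosen columns and returning validation AUC); as slide 23 notes, this builds a number of models quadratic in the number of features:

```python
def sequential_forward_selection(candidate_features, build_and_score, k):
    """Greedily grow a feature set: at each step add the candidate feature
    that most improves the wrapped model's score."""
    remaining = list(candidate_features)
    selected = []
    while remaining and len(selected) < k:
        best_feature, best_score = None, float("-inf")
        for f in remaining:
            score = build_and_score(selected + [f])  # train on the trial subset
            if score > best_score:
                best_feature, best_score = f, score
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```

Sequential backward selection is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.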
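
For grafting (slide 25), the key step is ranking candidate features by the magnitude of the loss gradient with respect to their currently-zero weight. A sketch of that scoring step for logistic loss with NumPy; the surrounding loop that adds the top-scoring feature and fully relearns the model is omitted, and the names and toy data are illustrative:

```python
import numpy as np

def grafting_scores(X, y, w):
    """|d(mean log-loss)/d(w_j)| for every feature at the current weights w.
    The excluded feature with the largest score is the next one to add."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # current predicted probabilities
    grad = X.T @ (p - y) / len(y)      # gradient of the mean logistic loss
    return np.abs(grad)

# Toy usage: 5 samples, 3 candidate features, no feature added yet (w = 0).
X = np.array([[1., 0., 1.], [0., 1., 0.], [1., 1., 1.], [0., 0., 1.], [1., 0., 0.]])
y = np.array([1., 0., 1., 0., 1.])
print(grafting_scores(X, y, np.zeros(3)))
```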
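
For the L1-regularization tip on slide 28, a sketch with scikit-learn; the synthetic data and the regularization strength C are placeholders. Coefficients driven exactly to zero mark the features that the model effectively drops:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                 # 20 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# The liblinear and saga solvers support the L1 penalty; smaller C = sparser model.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0] != 0.0)
print("features kept:", kept)
```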
