Machine learning workshop
guodong@hulu.com

Machine learning introduction
Logistic regression
Feature selection
Boosting, tree boosting

See more machine learning posts: http://dongguo.me

Outline
•  Introduction
•  Typical feature selection methods
•  Feature selection in logistic regression
•  Tips and conclusion

What's/why feature selection
•  A procedure in machine learning to find a subset of features that produces a ‘better’ model for a given dataset
  –  Avoid overfitting and achieve better generalization ability
  –  Reduce the storage requirement and training time
  –  Interpretability

When feature selection is important
•  Noisy data
•  Lots of low-frequency features
•  Use of multi-type features
•  Too many features compared to the number of samples
•  Complex model
•  Samples in the real scenario are inhomogeneous with the training & test samples

When No.(samples)/No.(features) is large
•  Feature selection with Gini indexing
•  Algorithm: logistic regression
•  Training samples: 640K; test samples: 49K
•  Features: watch behavior of audiences; show level (11,327 features)
[Chart: AUC of L1-LR and L2-LR (0.80-0.83) vs. ratio of features used (all, 80%, 70%, ..., 10%)]

When No.(samples) equals No.(features)
•  L1 logistic regression
•  Training samples: 50K; test samples: 49K
•  Features: watch behavior of audiences; video level (49,166 features)
[Chart: how AUC changes with the number of features selected; AUC 0.728-0.736 vs. ratio of features used (all, 90%, 80%, ..., 10%)]

Typical methods for feature selection
•  Categories

             Single feature evaluation         Subset selection
   filter    MI, IG, KL-D, GI, CHI             Category distance, ...
   wrapper   Ranking accuracy using a          For LR (SFO, Grafting)
             single feature

•  Single feature evaluation
  –  Frequency based, mutual information, KL divergence, Gini indexing, information gain, Chi-square statistic
•  Subset selection methods
  –  Sequential forward selection
  –  Sequential backward selection

Single feature evaluation
•  Measure quality of features by all kinds of metrics
  –  Frequency based
  –  Dependence of feature and label (co-occurrence)
     •  mutual information, Chi-square statistic
  –  Information theory
     •  KL divergence, information gain
  –  Gini indexing

Frequency based
•  Remove features according to the frequency of the feature, or the number of instances containing the feature
•  Typical scenario
  –  Text mining

Mutual information
•  Measure the dependence of two random variables
•  Definition
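
For reference, the standard definition is I(X;Y) = sum over (x, y) of p(x,y) ln[ p(x,y) / (p(x) p(y)) ]. Below is a minimal sketch for scoring a discrete feature against the class label; the function name and inputs are illustrative, not from the deck.

```python
import numpy as np

def mutual_information(x, y):
    """I(X; Y) between two discrete variables, estimated from empirical counts:
    sum over (x, y) of p(x, y) * ln(p(x, y) / (p(x) * p(y)))."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Example: mutual_information([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]) scores how strongly
# a binary feature co-varies with a binary label; higher = stronger dependence.
```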
  
Chi-Square Statistic
•  Measure the dependence of two variables
  –  A: number of times feature t and category c co-occur
  –  B: number of times t occurs without c
  –  C: number of times c occurs without t
  –  D: number of times neither c nor t occurs
  –  N: total number of instances
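
From these counts, the usual 2x2 form is chi^2(t, c) = N (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D)). A minimal sketch; the function name is illustrative.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of feature t and category c from the 2x2 co-occurrence
    counts defined on the slide; larger values mean stronger dependence."""
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0
```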
  
Entropy
•  Characterize the (im)purity of a collection of examples

$Entropy(S) = -\sum_i P_i \ln P_i$

Information Gain
•  Reduction in entropy caused by partitioning the examples according to the attribute
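
For reference, the usual form is Gain(S, A) = Entropy(S) - sum over v in Values(A) of |S_v|/|S| * Entropy(S_v). A minimal sketch of both quantities; names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P_i * ln(P_i), over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in entropy from partitioning the examples by a discrete feature."""
    n = len(labels)
    partitions = {}
    for v, label in zip(feature_values, labels):
        partitions.setdefault(v, []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```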
  
KL divergence
•  Measure the difference between two probability distributions
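
For reference, D_KL(P || Q) = sum_i P(i) ln(P(i) / Q(i)); as the editor's notes point out, it is not symmetric and is non-negative by Gibbs' inequality. A minimal sketch, assuming Q(i) > 0 wherever P(i) > 0.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)) for two discrete distributions
    given as equal-length sequences of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```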
  
Gini indexing
•  Calculate conditional probability of f given class label
•  Normalize across all classes
•  Calculate Gini coefficient
•  For the two-category case
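
The formulas behind these steps are not reproduced here; the sketch below is one common formulation that follows the three steps for a binary document-feature matrix, so treat it as an illustration rather than the deck's exact definition.

```python
import numpy as np

def gini_index_scores(X, y):
    """Gini-index feature scores following the slide's three steps.
    X: binary (n_samples, n_features) matrix of feature occurrences; y: class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    # Step 1: conditional probability of each feature given each class, P(f | c)
    p_f_given_c = np.vstack([X[y == c].mean(axis=0) for c in classes])
    # Step 2: normalize across classes so each feature's column sums to one
    normalized = p_f_given_c / (p_f_given_c.sum(axis=0, keepdims=True) + 1e-12)
    # Step 3: Gini coefficient of the normalized distribution; higher = more class-specific
    return (normalized ** 2).sum(axis=0)
```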
  
Comparison in text categorization (1)
•  A comparative study on feature selection in text categorization (ICML’97)

Comparison in text categorization (2)
•  Feature selection for text classification based on Gini Coefficient of Inequality (JMLR’03)

Shortcomings of single feature evaluation
•  Relevance between features is ignored
  –  Features could be redundant
  –  A feature that is completely useless by itself can provide a significant performance improvement when taken with others
  –  Two features that are useless by themselves can be useful together

Shortcomings of single feature evaluation (2)
•  A feature that is completely useless by itself can provide a significant performance improvement when taken with others

Shortcomings of single feature evaluation (3)
•  Two features that are useless by themselves can be useful together

Subset selection methods
•  Select subsets of features that together have good predictive power, as opposed to ranking features individually
•  Always proceed by adding new features to the existing set or removing features from the existing set (a forward-selection sketch follows this slide)
  –  Sequential forward selection
  –  Sequential backward selection
•  Evaluation
  –  Category distance measurement
  –  Classification error
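
A minimal sketch of sequential forward selection; `evaluate` is a caller-supplied scorer (e.g., validation accuracy or a category-distance measure) and is illustrative.

```python
def sequential_forward_selection(candidate_features, evaluate, k):
    """Greedily grow a feature set: at each step, add the candidate whose inclusion
    gives the best evaluate(subset) score, until k features have been selected."""
    selected = []
    remaining = list(candidate_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Sequential backward selection is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.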
  
Category distance measurement
•  Select the feature subset with a large category distance

Wrapper methods for logistic regression
•  Forward feature selection
  –  Naïve method
     •  needs to build a number of models quadratic in the number of features
  –  Grafting
  –  Single feature optimization (SFO)

SFO (Singh et al., 2009)
•  Only optimize the coefficient of the new feature
•  Only need to iterate over instances that contain the new feature
•  Also fully relearn one new model with the selected feature included
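
A minimal sketch of the single-feature step, not the authors' implementation: holding the existing model fixed, fit only the candidate feature's coefficient with Newton updates over the instances that contain it; the resulting log-likelihood gain can then rank candidates.

```python
import numpy as np

def sfo_candidate_coefficient(base_scores, f_values, y, n_iter=10):
    """base_scores: existing model's scores w.x on the instances containing the
    candidate feature; f_values: the feature's values there; y: labels in {0, 1}.
    Returns the coefficient fitted for the candidate feature alone."""
    base_scores = np.asarray(base_scores, dtype=float)
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(base_scores + beta * f_values)))
        grad = np.dot(f_values, y - p)               # d loglik / d beta
        hess = -np.dot(f_values ** 2, p * (1 - p))   # d^2 loglik / d beta^2
        if hess == 0:
            break
        beta -= grad / hess                          # Newton step
    return beta
```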
  
Grafting (Perkins 2003)
•  Use the loss function's gradient with respect to the new feature to decide whether to add the feature
•  At each step, the feature with the largest gradient is added
•  The model is fully relearned after each feature is added
  –  Need to build only D models overall
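
A minimal sketch of the gradient test, assuming a logistic loss: every unused feature is scored by the magnitude of the loss gradient with respect to its (still-zero) weight at the current model, and the top-scoring feature is the one to add before fully relearning.

```python
import numpy as np

def grafting_gradient_scores(X, y, scores, unused):
    """X: (n_samples, n_features) matrix; y: labels in {0, 1};
    scores: current model's scores w.x; unused: indices of features not yet added.
    Returns {feature index: |d loglik / d w_j|} evaluated at w_j = 0."""
    X = np.asarray(X, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    residual = np.asarray(y, dtype=float) - p   # gradient of the log-likelihood w.r.t. the scores
    return {j: abs(np.dot(X[:, j], residual)) for j in unused}
```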
  
Experimentation
•  Percent improvement of log-likelihood on the test set
•  Both SFO and Grafting are easily parallelized

Summary
•  Categories

             Single feature evaluation         Subset selection
   filter    MI, IG, KL-D, GI, CHI             Category distance, ...
   wrapper   Ranking accuracy using a          For LR (SFO, Grafting)
             single feature

•  Filter + single feature evaluation
  –  Less time consuming, usually works well
•  Wrapper + subset selection
  –  Higher accuracy, but prone to overfitting

Tips about feature selection
•  Remove features that could not occur in the real scenario
•  If a feature makes no contribution, the fewer features the better
•  Use L1 regularization for logistic regression (see the sketch below)
•  Use the random subspace method
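
A minimal sketch of the L1-regularization tip using scikit-learn; the data below is a random placeholder standing in for a real training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: replace with the real design matrix and labels.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 200))
y_train = rng.integers(0, 2, size=1000)

# L1 regularization drives uninformative coefficients exactly to zero, so feature
# selection falls out of the fitted model; smaller C means stronger regularization
# and fewer surviving features.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

selected = np.flatnonzero(model.coef_[0])
print(f"kept {selected.size} of {X_train.shape[1]} features")
```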
  
References
•  Feature Selection for Classification (IDA’97)
•  An Introduction to Variable and Feature Selection (JMLR’03)
•  Feature Selection for Text Classification Based on Gini Coefficient of Inequality (JMLR’03)
•  A Comparative Study on Feature Selection in Text Categorization (ICML’97)
•  Scaling Up Machine Learning


Editor's Notes

  • #4 Why can samples of different categories be separated? Separated well -> smaller classification error. Different features make different contributions.
  • #5 Noisy data. Lots of low-frequency features: e.g., using ad-id as a feature overfits easily. Multi-type features. Too many features compared to samples: feature number > sample number; feature combinations. Complex model: e.g., ANN. Samples to be predicted are inhomogeneous with the training & test samples: demographic targeting; time-series related.
  • #8 Key points: “how to measure the quality of features” and “whether and how to use the underlying algorithms”. 1. The optimal feature set could only be selected through an exhaustive method; 2. In all existing feature selection methods, the feature set is generated by adding or removing some features from the set of the last step.
  • #14 Decision tree
  • #15 Not a true metric for distance measurement, because it is not symmetric. Cannot be negative (Gibbs’ inequality). Used in topic models.
  • #19 Features could be redundant: videoId, contentId
  • #24 With 1,000 features, at a cost of about 1 second to build one model on average, this would take about 1 week.
  • #27 Both