Intro to Feature Selection


Published in: Technology, Education

  1. 1. Introduction to Feature Selection in Data Mining and Knowledge Discovery Huimin Chen Department of Electrical Engineering University of New Orleans New Orleans, LA 70148
  2. 2. Typical Problem <ul><li>Massive data with high dimensionality </li></ul><ul><li>Need to learn regularities from data </li></ul><ul><ul><li>What is the data generation mechanism? </li></ul></ul><ul><ul><li>How to predict the unseen data? </li></ul></ul><ul><li>Model selection is important, but this talk will only focus on a relatively “easier” problem </li></ul><ul><ul><li>Variable selection and feature selection </li></ul></ul><ul><ul><li>Dimensionality reduction </li></ul></ul><ul><li>Have to narrow down to statistical models </li></ul>
  3. 3. Modeling and Variable/Feature Selection <ul><li>Variable/feature selection </li></ul><ul><ul><li>Which factors are important? </li></ul></ul><ul><ul><li>What statistical dependencies are significant? </li></ul></ul><ul><ul><li>Most problems are learning from data </li></ul></ul><ul><li>Bias-variance tradeoff </li></ul><ul><ul><li>A unified view via penalized likelihood </li></ul></ul><ul><ul><li>The key is to design/interpret the penalty term </li></ul></ul><ul><ul><li>Sample based approach (boosting, bagging, random forest) not covered here </li></ul></ul><ul><li>Computational issue </li></ul><ul><ul><li>Efficient optimization techniques </li></ul></ul>
  4. 4. Outline of This Talk <ul><li>Formulation of feature selection problem </li></ul><ul><li>General design criteria for feature selection </li></ul><ul><ul><li>Unified view via penalized likelihood </li></ul></ul><ul><ul><li>Relationship with existing model selection criteria </li></ul></ul><ul><ul><li>Desired property of the penalized likelihood </li></ul></ul><ul><ul><li>Feature selection vs. risk minimization </li></ul></ul><ul><li>Practical applications </li></ul><ul><ul><li>Data mining in a taxonomic problem </li></ul></ul><ul><ul><li>Financial engineering </li></ul></ul>
  5. 5. Feature Selection Methods <ul><li>Filters: select features/variables by ranking them with correlation coefficients </li></ul><ul><li>Wrapper methods: assess the subset of features/variables based on their usefulness to a given classifier/estimator/predictor </li></ul><ul><li>Embedded methods: need to address </li></ul><ul><ul><li>how to search possible variable subsets </li></ul></ul><ul><ul><li>how to assess/score the performance </li></ul></ul><ul><ul><li>how to choose a learning machine that works particularly well with those selected features/variables </li></ul></ul>
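The filter approach above can be sketched in a few lines. This is a minimal illustration, assuming Pearson correlation with the response as the ranking score; the function name `rank_features_by_correlation` and the synthetic data are hypothetical and not part of the slides.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank features by decreasing |Pearson correlation| with y (filter method)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features get correlation 0 instead of a division by zero.
    corr = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(corr))  # strongest correlation first
    return order, corr

# Synthetic example (assumed data): only features 2 and 0 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
order, corr = rank_features_by_correlation(X, y)
print(order[:2], np.round(corr, 2))
```

A filter of this kind scores each variable independently of any learner, which is what distinguishes it from the wrapper and embedded methods listed on the slide.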
  6. 6. Variable Selection in Regression <ul><li>We have $n$ i.i.d. samples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from an unknown distribution $P(x, y)$, and we want to infer the statistical dependency between $x$ and $y$. </li></ul><ul><li>A generic model is </li></ul><ul><ul><ul><li>$y_i = f(x_i) + e_i$, where the $e_i$ are i.i.d. noise with unknown distribution. </li></ul></ul></ul><ul><li>Regression aims to find $f$ from finite samples. </li></ul><ul><li>$f$ can be a generalized linear function </li></ul><ul><li>$f(x) = w^T \psi(x) = \sum_i w_i \psi(x, \theta_i)$, where $\psi$ is a basis function </li></ul><ul><li>$f$ can be an affine function </li></ul><ul><li>$f(x) = \sum_i w_i k(x_i, x) + b$, where $k$ is a kernel function </li></ul>
  7. 7. Variable Selection in Regression <ul><li>For any “model” $f$, define the loss function </li></ul><ul><li>$L(y, f(x, \theta))$, where $\theta$ is the set of free parameters in the model </li></ul><ul><li>We can choose the $\theta$ that minimizes the empirical risk $R(\theta) = \frac{1}{n} \sum_i L(y_i, f(x_i, \theta))$ </li></ul><ul><li>Problem: the “best fit” leads to an over-parameterized model </li></ul><ul><li>One needs to find the best model $f$ based on a model selection criterion </li></ul><ul><li>Within a given model $f$, one decides the best subset of nonzero free parameters → variable selection </li></ul>
  8. 8. Penalized Least Squares <ul><li>Assume the free parameter has dimension $d$. Many variable selection criteria are closely related to minimizing the following penalized least squares (PLS) </li></ul><ul><li>With the $\ell_0$ penalty, PLS becomes </li></ul>
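One common way to write the PLS objective, assumed here for a linear model $y = X\theta + e$ with an $n \times d$ design matrix $X$ (this notation and the scaling by $n$ are assumptions, not taken from the slide), is:

```latex
\min_{\theta \in \mathbb{R}^d} \;
  \frac{1}{2}\,\lVert y - X\theta \rVert^2
  + n \sum_{j=1}^{d} p_\lambda\bigl(\lvert\theta_j\rvert\bigr)
% With the l0-type penalty p_lambda(|theta|) = (lambda^2 / 2) 1{theta != 0},
% and m = number of nonzero coefficients, the objective becomes
\frac{1}{2}\,\lVert y - X\theta \rVert^2 + \frac{n\lambda^2}{2}\, m
```

Under this assumed scaling, a choice such as $\lambda = (2\log d)^{1/2}\sigma/n^{1/2}$ charges $\sigma^2 \log d$ per selected variable, which is how a model selection criterion can be read as a particular value of $\lambda$.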
  9. 9. Penalized Least Squares <ul><li>Classical $C_p$ chooses $m$ variables by minimizing $\mathrm{RSS}_m/(n - m)$, which is asymptotically equivalent to $\lambda = \sigma/n^{1/2}$, where $\sigma$ is the standard deviation of the regression error. </li></ul><ul><li>Generalized cross-validation chooses $m$ variables by minimizing $\mathrm{RSS}_m/[n(1 - m/n)^2]$, which is asymptotically equivalent to $\lambda = \sigma/(n/2)^{1/2}$ </li></ul><ul><li>The risk inflation criterion corresponds to $\lambda = (2 \log d)^{1/2} \sigma/n^{1/2}$ </li></ul>
  10. 10. Penalized Likelihood <ul><li>If $(x_i, y_i)$ has a probability density function $p(g(x_i^T \theta), y_i)$ with known inverse link function $g$, then we can define the penalized likelihood function </li></ul><ul><li>Maximizing the penalized likelihood can lead to a sparse estimate → variable selection </li></ul><ul><li>Logistic regression belongs to this type </li></ul>
  11. 11. Discussion on $\ell_p$ Penalty in PLS <ul><li>The $\ell_p$ penalty with $0 < p < 2$ yields bridge regression </li></ul><ul><li>The two extremes: the $\ell_0$ penalty yields best variable subset selection, while the $\ell_2$ penalty yields ridge regression, which does no variable selection </li></ul><ul><li>When $p \le 1$, PLS and penalized likelihood automatically perform variable selection </li></ul><ul><li>However, only the $\ell_1$ penalty yields a convex PLS that maintains sparseness of the solution (called LASSO) </li></ul><ul><li>Using both the $\ell_1$ penalty and the $\ell_2$ penalty → elastic net </li></ul>
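The contrast between the $\ell_0$, $\ell_1$ and $\ell_2$ penalties is easiest to see coordinate-wise. The sketch below assumes an orthonormal design, so that each coefficient solves $\min_\theta \frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|)$ for its least-squares value $z$; the function names and the exact penalty scalings in the comments are illustrative assumptions.

```python
import numpy as np

def hard_threshold(z, lam):
    # l0-type penalty (lam**2 / 2) * 1[theta != 0]: keep z if |z| > lam, else 0.
    return np.where(np.abs(z) > lam, z, 0.0)

def soft_threshold(z, lam):
    # l1 penalty lam * |theta| (LASSO): shrink toward 0 by lam, clip at 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    # l2 penalty (lam / 2) * theta**2 (ridge): proportional shrinkage, never exactly 0.
    return z / (1.0 + lam)

z = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])  # assumed least-squares coefficients
lam = 1.0
print("hard :", hard_threshold(z, lam))
print("soft :", soft_threshold(z, lam))
print("ridge:", ridge_shrink(z, lam))
```

Hard thresholding is sparse but discontinuous in $z$, soft thresholding is sparse and continuous but shifts large coefficients by $\lambda$, and ridge shrinkage never produces exact zeros.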
  12. 12. Desired Properties of the Penalty Function in PLS <ul><li>Sparsity: only a small subset of the estimated coefficients are nonzero </li></ul><ul><li>Unbiasedness: When the true coefficient is large, the estimator should have small bias </li></ul><ul><li>Continuity: The estimator should be continuous to reduce the instability of variable selection </li></ul><ul><li>Computationally tractable: The estimation algorithm should have polynomial complexity </li></ul>
  13. 13. Finding A Good Penalty Function <ul><li>The $\ell_p$ penalty with $0 \le p < 1$ does not satisfy the continuity condition </li></ul><ul><li>The $\ell_p$ penalty with $1 < p < 2$ does not satisfy the sparsity condition </li></ul><ul><li>The $\ell_1$ penalty yields a convex PLS and a sparse solution; however, it is biased </li></ul><ul><li>There exist infinitely many penalty functions that satisfy those requirements, but the resulting PLS cannot be convex </li></ul>
  14. 14. Finding A Good Penalty Function <ul><li>Use the smoothly clipped absolute deviation (SCAD) penalty </li></ul><ul><li>We need $p'_\lambda(|\theta|)$ to be singular at the origin (sparsity condition) and $p'_\lambda(|\theta|) \to 0$ as $|\theta| \to \infty$ (unbiasedness condition) </li></ul><ul><li>Rule-of-thumb choice: $a = 3.7$ </li></ul>
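For reference, the SCAD penalty can be sketched as below. The piecewise form used is the standard definition from the SCAD literature and is an assumption here, since the slide's own formula is not reproduced in the transcript; the function names are hypothetical.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(|theta|): linear, then quadratic, then constant."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                            # |theta| <= lam
    quad = -(t ** 2 - 2 * a * lam * t + lam ** 2) / (2 * (a - 1))  # lam < |theta| <= a*lam
    const = (a + 1) * lam ** 2 / 2                              # |theta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lam(|theta|) for |theta| >= 0 (right derivative at 0)."""
    t = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - t, 0.0) / (a - 1)  # decays linearly to 0 at a*lam
    return np.where(t <= lam, lam, tail)

grid = np.array([0.0, 0.5, 1.0, 2.0, 3.7, 10.0])
print(scad_penalty(grid, lam=1.0))
print(scad_derivative(grid, lam=1.0))
```

The derivative equals $\lambda$ near the origin, which gives the kink needed for sparsity, and it is exactly zero beyond $a\lambda$, so large coefficients are not shrunk.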
  15. 15. Penalized Empirical Risk Minimization <ul><li>For machine learning problems, estimating the unknown parameters is not the primary concern </li></ul><ul><li>One wants to minimize the structural risk → empirical risk + penalty term </li></ul><ul><li>$\frac{1}{n} \sum_i L(y_i, f(x_i, \theta)) + \sum_j p_{\lambda_j}(|\theta_j|)$ </li></ul><ul><li>Variable selection is important when $\theta$ has high dimensionality </li></ul><ul><li>Use the estimate of $\theta$ to compute the penalized risk </li></ul>
  16. 16. Penalized Empirical Risk Minimization <ul><li>Consistency issue: For large sample size </li></ul><ul><ul><li>The penalized risk converges to the true risk in probability </li></ul></ul><ul><ul><li>The nonzero locations are correctly identified with probability 1 </li></ul></ul><ul><ul><li>The penalized maximum likelihood estimate converges to the true parameter </li></ul></ul><ul><li>The technical conditions for consistency </li></ul><ul><li>The non-sparsity $d$ should not grow faster than $(n/\log n)^{1/2}$ </li></ul>
  17. 17. Computational Issue <ul><li>PLS or penalized likelihood may not be convex/concave for certain desired penalty functions </li></ul><ul><li>Use local quadratic approximation </li></ul><ul><li>The initial estimate can be LS or MLE without penalty </li></ul><ul><li>The iterative algorithm is of Newton-Raphson type, with convergence properties similar to EM </li></ul><ul><li>For large $d$ and small sparsity $m$, it can still be inefficient </li></ul>
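A local quadratic approximation (LQA) iteration for least squares can be sketched as follows. This is a minimal illustration under several assumptions not stated on the slide: the objective is $\frac{1}{2}\lVert y - X\theta\rVert^2 + n\sum_j p_\lambda(|\theta_j|)$ with the SCAD penalty, each penalty term is replaced by a quadratic with weight $p'_\lambda(|\theta_j|)/|\theta_j|$ at the current iterate, and coefficients smaller than a tolerance are set to zero and dropped. The names `lqa_scad` and `scad_derivative` and the synthetic data are hypothetical.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    # Standard SCAD derivative (assumed form): lam near 0, zero beyond a*lam.
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def lqa_scad(X, y, lam, a=3.7, tol=1e-6, max_iter=100):
    """Iteratively reweighted ridge steps (LQA) for SCAD-penalized least squares."""
    n, d = X.shape
    theta = np.linalg.lstsq(X, y, rcond=None)[0]  # unpenalized LS as initial estimate
    active = np.ones(d, dtype=bool)
    for _ in range(max_iter):
        active &= np.abs(theta) > tol             # drop coefficients that hit zero
        theta[~active] = 0.0
        if not active.any():
            break
        Xa, ta = X[:, active], theta[active]
        # Quadratic surrogate: penalty weight p'(|theta_j|) / |theta_j| per coefficient.
        D = np.diag(scad_derivative(ta, lam, a) / np.abs(ta))
        new = np.linalg.solve(Xa.T @ Xa + n * D, Xa.T @ y)
        converged = np.max(np.abs(new - ta)) < tol
        theta[active] = new
        if converged:
            break
    theta[np.abs(theta) <= tol] = 0.0
    return theta

# Synthetic example (assumed data): three nonzero coefficients out of eight.
rng = np.random.default_rng(0)
theta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(200, 8))
y = X @ theta_true + 0.5 * rng.normal(size=200)
theta_hat = lqa_scad(X, y, lam=0.3)
print(np.round(theta_hat, 3))
```

Each step solves a ridge-type linear system, which is why the iteration resembles Newton-Raphson; the cost per step scales with the number of active coefficients, so it stays expensive while many coefficients remain active.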
  18. 18. Computational Issue <ul><li>For large $d$ and small $m$, even a linear program can be computationally expensive </li></ul><ul><li>One can formulate the variable selection problem as multiple hypothesis testing and control the false discovery rate → suboptimal but fast </li></ul><ul><li>One can recover the sparse parameter with fewer than $d$ measurements, e.g., $n = O(m \log d)$ </li></ul><ul><li>This is related to signal representation with an incoherent basis, an active research area called compressed sensing </li></ul>
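The multiple-testing view can be illustrated with the Benjamini-Hochberg step-up procedure. The slide does not say which FDR procedure is meant, so the choice of Benjamini-Hochberg is an assumption, as are the function name and the example p-values (one per candidate variable).

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of variables selected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ranks p-values from smallest to largest
    thresholds = q * np.arange(1, m + 1) / m     # step-up thresholds q * k / m
    below = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))        # largest rank passing its threshold
        selected[order[: k + 1]] = True          # select everything up to that rank
    return selected

# Assumed per-variable p-values, e.g. from marginal tests of each coefficient.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_vals, q=0.05)
print(selected)
```

The procedure costs one sort of the $d$ p-values, which is why this route is fast, while testing each variable separately is what makes it suboptimal compared with a joint penalized fit.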
  19. 19. Application: Taxonomic Revision. Joint work with Yixin Chen, Hank Bart, Shuqing Huang. Carpiodes cyprinus (Quillback): large head; elongate body; elongate snout; no lip nipple. Carpiodes carpio (River carpsucker): small head; elongate body; short snout; lip nipple. Carpiodes velifer (Highfin carpsucker): small head; short, deep body; very short snout; lip nipple.
  20. 20. Body shape variation in Carpiodes (landmark data): Carpiodes carpio, Carpiodes cyprinus, Carpiodes velifer
  21. 21. DNA sequence variation: GH intron data, Tamura-Nei distance; average divergence = 1.4%. Figure labels: specimens of C. carpio, C. cyprinus, C. velifer, C. cf velifer and C. cf cyprinus from multiple river drainages (several marked as heterozygotes), with Ictiobus cyprinellus (Upper Miss.) as outgroup.
  22. 22. Joint feature selection and classification <ul><li>Using a logistic regression classifier (LRC) with controlled false discovery rate (FDR) for feature selection on landmark data, together with a 1-norm SVM, can resolve the issue of 53 undetermined specimens previously misdiagnosed as carpio </li></ul>
  23. 23. Application: Financial Engineering (looking for collaboration and support) <ul><li>Let $Y_i$ be the excess return of the $i$-th asset over the risk-free asset </li></ul><ul><li>Let $f_1, \ldots, f_K$ be the factors that influence the return of the market </li></ul><ul><li>In vector notation, the regression model is </li></ul><ul><ul><ul><li>$Y = Bf + e$ with unknown noise variance </li></ul></ul></ul><ul><li>Given $d$ stocks over $n$ periods, we want to pick the stocks that maximize the excess return with a controlled covariance matrix </li></ul>
  24. 24. Penalized Least Squares with Controlled Covariance <ul><li>Applying LS to each stock, we can estimate the regression coefficient matrix $B$ and the error variance </li></ul><ul><li>The covariance of the excess return is </li></ul><ul><li>$\mathrm{Var}(Y) = B\,\mathrm{Var}(f)\,B^T + \mathrm{Var}(e)$ </li></ul><ul><li>We assume $\mathrm{Var}(e)$ is a diagonal matrix with each diagonal element given by the regression error variance, and $\mathrm{Var}(f)$ is the sample covariance </li></ul><ul><li>Applying PLS achieves automatic factor selection </li></ul><ul><li>Applying a modified PCA can select stocks with a controlled covariance matrix </li></ul>
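The covariance estimate described above can be sketched with plain least squares. The data here are simulated and all sizes and names (`F`, `B_true`, `cov_Y`) are hypothetical; the sketch covers only the LS step and the covariance formula, not the PLS factor selection or the modified PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 500, 6, 2                      # periods, stocks, factors (assumed sizes)
F = rng.normal(size=(n, K))              # factor realizations, one row per period
B_true = rng.normal(size=(d, K))         # loadings used only to simulate data
E = 0.3 * rng.normal(size=(n, d))        # idiosyncratic noise
Y = F @ B_true.T + E                     # excess returns following Y = B f + e

# LS applied to each stock (all columns of Y solved at once, no intercept).
coef, *_ = np.linalg.lstsq(F, Y, rcond=None)     # shape (K, d)
B_hat = coef.T                                   # estimated loading matrix, shape (d, K)
resid = Y - F @ coef
var_e = (resid ** 2).sum(axis=0) / (n - K)       # regression error variance per stock

cov_f = np.cov(F, rowvar=False)                  # sample covariance of the factors
cov_Y = B_hat @ cov_f @ B_hat.T + np.diag(var_e) # Var(Y) = B Var(f) B^T + Var(e)
print(np.round(cov_Y, 2))
```

The structured estimate needs only $dK + d$ regression quantities plus a $K \times K$ factor covariance, instead of the $d(d+1)/2$ free entries of an unrestricted sample covariance.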
  25. 26. Implications for Stock Investment <ul><li>Constructing the factors that influence the market is itself a high-dimensional model selection problem with massive trading data (now at a resolution of one second) </li></ul><ul><li>The selected stocks may not be good in terms of future return → need to use a prediction model </li></ul><ul><li>Important factors can be identified even for day trading purposes → real-time forecasting is still hard </li></ul><ul><li>Simulation and real-money experiment: over 10% return in three months, but much work remains to be done </li></ul>
  26. 27. Example: Prediction is hard ?
  27. 28. Example: Prediction is hard
  28. 29. Example: Prediction is hard
  29. 30. Summary <ul><li>Statistical feature/variable selection can be viewed in a unified framework as maximizing penalized likelihood or minimizing penalized least squares </li></ul><ul><li>The desired penalty should yield a sparse, unbiased and continuous solution for the selected features </li></ul><ul><li>Local quadratic approximation is useful for solving the resulting optimization problem </li></ul><ul><li>Special algorithms are needed for very sparse feature selection problems; a few samples are adequate for reconstruction </li></ul><ul><li>Many applications in data mining and financial engineering require feature selection and dimension reduction </li></ul>