Intro to Model Selection


1. Introduction to Statistical Model Selection
   Huimin Chen, Department of Electrical Engineering, University of New Orleans, New Orleans, LA 70148
2. Typical Problem
   - A model explains the available data and predicts new observations.
   - But we do not know the real model:
     - unsure about the true data-generating mechanism
     - unsure which predictors are useful
   - The model set can be huge, or even infinite.
   - We have to narrow the search down to statistical models.
3. Modeling and Model Selection
   - Variable selection
     - Which factors are important?
     - Which statistical dependencies are significant?
     - Most problems amount to learning from data.
   - Bias-variance tradeoff
     - The key is to understand/interpret the penalty terms.
     - Goodness of fit vs. model complexity.
   - Other computational issues
     - Dimension reduction, optimization, ...
4. Outline of This Talk
   - Formulation of statistical model selection
   - General design criteria for model selection
     - Minimizing predictive risk
     - Bayesian methods
     - Information-theoretic measures
     - Adaptive methods
   - From model selection to model evaluation
     - Which model offers the best guaranteed predictive performance?
5. Regression Model
   - We have n i.i.d. samples (x_1, y_1), ..., (x_n, y_n) drawn from an unknown distribution P(x, y), and we want to infer this statistical dependency.
   - A generic model is
     y_i = f(x_i) + e_i, where the e_i are i.i.d. noise with unknown distribution.
   - Regression aims to find f from finitely many samples.
   - f can be a generalized linear function (see the sketch below):
     f(x) = w^T ψ(x) = Σ_i w_i ψ(x, θ_i), where ψ is a basis function.
   - f can be an affine expansion in a kernel:
     f(x) = Σ_i w_i k(x_i, x) + b, where k is a kernel function.
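A minimal runnable sketch of the generalized linear form above. The Gaussian basis functions, their centers and width, and the synthetic data are illustrative assumptions, not taken from the talk.

```python
# Sketch: y_i = f(x_i) + e_i with f(x) = w^T psi(x) = sum_i w_i psi(x, theta_i).
# The Gaussian basis psi, its centers/width, and the toy data are assumptions.
import numpy as np

def design_matrix(x, centers, width=0.5):
    """psi(x, theta_i): Gaussian bumps centered at `centers`."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)  # unknown f plus i.i.d. noise

centers = np.linspace(0, 1, 8)                  # theta_i: basis-function locations
Psi = design_matrix(x, centers)
w, *_ = np.linalg.lstsq(Psi, y, rcond=None)     # least-squares estimate of the weights w
f_hat = Psi @ w                                  # fitted values f(x_i)
```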
6. Empirical Risk Functional
   - For any "model" f, define a loss function L(y, f(x, θ)), where θ is the set of free parameters in the model.
   - We can choose θ to minimize the empirical risk R(θ) = (1/n) Σ_i L(y_i, f(x_i, θ)).
   - Problem: the "best fit" leads to an over-parameterized model (a sketch follows below).
   - One needs a measure to control model complexity.
   - Vapnik suggested structural risk minimization.
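A small sketch of minimizing the empirical risk with squared loss over models of increasing complexity; the polynomial model family and the synthetic data are illustrative assumptions. It shows why the "best fit" alone over-parameterizes.

```python
# Empirical risk R(theta) = (1/n) sum_i L(y_i, f(x_i, theta)) with squared loss,
# minimized over polynomial models of increasing degree (illustrative choice).
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(-1, 1, n)
y = x ** 3 - x + 0.2 * rng.standard_normal(n)    # toy data from an unknown mechanism

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x, y, degree)            # theta minimizing the empirical risk
    emp_risk = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree:2d}: empirical risk {emp_risk:.4f}")
# The empirical risk keeps decreasing with degree: the "best fit" over-parameterizes,
# which is what the complexity penalties on the following slides try to control.
```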
7. How to measure model complexity with finite data?
   - Occam's razor → various penalty terms
     - predictive risk
     - Bayesian model selection
     - information-theoretic measures: AIC, MDL, ...
     - regularization: SRM, ridge regression, ...
   - Finite-data evaluation → manipulate the samples
     - bootstrap, surrogate data
     - cross validation
     - boosting, bagging, ...
8. Predictive Risk
   - Use new samples to obtain the prediction error
     R(θ) = (1/m) Σ_i L(y_i, f(x_i, θ)), where the m samples (x_i, y_i) were not used to estimate f and/or θ.
   - The prediction error decomposes into approximation error + estimation error (bias vs. variance).
   - Linear regression: Y = Xβ + ε.
   - Given a model with p covariates, find an unbiased estimate of the (prediction) MSE.
9. Predictive Risk (Cont'd)
   - Compute the residual sum of squares RSS(p) and the unbiased estimates
     - σ^2 = RSS(p) / (n − p)
     - MSE = [RSS(p) + 2p σ^2] / n  (a sketch of this computation follows below)
   - Mallows' model complexity measure: C(p) ≈ p.
   - Issues
     - Consistency → the criterion asymptotically tends to overfit.
     - Effective bias → σ^2 is estimated assuming p << n.
     - Hard to apply to problems other than regression.
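A sketch of the unbiased prediction-MSE estimate from this slide, applied to nested linear models. The synthetic data and candidate covariates are illustrative assumptions; the formulas follow the slide (in practice σ^2 is often taken from the largest candidate model, consistent with the p << n caveat above).

```python
# MSE_hat(p) = [RSS(p) + 2 p sigma2_hat] / n with sigma2_hat = RSS(p) / (n - p),
# following the slide's formulas; data and candidate models are assumptions.
import numpy as np

def predictive_mse_estimate(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)            # residual sum of squares RSS(p)
    sigma2 = rss / (n - p)                       # unbiased noise-variance estimate
    return (rss + 2 * p * sigma2) / n            # unbiased estimate of the prediction MSE

rng = np.random.default_rng(2)
n = 100
X_full = rng.standard_normal((n, 6))
y = X_full[:, :2] @ np.array([1.5, -2.0]) + rng.standard_normal(n)

for p in range(1, 7):                            # nested models: first p covariates
    print(p, round(predictive_mse_estimate(X_full[:, :p], y), 3))
```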
10. Predictive Risk (Cont'd)
   - A generalization of the final prediction error (FPE) criterion → compare the approximation accuracy to the true model.
   - Akaike's Information Criterion (AIC)
     - Add 2p σ^2 to RSS(p) → C(p) ≈ p for linear regression.
   - Minimize an unbiased estimate of the divergence via a penalized log-likelihood → we will return to this under the information-theoretic criteria.
   - Issue: the estimate of relative entropy for the model with the smallest penalized log-likelihood is no longer unbiased.
11. Predictive Risk (Cont'd)
   - Estimate the out-of-sample prediction error directly (Stone) → cross validation.
   - The calculation simplifies with leave-one-out (see the sketch below):
     C(p) ≈ p (1 + p/n).
   - A look back at history
     - Fisher's (Pearson's?) chi-square test: RSS(p) / σ^2.
     - A less conservative procedure: Bonferroni's RIC → FDR control?
     - Multiple testing (Simes) → step-up, step-down tests.
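A sketch of leave-one-out cross validation for linear regression. The closed-form LOO residual e_i / (1 − h_ii) is a standard least-squares identity (not stated on the slide) that avoids refitting the model n times; the data are illustrative.

```python
# Leave-one-out CV for least-squares regression using the closed-form LOO residuals
# e_i / (1 - h_ii), where h_ii are the leverages (diagonal of the hat matrix).
import numpy as np

def loo_cv_score(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    H = X @ np.linalg.pinv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)                               # leverages h_ii
    return np.mean((resid / (1 - h)) ** 2)       # estimated out-of-sample squared error

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 3))])
y = X @ np.array([0.5, 1.0, 0.0, -1.0]) + rng.standard_normal(50)
print(loo_cv_score(X, y))
```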
12. Bayesian Model Selection
   - Bayes formula: prior × likelihood = posterior × evidence.
   - Example: ridge regression has a Bayesian interpretation (sketched below).
     - Bayes hierarchical model (Lindley and Smith).
     - The posterior mean shrinks to 0 as c → 0.
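A hedged reconstruction of the shrinkage statement, in my own notation (the slide's exact hierarchical formulation is not shown in this transcript): with a Gaussian likelihood and a zero-mean Gaussian prior of variance c, the ridge estimate is the posterior mean and shrinks to zero as c → 0.

```latex
% Ridge regression as a Bayes hierarchical model (notation mine, not the slide's):
y \mid \beta \sim \mathcal{N}(X\beta,\ \sigma^{2} I), \qquad
\beta \sim \mathcal{N}(0,\ c\, I)
\;\Longrightarrow\;
\mathbb{E}[\beta \mid y]
  = \Bigl(X^{\top}X + \tfrac{\sigma^{2}}{c}\, I\Bigr)^{-1} X^{\top} y
  \;\xrightarrow{\; c \to 0 \;}\; 0 .
```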
13. Bayesian Model Selection (Cont'd)
   - Consider only the evidence for each model, rather than the parameters contained in the model.
   - The posterior over models is obtained from the marginal likelihood.
   - Choose the model with the maximum a posteriori probability.
14. Bayesian Model Selection (Cont'd)
   - Bayes factor for comparing two models.
   - Approximate Bayes factor (Laplace's method): the derivation uses a quadratic approximation of the log-likelihood around its maximum (the slide's formulas, shown as images, are reconstructed below).
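The formulas on this slide did not survive the transcript; below is a hedged reconstruction of the standard Laplace/BIC approximation of the Bayes factor (my notation), matching the "quadratic approximation of the log-likelihood" step described above.

```latex
% Bayes factor and its Laplace approximation (standard form; notation mine).
B_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}, \qquad
p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k .
% Expanding the log of the integrand quadratically around its mode \hat\theta_k:
p(D \mid M_k) \;\approx\; p(D \mid \hat\theta_k, M_k)\, p(\hat\theta_k \mid M_k)\,
  (2\pi)^{d_k/2}\, \bigl|\tilde{H}_k\bigr|^{-1/2},
% where \tilde{H}_k is the negative Hessian of the log-likelihood at \hat\theta_k.
% Dropping O(1) terms gives the BIC approximation:
\log B_{12} \;\approx\;
  \log\frac{p(D \mid \hat\theta_1, M_1)}{p(D \mid \hat\theta_2, M_2)}
  \;-\; \frac{d_1 - d_2}{2}\,\log n .
```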
15. Bayesian Model Selection (Cont'd)
   - Empirical Bayes: try to find a sensible prior for the unknown parameters of each model.
   - Good news: for each model, both the evidence and the posterior distribution of the unknown parameters can be inferred.
   - Bad news:
     - The choice of prior may affect model selection.
       - Conjugate, empirical, noninformative, ...
     - The marginalization can be computationally intensive.
       - It often needs sampling-based methods, e.g., MCMC.
16. Information-Theoretic Measures
   - Model selection = data compression.
   - The compressed data are represented by a two-part code: encode the model parameters, then encode the data with the help of the model.
   - Model selection criteria differ mainly in how they encode the parameters.
   - The ultimate goal is a universal criterion for model selection: a fundamental problem in learning theory.
17. Information-Theoretic Measures (Cont'd)
   - Encoding the parameters in linear regression (computed in the sketch below):
     - Bayesian Information Criterion (BIC):
       (n/2) log RSS(p) + (p/2) log n
     - Stochastic Information Criterion (SIC):
       (n/2) log RSS(p) + (p/2) log SIC(p), where SIC(p) = [Y'Y − RSS(p)] / p
     - Akaike's Information Criterion (AIC):
       (n/2) log RSS(p) + p
     - Minimum Description Length (MDL):
       (n/2) log[RSS(p)/(n − p)] + (p/2) log SIC(p)
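A minimal sketch that evaluates the four criteria above for nested linear models; the formulas follow the slide, while the synthetic data and the candidate models are illustrative assumptions.

```python
# BIC, SIC, AIC, and MDL for linear regression, as written on the slide;
# the synthetic data and the nested candidate models are illustrative assumptions.
import numpy as np

def info_criteria(X, y):
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)             # RSS(p)
    sic_p = (y @ y - rss) / p                     # SIC(p) = [Y'Y - RSS(p)] / p
    return {
        "BIC": (n / 2) * np.log(rss) + (p / 2) * np.log(n),
        "SIC": (n / 2) * np.log(rss) + (p / 2) * np.log(sic_p),
        "AIC": (n / 2) * np.log(rss) + p,
        "MDL": (n / 2) * np.log(rss / (n - p)) + (p / 2) * np.log(sic_p),
    }

rng = np.random.default_rng(4)
n = 200
X_full = rng.standard_normal((n, 5))
y = X_full[:, :2] @ np.array([1.0, -1.0]) + rng.standard_normal(n)
for p in range(1, 6):
    vals = info_criteria(X_full[:, :p], y)
    print(p, {k: round(v, 1) for k, v in vals.items()})
```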
18. Information-Theoretic Measures (Cont'd)
   - So many choices: is any one right?
   - Universal modeling → the MDL principle.
     - There is a one-to-one correspondence between probability distributions and code-length functions, such that small probabilities correspond to large code lengths and vice versa.
     - Encode a sequence x_1, x_2, ..., x_n with the minimum number of bits, in a way that is universally good for a family of distributions.
       - Model = a single distribution.
       - Model class = a family of distributions.
19. MDL Principle
   - Universal model → minimize the worst-case regret.
   - The optimal distribution is the normalized maximum likelihood (NML) distribution (reconstructed below).
   - It is the "best" model that could explain the sequence.
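The NML formulas appeared as images on the slide; a hedged reconstruction in standard form (my notation) is:

```latex
% Worst-case regret and its minimizer, the normalized maximum likelihood (NML)
% distribution (standard definitions; notation mine, not the slide's).
\min_{q}\ \max_{x^{n}}\
  \log\frac{p\bigl(x^{n} \mid \hat\theta(x^{n})\bigr)}{q(x^{n})}
\quad\Longrightarrow\quad
p_{\mathrm{NML}}(x^{n}) \;=\;
  \frac{p\bigl(x^{n} \mid \hat\theta(x^{n})\bigr)}
       {\displaystyle \int p\bigl(y^{n} \mid \hat\theta(y^{n})\bigr)\, dy^{n}} ,
```

with −log p_NML(x^n) the code length (stochastic complexity) assigned to the model class.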
20. MDL Principle (Cont'd)
   - Under certain regularity conditions, the MDL principle minimizes the stochastic complexity (its asymptotic expansion is reconstructed below).
   - The MDL criterion is invariant under reparameterization of θ → both the dimension and the curvature of the parameter space matter.
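A hedged reconstruction of the asymptotic expansion presumably shown here (Rissanen's standard form; notation mine). It makes the bullet above concrete: the (k/2) log n term reflects the dimension, and the Fisher-information integral the curvature.

```latex
% Stochastic complexity of a k-parameter model class (standard asymptotic form):
-\log p_{\mathrm{NML}}(x^{n})
  \;=\; -\log p\bigl(x^{n} \mid \hat\theta(x^{n})\bigr)
  \;+\; \frac{k}{2}\,\log\frac{n}{2\pi}
  \;+\; \log \int_{\Theta} \sqrt{\det I(\theta)}\; d\theta
  \;+\; o(1),
% where I(theta) is the Fisher information matrix of the model class.
```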
21. MDL Principle (Cont'd)
   - Bayes and NML become indistinguishable if Jeffreys' prior is chosen.
   - Jeffreys' prior is uniform not on the parameter space but on the space of distributions, under the "natural metric" (Fisher) that measures the distance between distinguishable distributions.
   - For large n, the Bayesian predictive distribution concentrates more and more around the ML distribution.
22. MDL Principle (Cont'd)
   - Other competitive MDL criteria
     - Minimum message length (MML) by Wallace
     - MDL by prequential validation (Dawid)
     - Predictive MDL (Yu, Barron)
   - Computational issues
     - One never has to do any real coding.
     - The MLE of θ is usually easy to compute.
     - The normalization term in NML is hard to compute.
     - Be careful with the O(1) term in the asymptotic expansion.
23. MDL Principle (Cont'd)
   - Major difficulties of the MDL criterion
     - Comparing infinitely many models with finite data → the allowable model set is hard to determine.
     - It still requires parameter estimation → encoding with quantization.
     - The model complexity term can be unbounded → NML does not exist.
   - Possible extensions
     - Selecting the best model set
     - Selecting the best model for classification
     - Connecting information extraction to the foundations of learning theory
24. MDL Principle (Cont'd)
   - MDL for model set selection
     - Model sets and mixture distributions can also be encoded.
     - The difficulty is computational rather than with the MDL criterion itself.
   - MDL vs. Bayesian
     - Bayesian: the prior represents degrees of belief in different states of nature; the true distribution has a nonzero probability measure.
     - MDL: there is no such thing as a true distribution; inductive learning is based only on the regularities in the data, which will also be present in future data from the same phenomenon.
25. MDL Principle (Cont'd)
   - MDL for classification
     - Model selection over families of classifiers: decision trees, support vector machines, neural networks, ...
     - Problems with MDL: strange experimental results (Kearns); it can be inconsistent (Grunwald).
     - The problem is not just with MDL, but also with Bayesian classification under misspecification.
     - Consistent approaches, e.g., PAC-Bayes, ensure no asymptotic overfitting but lack a coding interpretation.
26. MDL Principle (Cont'd)
   - MDL for inductive learning
     - Major concern: consistency/rate of convergence → no result comparable to Vapnik's statistical learning theory.
     - Rissanen's extreme position:
       - The assumption that there exists a probability distribution generating the data is untenable in many applications.
       - Statistical inference that assumes a true distribution exists and seeks to recover it as fast as possible is methodologically flawed.
       - Model selection should be based on the properties of the data alone (cf. The Computer Journal, 42(4), 1999).
       - Nevertheless, if a true distribution does exist and is in the model set, the method had better find it given enough data.
27. MDL Principle (Cont'd)
   - MDL in practice
     - For probability models: yes.
       - MDL and Bayes give similar results but with different justifications.
       - It is even helpful in explaining findings in cognitive psychology.
     - For general loss functions and predictors: application dependent, not yet well developed.
       - Closely related to universal prediction (Merhav).
       - Has been applied to regression, time series, clustering, ...
       - Still involves many design parameters rather than a clean universal-coding interpretation.
       - One can possibly use the worst-case expected regret rather than the actual regret → the universal model of the second kind (Barron).
28. Statistical Regularization
   - A genuinely nonparametric approach:
     R(f) = (1/n) Σ_i L(y_i, f(x_i)) + λ ||Af||_L(x)
   - f can be any regression function (no parameter θ needed); A is an operator and L(x) is the Hilbert space of square-integrable functions on x with a proper measure.
   - One only needs to work in a reproducing kernel Hilbert space (RKHS):
     (nλI + K) c = Y  →  f(x) = Σ_i c_i K(x, x_i),
     where K(·,·) is the kernel function and K is the symmetric positive definite kernel matrix (a sketch follows below).
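A minimal sketch of the RKHS solution on this slide: solve (nλI + K)c = Y and predict with f(x) = Σ_i c_i K(x, x_i). The Gaussian (RBF) kernel, its width, the value of λ, and the toy data are illustrative assumptions.

```python
# Regularized regression in an RKHS: solve (n*lam*I + K) c = y, then predict with
# f(x) = sum_i c_i K(x, x_i).  Kernel choice, width, lam, and data are assumptions.
import numpy as np

def rbf_kernel(a, b, width=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 60))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(60)

n, lam = len(x), 1e-3
K = rbf_kernel(x, x)                                 # symmetric PSD Gram matrix
c = np.linalg.solve(n * lam * np.eye(n) + K, y)      # (n*lam*I + K) c = Y

x_new = np.linspace(0, 1, 5)
print(rbf_kernel(x_new, x) @ c)                      # f(x_new) = sum_i c_i K(x_new, x_i)
```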
29. Statistical Regularization (Cont'd)
   - Best choice of the regularization parameter λ (Cucker and Smale) → a unique solution λ* exists on a compact subspace that minimizes the approximation error to the true f*.
   - It can be interpreted as the best tradeoff between sample complexity and hypothesis-space complexity.
   - In statistics → regularized nonparametric least-squares regression.
   - Bayesian interpretation: use the prior P(f) = exp(−λ ||Af||) / Z.
   - Closely related to the Gaussian process model (MacKay).
30. Adaptive Methods
   - No longer universal → data-driven and semi-parametric, e.g., multiple models.
   - New metrics are needed → given a set of models, treat each model as a point in a complex high-dimensional space with a well-defined distance measure.
   - Adaptive algorithms are needed to generate and combine models:
     - Deterministic: greedy methods, boosting, ...
     - Stochastic: stochastic search, m-fold cross validation, ...
31. Adaptive Methods (Cont'd)
   - Many techniques, but no single unifying principle:
     - Bayesian: empirical Bayes, model averaging, ...
     - Boosting: combining classifiers by voting (soft selection).
     - Bagging: using bootstrap samples to improve predictive accuracy (for small data sizes); see the sketch below.
     - Subset selection
       - Efficient partitioning of training and testing data
       - Efficient algorithms for combining models trained on different data
   - Use empirical complexity measures, e.g., the VC dimension.
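A small bagging sketch, since the slide mentions it only in passing: fit the same base learner on bootstrap resamples and average the predictions. The base learner (a fixed-degree polynomial fit) and the toy data are illustrative assumptions.

```python
# Bagging: average predictors fit on bootstrap resamples of the training data.
# The base learner (degree-7 polynomial fit) and the toy data are assumptions.
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
x_test = np.linspace(-1, 1, 7)

B = 100
preds = np.zeros((B, len(x_test)))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # bootstrap resample (with replacement)
    coeffs = np.polyfit(x[idx], y[idx], 7)      # base learner on the resample
    preds[b] = np.polyval(coeffs, x_test)
print(preds.mean(axis=0))                       # aggregated (bagged) prediction
```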
32. From model selection to model evaluation
   - Model fitting: mimic the structure of the data.
   - Model testing: goodness of fit.
   - Model selection: bias-variance tradeoff.
   - Model evaluation:
     - Use data (at least partially) different from those used in model fitting and selection.
     - Choose the best split of the data between model fitting and model evaluation.
     - Find innovative ways to enlarge the data set, e.g., using surrogate data.
33. From model selection to model evaluation (Cont'd)
   - Issues in model evaluation
     - No clear picture of performance vs. data manipulation.
       - A model that works well on all available data → no guarantee of performance on future data.
       - The consistency issue stems from the lack of statistical assumptions.
       - Recall that the MDL principle tries to abandon the assumption of a true underlying distribution.
     - No sharp bound on the generalization error for general loss functions.
       - VC bounds are generally conservative for practical problems.
       - The hypothesis space should concentrate on "typical" models or families of distributions.
34. Summary
   - Information theory provides a unique angle on general machine learning and statistical model selection problems via the MDL principle.
   - A Bayesian interpretation of a model is appealing whenever one is available.
   - Many model selection criteria exist:
     - The MDL principle is simple, but its implementation is hard.
     - Cross validation is also important.
   - Practical applications will be covered later.
35. Future Research Directions
   - Good model classes that work with a small number of labeled pairs but a large number of unlabeled ones.
   - Measures of model complexity for models with hierarchical structure (e.g., decision trees, or even human brains).
   - A model selection principle that unifies information theory and statistical learning theory.
   - Efficient algorithms that can perform model selection over a large number of model sets.
36. Further Readings
   - J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, River Edge, NJ, 1989.
   - T. Hastie, et al. The Elements of Statistical Learning. Springer, 2001.
   - V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
   - Z. Zhao, H. Chen, and X. R. Li. "Semiparametric Model Selection with Applications to Regression", Proc. 2005 IEEE Workshop on Statistical Signal Processing, Bordeaux, France, July 2005.
   - H. Chen, Y. Bar-Shalom, K. R. Pattipati, and T. Kirubarajan. "MDL Approach for Multiple Low Observable Track Initiation", IEEE Trans. Aerospace and Electronic Systems, AES-39(3):862-882, Jul. 2003.