GKEL_IGARSS_2011.ppt

  1. Generalized Optimal Kernel-Based Ensemble Learning for Hyperspectral Classification Problems
     Prudhvi Gurram, Heesung Kwon, Image Processing Branch, U.S. Army Research Laboratory
  2. Outline
     - Current Issues
     - Sparse Kernel-Based Ensemble Learning (SKEL)
     - Generalized Kernel-Based Ensemble Learning (GKEL)
     - Simulation Results
     - Conclusions
  3. Current Issues
     Sample hyperspectral data (visible + near IR, 210 bands): grass vs. military vehicle.
     - High dimensionality of hyperspectral data vs. the curse of dimensionality
     - Small set of training samples (small targets)
     - The decision function of a classifier becomes overfitted to the small number of training samples
     - The idea is to find the underlying discriminant structure, NOT the noisy nature of the data
     - The goal is to regularize the learning so the decision surface is robust to noisy samples and outliers
     - Use ensemble learning
  4. Kernel-Based Ensemble Learning (Suboptimal Technique)
     Random subsets of spectral bands are drawn from the training data; each subset trains a sub-classifier (Support Vector Machine), SVM 1 through SVM N, with decision surfaces f_1 through f_N, and the ensemble decision is made by majority voting.
     - The idea is that not all of the subsets are useful for the given task
     - So select a small number of subsets that are useful for the task (a minimal sketch of this baseline appears below)
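     As a rough illustration of this baseline (not the authors' code), the following scikit-learn sketch trains SVMs on random band subsets and combines them by majority voting. The subset size of 20 bands, the RBF kernel settings, and labels in {-1, +1} are my own assumptions.

     import numpy as np
     from sklearn.svm import SVC

     def train_random_subspace_ensemble(X, y, n_svms=10, subset_size=20, seed=0):
         """Train one RBF-SVM per random subset of spectral bands."""
         rng = np.random.default_rng(seed)
         ensemble = []
         for _ in range(n_svms):
             bands = rng.choice(X.shape[1], size=subset_size, replace=False)
             svm = SVC(kernel="rbf", gamma="scale").fit(X[:, bands], y)
             ensemble.append((bands, svm))
         return ensemble

     def majority_vote(ensemble, X):
         # Each sub-classifier votes with its own band subset; ties go to +1.
         votes = np.array([svm.predict(X[:, bands]) for bands, svm in ensemble])
         return np.where(votes.sum(axis=0) >= 0, 1, -1)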
  5. Sparse Kernel-Based Ensemble Learning (SKEL)
     Training data, random subsets of features (random bands), a combined kernel matrix, and sub-classifiers SVM 1 through SVM N; the output is the optimal set of subsets useful for the given task.
     - To find the useful subsets, SKEL was developed, built on the idea of multiple kernel learning (MKL)
     - It jointly optimizes the SVM-based sub-classifiers in conjunction with their weights
     - In the joint optimization, an L1 constraint is imposed on the weights to make them sparse
  6. Optimization Problem
     Multiple kernel learning formulation (Rakotomamonjy et al.), with an L1-norm constraint on the kernel weights to induce sparsity; see the sketch below.
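     The formula on this slide was an image in the original deck; as a sketch in my own notation, the SimpleMKL-style joint optimization that SKEL builds on (Rakotomamonjy et al.) can be written as

     \[
     \min_{\{w_k\},\,b,\,\xi\ge 0,\,d}\ \ \frac{1}{2}\sum_{k=1}^{N}\frac{\|w_k\|^{2}}{d_k} \;+\; C\sum_{i=1}^{n}\xi_i
     \quad\text{s.t.}\quad
     y_i\Big(\sum_{k=1}^{N} w_k^{\top}\phi_k(x_i) + b\Big)\ \ge\ 1-\xi_i,
     \qquad
     \sum_{k=1}^{N} d_k = 1,\ \ d_k\ge 0,
     \]

     with the convention w_k = 0 whenever d_k = 0. Here kernel k is built from one random band subset, and the simplex (L1-norm) constraint on the weights d_k drives most of them to zero, leaving only the useful subsets.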
  7. Generalized Sparse Kernel-Based Ensemble (GKEL)
     - SKEL
       - SKEL is a useful classifier with improved performance
       - However, there are some constraints in using SKEL
       - SKEL has to use a large number of initial SVMs to maximize the ensemble performance, which can cause a memory error due to the limited memory size
       - The number of features selected for every SVM has to be the same, which also causes sub-optimality in choosing the feature subspaces
     - GKEL
       - Relaxes the constraints of SKEL
       - Uses a bottom-up approach: starting from a single classifier, sub-classifiers are added one by one until the ensemble converges, while a subset of features is optimized for each sub-classifier (a sketch of this loop follows below)
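     The following is a minimal sketch of the bottom-up loop described above, not the authors' implementation. The helper names select_violating_subset and fit_ensemble_weights are hypothetical stand-ins for the restricted master problem and the joint weight optimization; the subset size of 20 bands, the convergence test, and labels in {-1, +1} are my assumptions.

     import numpy as np
     from sklearn.svm import SVC

     def select_violating_subset(X, y, used_subsets, n_bands, rng):
         """Stand-in for finding the next 'most violated' band subset;
         here it simply draws a random subset that was not used before."""
         while True:
             subset = tuple(sorted(rng.choice(n_bands, size=20, replace=False)))
             if subset not in used_subsets:
                 return subset

     def fit_ensemble_weights(scores, y):
         """Stand-in for the joint weight optimization: weight each
         sub-classifier by its training accuracy and L1-normalize."""
         acc = np.array([np.mean(np.sign(s) == y) for s in scores])
         w = np.maximum(acc - 0.5, 0.0)
         return w / w.sum() if w.sum() > 0 else np.ones_like(w) / len(w)

     def gkel_fit(X, y, max_svms=20, tol=1e-3, seed=0):
         rng = np.random.default_rng(seed)
         subsets, models, scores, prev_obj = [], [], [], -np.inf
         for _ in range(max_svms):
             subset = select_violating_subset(X, y, set(subsets), X.shape[1], rng)
             model = SVC(kernel="rbf", gamma="scale").fit(X[:, subset], y)
             subsets.append(subset)
             models.append(model)
             scores.append(model.decision_function(X[:, subset]))
             weights = fit_ensemble_weights(scores, y)
             ensemble = np.sign(np.dot(weights, np.vstack(scores)))
             obj = np.mean(ensemble == y)      # proxy for the ensemble objective
             if obj - prev_obj < tol:          # ensemble has converged: stop adding SVMs
                 break
             prev_obj = obj
         return models, subsets, weights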
  8. Sparse SVM Problem
     - GKEL is built on the sparse SVM problem* that finds the optimal sparse features maximizing the margin of the hyperplane
     - Primal optimization problem: see the hedged sketch below
     - The goal is to find an optimal feature-selection vector, resulting in an optimal weight vector that maximizes the margin of the hyperplane
     * Tan et al., "Learning sparse SVM for feature selection on very high-dimensional datasets," ICML 2010
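     The primal formula on this slide was an image in the original deck; as a hedged sketch in the spirit of the cited paper (my notation; the exact loss and the presence of a bias term may differ), with d a binary feature-selection vector and B the sparsity budget:

     \[
     \min_{d\in\{0,1\}^{m},\ \mathbf{1}^{\top} d\,\le\,B}\ \ \min_{w,\,b,\,\xi\ge 0}\ \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i
     \quad\text{s.t.}\quad
     y_i\big(w^{\top}(d\odot x_i) + b\big)\ \ge\ 1-\xi_i .
     \]

     The optimal d selects the sparse feature subset, and the corresponding w maximizes the margin on those features.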
  9. Dual Problem of Sparse SVM
     - Using Lagrange multipliers and the KKT conditions, the primal problem can be converted to the dual problem (sketched below)
     - The resulting mixed-integer programming problem is NP-hard
     - Since there are a large number of different combinations of sparse features, the number of possible kernel matrices is huge
     - It is a combinatorial problem
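     The slide's formula was an image; as a sketch under the primal above (my notation), the dual takes the mixed-integer form

     \[
     \min_{d\in\{0,1\}^{m},\ \mathbf{1}^{\top} d\,\le\,B}\ \ \max_{\alpha\in\mathcal{A}}\ \ \mathbf{1}^{\top}\alpha \;-\; \frac{1}{2}\,(\alpha\circ y)^{\top} K_{d}\,(\alpha\circ y),
     \qquad
     \mathcal{A}=\Big\{\alpha:\ 0\le\alpha_i\le C,\ \textstyle\sum_i \alpha_i y_i = 0\Big\},
     \]

     where K_d is the kernel matrix computed only on the features selected by d. Every feasible d defines a different kernel matrix, which is the combinatorial difficulty the slide points out.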
  10. Relaxation into QCLP
     - To make the mixed-integer problem tractable, relax it into a Quadratically Constrained Linear Program (QCLP)
     - The objective function is converted into inequality constraints lower bounded by a real value (see the sketch below)
     - Since the number of possible feature subsets is huge, so is the number of constraints; the QCLP problem is therefore still hard to solve
     - But among the many constraints, most are not actively used in solving the optimization problem
     - The goal is to find the small number of constraints that are actively used
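     As a sketch of the relaxation (my notation; sign conventions may differ from the slide): swap the min over d and the max over alpha, introduce a scalar theta for the inner minimum, and the problem becomes a QCLP with a linear objective and one quadratic constraint per candidate subset:

     \[
     \max_{\theta,\ \alpha\in\mathcal{A}}\ \theta
     \quad\text{s.t.}\quad
     \mathbf{1}^{\top}\alpha \;-\; \tfrac{1}{2}\,(\alpha\circ y)^{\top} K_{d}\,(\alpha\circ y)\ \ge\ \theta
     \qquad\text{for every feasible } d.
     \]

     Each constraint is the dual objective lower bounded by theta, and only a handful of these constraints are active at the optimum.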
  11. Illustrative Example (Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
     - Suppose an optimization problem with a large number of inequality constraints (e.g. an SVM)
     - Among the many constraints, most are not used to define the feasible region or the optimal solution
     - Only a small number of active constraints are needed to define the feasible region
     - Use a technique called the restricted master problem, which finds the active constraints by identifying the most violated constraints one by one, iteratively
     - Find the first most violated constraint
  12. (Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
     - Use the restricted master problem, which finds the most violated constraints (features) one by one, iteratively
     - Find the first most violated constraint
     - Based on the previously found constraints, find the next most violated constraint
  13. (Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
     - Use the restricted master problem, which finds the most violated constraints (features) one by one, iteratively
     - Find the first most violated constraint
     - Based on the previously found constraints, find the next one
     - Continue the iterative search until no violated constraints are found
  14. (Yisong Yue, "Diversified Retrieval as Structured Prediction," ICML 2008)
     - Use the restricted master problem, which finds the most violated constraints (features) one by one, iteratively
     - Find the first most violated constraint
     - Then the next one
     - Continue until no violated constraints are found
  15. Flow Chart
     - Flow chart of the QCLP problem based on the restricted master problem: the loop repeats while a violated constraint is found (Yes) and terminates otherwise (No)
  16. Most Violated Features
     - Linear kernel
       - Calculate a per-feature score separately for each feature and select the features with the top values
       - This does not work for non-linear kernels
     - Non-linear kernel
       - Individual feature ranking no longer works, because the kernel exploits non-linear correlations among all the features (e.g. a Gaussian RBF kernel)
       - Calculate the constraint value using all the features except one feature at a time
       - Eliminate the least contributing feature
       - Repeat the elimination until a threshold condition is met (e.g. if the change exceeds 30%, stop the iteration)
       - This yields variable-length feature subsets for different SVMs (a sketch follows this slide)
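     The following is a minimal sketch of the backward elimination described for non-linear kernels, assuming a Gaussian RBF kernel, a fixed dual vector alpha, labels in {-1, +1}, and my own reading of the 30% stopping rule (change measured against the full-feature value). Function names and the exact criterion are illustrative choices, not the authors' code.

     import numpy as np

     def rbf_kernel(X, gamma=0.1):
         sq = np.sum(X**2, axis=1)
         d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
         return np.exp(-gamma * d2)

     def dual_quadratic_term(X, y, alpha, features, gamma=0.1):
         """(alpha*y)^T K (alpha*y) evaluated on the given feature subset."""
         K = rbf_kernel(X[:, features], gamma)
         v = alpha * y
         return float(v @ K @ v)

     def eliminate_features(X, y, alpha, threshold=0.30, gamma=0.1):
         features = list(range(X.shape[1]))
         base = dual_quadratic_term(X, y, alpha, features, gamma)
         current = base
         while len(features) > 1:
             # Contribution of each remaining feature: change when it is removed.
             trials = {j: dual_quadratic_term(X, y, alpha,
                                              [f for f in features if f != j], gamma)
                       for j in features}
             j_least = min(trials, key=lambda j: abs(current - trials[j]))
             new_val = trials[j_least]
             # Stop once the cumulative change from the full-feature value
             # exceeds the threshold (the slide's "30%" rule, as I read it).
             if abs(base - new_val) / (abs(base) + 1e-12) > threshold:
                 break
             features.remove(j_least)
             current = new_val
         return features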
  17. How GKEL Works
     (Diagram: the ensemble is assembled from sub-classifiers SVM 1, SVM 2, SVM 3, ..., SVM N, added one at a time.)
  18. Images for Performance Evaluation
     Hyperspectral images (HYDICE, 210 bands, 0.4 to 2.5 microns): Forest Radiance I and Desert Radiance II; training samples are marked on the images.
  19. Performance Comparison (FR I)
     Single SVM (Gaussian kernel) vs. SKEL with 10 initial SVMs reduced to 2 (Gaussian kernel) vs. GKEL with 3 SVMs (Gaussian kernel).
  20. ROC Curves (FR I)
     - Since each SKEL run uses different random subsets of spectral bands, 10 SKEL runs were used to generate 10 ROC curves
  21. Performance Comparison (DR II)
     Single SVM (Gaussian kernel) vs. GKEL with 3 SVMs (Gaussian kernel) vs. SKEL with 10 initial SVMs reduced to 2 (Gaussian kernel).
  22. Performance Comparison (DR II)
     - 10 ROC curves from 10 SKEL runs, each run with different random subsets of spectral bands
  23. Performance Comparison: Spambase Data
     - Data downloaded from the UCI machine learning repository (Spambase), used to predict whether an email is spam or not
     - SKEL: 25 initial SVMs, 12 after optimization
     - GKEL: 14 SVMs with nonzero weights
  24. Conclusions
     - SKEL and a generalized version of SKEL (GKEL) have been introduced
     - SKEL starts from a large number of initial SVMs, which are then optimized down to the small number of SVMs useful for the given task
     - GKEL starts from a single SVM, and individual classifiers are added optimally, one by one, until the ensemble converges
     - GKEL and SKEL generally perform better than a regular SVM
     - GKEL performs as well as SKEL while using fewer resources (memory) than SKEL
  25. Q&A
  26. Optimally Tuning Kernel Parameters
     - Prior to the L1 optimization, the kernel parameters of each SVM are optimally tuned
     - A Gaussian kernel with a single bandwidth had been used, treating all the bands equally, which is suboptimal; instead, use a full-band diagonal Gaussian kernel
     - Estimate an upper bound on the leave-one-out (LOO) error, the radius-margin bound, defined by the radius of the minimum enclosing hypersphere and the margin of the hyperplane
     - The goal is to minimize the RM bound using the gradient descent technique (a sketch follows below)
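     As a sketch, one common form of the radius-margin bound referenced here (constant factors vary across references, and this may not match the slide exactly):

     \[
     \text{LOO errors}\ \le\ 4\,R^{2}\,\|w\|^{2}\ =\ 4\,\frac{R^{2}}{\gamma^{2}},
     \]

     where R is the radius of the minimum enclosing hypersphere of the mapped training data and gamma = 1/||w|| is the margin of the hyperplane. Because this bound is differentiable in the kernel parameters, the per-band Gaussian bandwidths can be tuned by gradient descent on it.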
  27. Ensemble Learning
     Sub-classifier 1, sub-classifier 2, ..., sub-classifier N; their individual outputs (e.g. -1, -1, 1) are combined into an ensemble decision, giving a regularized decision function that is robust to noise and outliers.
     - The performance of each classifier is better than a random guess, and the classifiers are independent of each other
     - By increasing the number of classifiers, the performance is improved (a worked bound follows below)
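     As a sketch of why these two assumptions help (the standard majority-vote argument, not taken from the slide): if each of N independent sub-classifiers errs with probability eps < 1/2, the majority vote errs with probability

     \[
     P_{\text{err}} \;=\; \sum_{k > N/2} \binom{N}{k}\,\varepsilon^{k}\,(1-\varepsilon)^{N-k},
     \]

     which decreases toward zero as N grows. For example, eps = 0.3 and N = 11 gives roughly 0.08, already much lower than any single classifier's 0.3.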
  28. SKEL: Comparison (Top-Down Approach)
     Training data, random subsets of features (random bands), sub-classifiers SVM 1 through SVM N, and a combination of the decision results.
  29. Iterative Approach to Solve QCLP
     - Due to the very large number of quadratic constraints, the QCLP problem is hard to solve
     - So, take an iterative approach
     - Iteratively update the solution based on a limited number of active constraints
  30. Each Iteration of QCLP
     - The intermediate solution pair is therefore obtained from the restricted problem over the constraints found so far (a hedged sketch follows below)
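     The expression on this slide was an image in the original deck; as a hedged sketch in the notation used above, at iteration t with the active subsets d_1, ..., d_t found so far, the restricted problem is

     \[
     (\alpha^{(t)},\ \theta^{(t)}) \;=\; \arg\max_{\theta,\ \alpha\in\mathcal{A}}\ \theta
     \quad\text{s.t.}\quad
     \mathbf{1}^{\top}\alpha \;-\; \tfrac{1}{2}\,(\alpha\circ y)^{\top} K_{d_k}\,(\alpha\circ y)\ \ge\ \theta,
     \qquad k = 1,\dots,t,
     \]

     which has the same form as a multiple kernel learning problem over the t kernels collected so far (the comparison drawn on the next slide).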
  31. Iterative QCLP vs. MKL
  32. Variable Length Features
     - Applying a threshold to the elimination criterion leads to variable-length feature subsets
     - Stop the iterations when the portion of the 2-norm of w contributed by the least contributing features exceeds the predefined threshold (e.g. 30%)
  33. GKEL Preliminary Performance: Chemical Plume Data
     - SKEL: 50 initial SVMs, 8 after optimization
     - GKEL: 7 SVMs with nonzero weights (22)
  34. Relaxation into QCLP
  35. QCLP
  36. L1 and Sparsity
     - Linear inequality constraints
     - L2 optimization vs. L1 optimization (see the note below)
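     The figures on this slide did not survive the text export; as a sketch of the geometric intuition usually drawn for this comparison (my notation), contrast

     \[
     \min_{w}\ L(w)\ \ \text{s.t.}\ \ \|w\|_{1}\le \tau
     \qquad\text{vs.}\qquad
     \min_{w}\ L(w)\ \ \text{s.t.}\ \ \|w\|_{2}\le \tau .
     \]

     The L1 ball is a polytope described by linear inequality constraints, with vertices on the coordinate axes, so the optimum tends to land on a vertex or low-dimensional face where many coordinates are exactly zero; the smooth L2 ball rarely yields exact zeros, which is why the L1 constraint is what makes the ensemble weights sparse.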
