2009 asilomar


Published in: Technology, Education

  1. 1. Active Learning Schemes for Reduced Dimensionality Hyperspectral Classification

Vikram Jayaram, Bryan Usevitch
Dept. of Electrical & Computer Engineering, The University of Texas at El Paso
500 W. University Ave, El Paso, Texas 79968-0523
{jayaram, usevitch}@ece.utep.edu

∗ This work was supported by NASA Earth System Science Fellowship NNX06AF68H.
978-1-4244-5827-1/09/$26.00 ©2009 IEEE. Asilomar 2009.

Abstract

Statistical schemes have certain advantages which promote their use in various pattern recognition problems. In this paper, we study the application of two statistical learning criteria for material classification of Hyperspectral remote sensing data. In most cases, the Hyperspectral data is characterized using a Gaussian mixture model (GMM). The difficulty in using a statistical model such as the GMM lies in estimating the class-conditional probability density functions (PDFs) from the exemplars available in the training data for each class. We demonstrate the use of two training methods, dynamic component allocation (DCA) and the minimum message length (MML) criterion, that are employed to learn the mixture observations. The training schemes are then evaluated using the Bayesian classifier.

1 Introduction

Classification of Hyperspectral imagery (HSI) data is a challenging problem for two main reasons. First, with the limited spatial resolution of HSI sensors and/or the distance of the observed scene, the images invariably contain pixels composed of several materials. It is desirable to resolve the contributions of the constituents from the observed image without relying on high spatial resolution images. Remote sensing cameras have been designed to capture a wide spectral range, motivating the use of post-processing techniques to distinguish materials via their spectral signatures. Secondly, the available training data for most pattern recognition problems in HSI processing is severely inadequate. Under the framework of statistical classifiers, Hughes [8] was able to demonstrate the impact of this problem on a theoretical basis. Concerning the second problem, feature extraction and optimal band selection are the methods most commonly used for finding useful features in high-dimensional data. On the other hand, reduced dimensionality algorithms suffer from a theoretical loss of performance. This loss is due to the reduction of the data to features and to the further approximation of the theoretical features by PDFs. Although some loss of performance is inevitable for PDF-based classification methods, their evaluation in an HSI classification regime is the prime focus of this paper.

2 Mixture Modeling

To motivate the use of a finite mixture PDF model for HSI, consider a random variable $X$. A finite mixture model decomposes its PDF $f(x)$ into a sum of $K$ class PDFs. In other words, the density function $f(x)$ is semiparametric, since it may be decomposed into $K$ components. Let $f_k(x)$ denote the $k$-th class PDF. The finite mixture model with $K$ components expands as

$$f(x) = \sum_{k=1}^{K} a_k f_k(x), \tag{1}$$

where $a_k$ denotes the proportion of the $k$-th class. The proportion $a_k$ may be interpreted as the prior probability of observing a sample from class $k$. Furthermore, the prior probabilities $a_k$ must be nonnegative and sum to one, or

$$a_k \ge 0 \quad \text{for } k = 1, \dots, K, \tag{2}$$

where

$$\sum_{k=1}^{K} a_k = 1. \tag{3}$$
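As a minimal sketch of the finite mixture model and its constraints on the proportions, the following evaluates a mixture density for 1-D Gaussian components. The function names `gaussian_pdf` and `mixture_pdf`, the tolerance `tol`, and the choice of 1-D Gaussian class PDFs are assumptions made for illustration and do not come from the paper.

```python
import math
from typing import Callable, Sequence


def gaussian_pdf(mean: float, std: float) -> Callable[[float], float]:
    """Return a 1-D Gaussian density with the given mean and standard deviation."""
    def pdf(x: float) -> float:
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf


def mixture_pdf(x: float,
                weights: Sequence[float],
                components: Sequence[Callable[[float], float]],
                tol: float = 1e-9) -> float:
    """Evaluate f(x) = sum_k a_k f_k(x) after checking the weight constraints."""
    if len(weights) != len(components):
        raise ValueError("need one weight per component")
    if any(a < 0.0 for a in weights):
        raise ValueError("proportions must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("proportions must sum to one")
    return sum(a * f(x) for a, f in zip(weights, components))


# Hypothetical two-class example: the proportions act as class priors.
density = mixture_pdf(0.5, [0.3, 0.7], [gaussian_pdf(0.0, 1.0), gaussian_pdf(2.0, 0.5)])
```

Rejecting invalid proportions up front keeps the returned value a proper density, which matters once the proportions are re-estimated during training.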
  2. 2. On similar grounds, multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM) [1]. Normally, a GM in the form of the PDF for $z \in \mathbb{R}^P$ is given by

$$p(z) = \sum_{i=1}^{L} \alpha_i \, \mathcal{N}(z, \mu_i, \Sigma_i),$$

where

$$\mathcal{N}(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (z - \mu_i)^{\top} \Sigma_i^{-1} (z - \mu_i) \right\}.$$

Here $L$ is the number of mixture components and $P$ the number of spectral channels (bands). The GM parameters are denoted by $\lambda = \{\alpha_i, \mu_i, \Sigma_i\}$. These parameters are estimated using maximum likelihood (ML) by means of the expectation-maximization (EM) algorithm.

Figure 1. The scene is a 1995 AVIRIS image of Cuprite field in Nevada with the training regions overlaid.

Figure 1 shows the data sets used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. The training regions in the HSI data are identified heuristically from mineral maps provided by Clark et al. [2]. The remote sensing data sets that we have used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (bands) with wavelengths from 0.4 to 2.5 μm. AVIRIS is flown all across the US, Canada and Europe. Since HSI imagery is highly correlated in the spectral direction, the minimum noise fraction (MNF) transform is a natural choice for dimensionality reduction. This transform is also called the noise adjusted principal component (NAPC) transformation. Like principal component analysis (PCA), this technique is also used to determine the inherent dimensionality of the imagery data. The transformation segregates noise in the data and reduces the computational requirements for subsequent processing [5]. Figure 2 shows the 2D "scatter" plot of the first two MNF components of the original Cuprite data.

Figure 2. 2D scatter plot of the data using the first two MNF bands (horizontal axis: MNF Band 1, vertical axis: MNF Band 2).

3 Dynamic Component Allocation

In spite of the good mathematical tractability of the GMM, there are challenges in training a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown. Not knowing the true number of mixing components is a major learning problem for a mixture classifier using EM [4]. The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can effectively add and remove kernel components to adequately characterize the input data. This methodology also increases the chances of escaping from one of the many local maxima of the likelihood function. The solution to the component initialization is based on a greedy EM approach which begins the GM training with a single component [6]. Components or modes are then added in a sequential manner until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture in that form. This incremental mixture density function uses a combination of global and local search each time a new kernel component is added to the mixture. We shall now describe in detail the following three operations: merging, splitting and pruning of the GMM.

3.1 Merging of Modes

Merging is one of the processes in this proposed training scheme wherein a single mode is created from two identical
  3. 3. ones. The closeness between the mixture modes is given by a metric $d$. For example, consider two PDFs $p_1(x)$ and $p_2(x)$. Let $X_1$ be a collection of points $x_i$ near the central peak of $p_1(x)$ and $X_2$ another set of points near the central peak of $p_2(x)$. The closeness metric $d$ is then given by

$$d = \log \frac{\prod_{x_i \in X_1} p_1(x_i) \, \prod_{x_i \in X_2} p_2(x_i)}{\prod_{x_i \in X_1} p_2(x_i) \, \prod_{x_i \in X_2} p_1(x_i)}. \tag{4}$$

Notice that this metric is zero when $p_1(x) = p_2(x)$ and greater than zero for $p_1(x) \neq p_2(x)$. A pre-determined threshold is set to determine whether the modes are too close to each other. Since we assume that $p_1(x)$ and $p_2(x)$ are just two Gaussian modes, it is easy to find good points for $X_1$ and $X_2$. We choose the means (centers) and then go one standard deviation in each direction along all the principal axes. The principal axes are found by the SVD of $R$ (the Cholesky factor of the covariance matrix). If the two modes are found to be too close, they are merged, forming a weighted sum of the two modes (weighted by $\alpha_1$, $\alpha_2$). The mean of this newly merged mode is

$$\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}. \tag{5}$$

Here $\mu_1$ and $\mu_2$ are the means of the components before merging and $\mu$ is the resultant mean after merging the two components.

The proper way to form a weighted combination of the covariances is not simply a weighted sum of the covariances, since that does not take into account the separation of the means. A more careful technique is therefore needed. Consider the Cholesky decomposition of the covariance matrix, $\Sigma = R^{\top} R$. The rows of $\sqrt{P}\,R$ can be regarded as samples of $P$-dimensional vectors whose covariance is $\Sigma$, where $P$ is the dimension. The sample covariance is given by $\frac{1}{P}(\sqrt{P})^2 R^{\top} R = \Sigma$. Now, given the two modes to merge, we regard $\sqrt{P}\,R_1$ and $\sqrt{P}\,R_2$ as two populations to be joined. The sample covariance of the collection of rows is the desired covariance. But this would assign equal weight to the two populations. To weight them by their respective weights, we multiply them by $\frac{\alpha_1}{\alpha_1 + \alpha_2}$ and $\frac{\alpha_2}{\alpha_1 + \alpha_2}$. Before they can be joined, however, they must be shifted so that they are re-referenced to the new central mean.

3.2 Splitting of Components

On the other hand, if the number of components is too low, components are split in order to increase the total number of components. Vlassis et al. [6] define a method to monitor the weighted kurtosis of each mode, which directly determines the number of mixture components. This kurtosis measure is given by

$$K_i = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,$$

where

$$w_{n,i} = \frac{\mathcal{N}(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} \mathcal{N}(z_n, \mu_i, \Sigma_i)}.$$

Therefore, if $|K_i|$ is too high for any component (mode) $i$, the mode is split into two. This can be extended to higher dimensions by considering skew in addition to kurtosis, where each data sample $z_n$ is projected onto the $j$-th principal axis of $\Sigma_i$ in turn. Let $z^{j}_{n,i} = (z_n - \mu_i)^{\top} v_{i,j}$, where $v_{i,j}$ is the $j$-th column of $V$, obtained from the SVD of $\Sigma_i$. Then, for each $j$,

$$K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^{j}_{n,i} / s_{i,j} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,$$

$$\psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^{j}_{n,i} / s_{i,j} \right)^3}{\sum_{n=1}^{N} w_{n,i}},$$

$$m_{i,j} = |K_{i,j}| + |\psi_{i,j}|,$$

where

$$s_{i,j}^2 = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^{j}_{n,i} \right)^2}{\sum_{n=1}^{N} w_{n,i}}.$$

Now, if $m_{i,j} > \tau$ for any $j$, mode $i$ is split. The split creates two new modes at $\mu = \mu_i + v_{i,j} S_{i,j}$ and $\mu = \mu_i - v_{i,j} S_{i,j}$, where $S_{i,j}$ is the $j$-th singular value of $\Sigma_i$. The same covariance $\Sigma_i$ is used for each new mode. The decision to split also depends on the mixing proportion $\alpha_i$: splitting does not take place if the value of $\alpha_i$ is too small. The optional threshold parameter $\tau$ allows control over splitting; with a higher threshold, a mode is less likely to be split.

3.3 Pruning of Modes

When the number of components becomes high, components are pruned out as their mixing weights $\alpha_i$ fall. Pruning removes weak modes from the overall mixture. A weak mode is identified by checking $\alpha_i$ against a certain threshold. Once identified, weak modes are removed and the remaining $\alpha_i$ are renormalized such that $\sum_i \alpha_i = 1$. It is equally important that the algorithm does not annihilate many moderately weak modes all at once. This is achieved by setting up two input threshold values.
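The merge, split and prune operations can be illustrated with a short 1-D sketch. It is a sketch under stated assumptions, not the authors' implementation: the merged variance uses the standard moment-matching formula (each variance re-referenced to the new mean) in place of the Cholesky-row construction, the merged weight is taken to be $\alpha_1 + \alpha_2$, and pruning uses a single threshold although the text mentions two. The function names `merge_modes`, `split_statistic` and `prune` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def gaussian(z: float, mu: float, var: float) -> float:
    """1-D Gaussian density N(z; mu, var)."""
    return math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def merge_modes(a1: float, mu1: float, var1: float,
                a2: float, mu2: float, var2: float) -> Tuple[float, float, float]:
    """Merge two 1-D modes; the merged mean is the weighted mean of Eq. (5)."""
    total = a1 + a2  # assumed weight of the merged mode
    mu = (a1 * mu1 + a2 * mu2) / total
    # Moment matching (assumption): shift each variance to the new central mean.
    var = (a1 * (var1 + (mu1 - mu) ** 2) + a2 * (var2 + (mu2 - mu) ** 2)) / total
    return total, mu, var


def split_statistic(samples: Sequence[float], mu: float, var: float) -> float:
    """Return |kurtosis| + |skew| for one mode, with weights proportional to N(z_n; mu, var)."""
    w = [gaussian(z, mu, var) for z in samples]
    total = sum(w)
    s = math.sqrt(sum(wn * (z - mu) ** 2 for wn, z in zip(w, samples)) / total)
    kurt = sum(wn * ((z - mu) / s) ** 4 for wn, z in zip(w, samples)) / total - 3.0
    skew = sum(wn * ((z - mu) / s) ** 3 for wn, z in zip(w, samples)) / total
    return abs(kurt) + abs(skew)


def prune(weights: Sequence[float], threshold: float) -> Tuple[List[int], List[float]]:
    """Drop modes whose weight is below threshold and renormalize the survivors."""
    kept = [i for i, a in enumerate(weights) if a >= threshold]
    if not kept:
        raise ValueError("threshold would remove every mode")
    total = sum(weights[i] for i in kept)
    return kept, [weights[i] / total for i in kept]


# Hypothetical usage: merge two equal-weight modes, then prune a weak third mode.
merged = merge_modes(0.5, 0.0, 1.0, 0.5, 2.0, 1.0)
kept_idx, new_weights = prune([0.5, 0.45, 0.05], threshold=0.1)
```

A mode would be split when `split_statistic` exceeds the threshold $\tau$ and its mixing weight is not too small; the values of both thresholds are left to the user here.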
  4. 4. Finally, once the number of modes settles out, Q stops increasing and convergence is achieved. Hence, the DCA technique for GMM approximation of HSI observations demonstrates that the combination of covariance constraints, mode pruning, merging and splitting can result in a good PDF approximation of the HSI mixture models.

4 Minimum-Message Length Criteria

Let us now consider the second mixture learning technique, based on the minimum message length (MML) criterion. This method is also known as the Figueiredo-Jain algorithm [13]. Applying the MML criterion to mixture models leads to the following objective function

$$\Lambda(\lambda, Z) = \frac{V}{2} \sum_{\{c:\, \alpha_c > 0\}} \log\left( \frac{N \alpha_c}{12} \right) + \frac{C_{nz}}{2} \log \frac{N}{12} + \frac{C_{nz}(V + 1)}{2} - \log L(Z, \lambda),$$

where $N$ is the number of training points, $V$ is the number of free parameters specifying a component, $C_{nz}$ is the number of components with non-zero weight in the mixture ($\alpha_c > 0$), $\lambda$ is the parameter list of the GMM, i.e. $\{\alpha_1, \mu_1, \Sigma_1, \dots, \alpha_C, \mu_C, \Sigma_C\}$, and the last term $\log L(Z, \lambda)$ is the log-likelihood of the training data given the distribution parameters $\lambda$. The EM algorithm can be used to minimize the above equation with fixed $C_{nz}$. This leads to the M-step with the component weight updating formula

$$\alpha_c^{i+1} = \frac{\max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,c} \right) - \frac{V}{2} \right\}}{\sum_{j=1}^{C} \max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,j} \right) - \frac{V}{2} \right\}}.$$

This formula contains an explicit rule for annihilating components by setting their weights to zero. The other distribution parameters are updated as in the previous method. Figure 3 shows an intermediate step of mixture learning using the MML criterion.

Figure 3. Intermediate learning step of GMM before achieving convergence using the MML criterion (horizontal axis: MNF Band 1, vertical axis: MNF Band 2).

Figure 4. Classification using MML learning criterion (legend: Outliers, Class 1, Class 2, Class 3, Miss-classification; horizontal axis: MNF Band 1, vertical axis: MNF Band 2).
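The component weight update in the M-step of the MML criterion can be sketched in a few lines. The function name `mml_weight_update` is hypothetical, and the inputs are assumed to be the per-component sums $\sum_n w_{n,c}$ already computed in the E-step.

```python
from typing import List, Sequence


def mml_weight_update(resp_sums: Sequence[float], n_free_params: float) -> List[float]:
    """Return the updated mixing weights.

    resp_sums[c] is sum_n w_{n,c}; n_free_params is V, the number of free
    parameters per component. Components whose sum falls below V/2 get
    weight zero, which annihilates them.
    """
    raw = [max(0.0, s - n_free_params / 2.0) for s in resp_sums]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("every component would be annihilated")
    return [r / total for r in raw]


# Hypothetical example: the third component has too little support and is removed.
new_alphas = mml_weight_update([50.0, 48.0, 2.0], n_free_params=5.0)
```

Because the penalty $V/2$ grows with the number of parameters per component, modes with richer covariance structure need more supporting samples to survive an update.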
5 Classification Results

Having described the two learning criteria, we now evaluate their performance by using a simple Bayesian classifier. Figures 4 and 5 depict the classification of the three HSI mixture classes based on the two learning criteria. Classifications, outliers and misclassifications are shown in these figures for the two cases. Table 1 compares the classification performance of the DCA learning method with that of the MML method. The table lists the classification and misclassification rates of the two learning methodologies as a function of the amount of training data utilized.

Figure 5. Classification using the DCA learning criterion (legend: Outliers, Class 1, Class 2, Class 3, Miss-Classification; horizontal axis: MNF Band 1, vertical axis: MNF Band 2).
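A simple Bayesian classifier over class-conditional mixtures can be sketched as follows. This is an illustration under assumptions: the features are 1-D, each class model is a list of `(weight, mean, variance)` modes, the priors and parameter values are invented for the example, and no outlier rejection is included even though the figures mark outliers.

```python
import math
from typing import Dict, List, Tuple

Mode = Tuple[float, float, float]  # (mixing weight, mean, variance)


def gmm_pdf(x: float, modes: List[Mode]) -> float:
    """Class-conditional density: a weighted sum of 1-D Gaussian modes."""
    return sum(
        a * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        for a, mu, var in modes
    )


def bayes_classify(x: float,
                   class_models: Dict[str, List[Mode]],
                   priors: Dict[str, float]) -> str:
    """Pick the class that maximizes prior times class-conditional likelihood."""
    return max(class_models, key=lambda c: priors[c] * gmm_pdf(x, class_models[c]))


# Hypothetical models, e.g. as produced by either training scheme.
models = {
    "class_1": [(1.0, -2.0, 1.0)],
    "class_2": [(0.5, 1.5, 0.5), (0.5, 3.0, 0.5)],
}
priors = {"class_1": 0.5, "class_2": 0.5}
label = bayes_classify(-1.0, models, priors)
```

Since the denominator of Bayes' rule is the same for every class, comparing prior times likelihood is sufficient to pick the maximum a posteriori class.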
  5. 5. Table 1. Classification performance comparisons between the DCA learning criterion and the MML learning criterion.

Training data    Classification (%)      Misclassification (# of pixels)
utilized (%)     MML        DCA          MML        DCA
55               68.7691    69.8148      1087       1054
60               73.5235    74.6569      957        970
65               72.6667    73.5882      647        606
70               74.54      75.1457      1230       866
75               75.1541    74.3254      857        752
80               75.6765    75.5392      404        468

6 Conclusions

In this paper, we evaluated the performance of two PDF mixture learning criteria for the reduced dimensionality material classification of Hyperspectral remote sensing data. The results show that the two methods have nearly identical classification performance. The outcome of this paper offers scientists a possible integration of advanced data analysis and modeling tools, advancing the state of the practice in the utilization of satellite image data for various types of Earth System Science studies.

References

[1] D. G. Manolakis and G. Shaw, "Detection Algorithms for Hyperspectral Imaging Applications," IEEE Signal Processing Magazine, vol. 19, no. 1, January 2002, ISSN 1053-5888.

[2] R. N. Clark, A. J. Gallagher, and G. A. Swayze, "Material absorption band depth mapping of imaging spectrometer data using a complete band shape least-squares fit with library reference spectra," Proceedings of the Second Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Workshop, JPL Publication 90-54, pp. 176-186, 1990.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: M. Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] A. A. Green, M. Berman, P. Switzer and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.

[6] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Transactions on Systems, Man and Cybernetics, vol. 29, pp. 393-399, 1999.

[7] N. Vlassis and A. Likas, "A greedy EM for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2002.

[8] G. F. Hughes, "On mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[9] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. ISBN 0-471-42028-X, Wiley Inter-Science Publishers, 2003.

[10] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.

[11] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39B, no. 1, pp. 1-38, 1977.

[13] M. A. T. Figueiredo and A. K. Jain, "Unsupervised Learning of Finite mixture models," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 3, pp. 381-396, March 2002.

[14] A. K. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[15] B. H. Huang, "Maximum Likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[16] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, no. 5, pp. 729-734, 1982.

[17] A. K. Jain, R. Duin and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-38, 2000.