Classification of Brazilian soils by using LIBS and variable selection in the wavelet domain
Analytica Chimica Acta 642 (2009) 12–18
Journal homepage: www.elsevier.com/locate/aca

Márcio José Coelho Pontes (a), Juliana Cortez (b), Roberto Kawakami Harrop Galvão (c), Celio Pasquini (b), Mário César Ugulino Araújo (a,∗), Ricardo Marques Coelho (d), Márcio Koiti Chiba (d), Mônica Ferreira de Abreu (d), Beáta Emöke Madari (e)

(a) Universidade Federal da Paraíba, Departamento de Química, João Pessoa, PB, Brazil
(b) Universidade Estadual de Campinas, Instituto de Química, Campinas, SP, Brazil
(c) Instituto Tecnológico de Aeronáutica, Divisão de Engenharia Eletrônica, São José dos Campos, SP, Brazil
(d) Instituto Agronômico de Campinas, Centro de Pesquisa e Desenvolvimento de Solos e Recursos Ambientais, Campinas, SP, Brazil
(e) Embrapa Solos, Rio de Janeiro, RJ, Brazil

Article history: Received 28 September 2008. Accepted 4 March 2009. Available online 13 March 2009.

Keywords: Brazilian soils; Laser-induced breakdown spectroscopy; Classification; Wavelet compression; Successive projections algorithm; Linear discriminant analysis

Abstract

This paper proposes a novel analytical methodology for soil classification based on the use of laser-induced breakdown spectroscopy (LIBS) and chemometric techniques. In the proposed methodology, linear discriminant analysis (LDA) is employed to build a classification model on the basis of a reduced subset of spectral variables. For the purpose of variable selection, three techniques are considered, namely the successive projections algorithm (SPA), the genetic algorithm (GA), and a stepwise formulation (SW). The use of a data compression procedure in the wavelet domain is also proposed to reduce the computational workload involved in the variable selection process.
The methodology is validated in a case study involving the classification of 149 Brazilian soil samples into three different orders (Argissolo, Latossolo and Nitossolo). For comparison, soft independent modelling of class analogy (SIMCA) models are also employed. The best discrimination of soil types was attained by SPA–LDA, which achieved an average classification rate of 90% in the validation set and 72% in cross-validation. Moreover, the proposed wavelet compression procedure was found to be of value by providing a 100-fold reduction in computational workload without significantly compromising the classification accuracy of the resulting models.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Soil classification is an important subject in several areas, such as agriculture and civil engineering. In fact, proper handling and use of the soil, including cultivation planning and design of drainage systems, depend on the soil class. The Brazilian System of Soil Classification [1] employs chemical, physical and morphological parameters. However, the reference methods for determination of these parameters are laborious and time-consuming, mainly due to the required sample treatment procedures. In addition, some classification criteria are subjective and difficult to quantify. The American [2] and French [3] systems of soil classification suffer from the same problems.

∗ Corresponding author at: Universidade Federal da Paraíba, Departamento de Química – Laboratório de Automação e Instrumentação em Química Analítica/Quimiometria (LAQA), Caixa Postal 5093, CEP 58051-970 – João Pessoa, PB, Brazil. Tel.: +55 83 3216 7438; fax: +55 83 3216 7437. E-mail address: laqa@quimica.ufpb.br (M.C.U. Araújo).

Some papers have been published on the use of parameters such as fertility [4] and morphological characteristics [5] for soil classification.
However, few works have been concerned with the development of analytical techniques and/or data treatment procedures to simplify the use of existing soil classification systems [6–8].

Zagatto [6] classified some types of Brazilian soils on the sole basis of chemical composition. For this purpose, the total contents of 20 elements in soil samples and their extracts were quantified by inductively coupled plasma optical emission spectrometry (ICP-OES) or atomic absorption spectroscopy (AAS) and used as parameters for classification. K-nearest neighbours (KNN) and soft independent modelling of class analogy (SIMCA) were employed and a correct classification rate of 80% was obtained.

Demattê et al. [7] evaluated soil types and soil tillage systems by using visible (VIS)–near infrared (NIR) reflectance spectroscopy in the 450–2500 nm range. Different depths were utilized to determine soil classes. Soil survey maps were developed by descriptive interpretation of the spectral curves and statistical analysis. The results compared favourably with those of a conventional method in terms of soil line demarcation and number of detected soil classes.

0003-2670/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2009.03.001

Mouazen et al. [8] employed VIS–NIR reflectance spectroscopy (306.5–1710.9 nm) to discriminate soil texture classes. Factorial discriminant analysis (FDA) was applied to the first five principal components (PCs) resulting from the principal component analysis (PCA) of the VIS–NIR spectra. Four different classes of soil samples were classified with 85.7% and 81.8% of correct classification for the calibration and validation sets, respectively. After two similar classes (coarse and fine sand) were merged, the correct classification rate increased to 89.9% (calibration) and 85.1% (validation).

The present paper proposes a novel analytical methodology for soil classification based on the use of laser-induced breakdown spectroscopy (LIBS). In LIBS, a high-power pulsed laser is focused on the sample surface. The high power per area (irradiance) causes the vaporization of sample constituents and the formation of a plasma. The emission spectrum of the plasma is then acquired and used as the analytical response. This technique can be applied to solid, liquid or gaseous materials with little or no sample treatment [9].

LIBS has been successfully applied to the classification of different samples, including chemical and biological warfare agent simulants [10], alloys [11], archaeological objects [12], polymers [13] and explosives [14], among others. However, only one paper has been published on the use of LIBS in the context of soil classification [15]. In that work, LIBS spectra initially containing more than 50,000 points were reduced to 68 points corresponding to the spectral lines of eight major soil elements (aluminium, silicon, iron, calcium, magnesium, potassium, titanium and manganese) and PCA was applied to the reduced data. As a result, only two soil classes could be discriminated.
The substantial dispersion of the remaining samples prevented an adequate classification.

The present paper investigates the use of LIBS and chemometric techniques for classification of Brazilian soil samples into three different orders, namely Argissolo, Latossolo and Nitossolo. These soil orders were defined in the Brazilian System of Soil Classification [1], created in 1999. According to the international classification of FAO (Food and Agriculture Organization of the United Nations) [17], the Argissolo, Latossolo and Nitossolo orders are equivalent to the Acrisol, Ferralsol and Nitisol soil groups, respectively. The Argissolo order consists of soils that are poor in exchangeable basic cations and are morphologically and physically heterogeneous. Latossolo soils are poor in exchangeable basic cations and morphologically homogeneous. The Nitossolo order comprises soils with variable content of exchangeable cations, carrying a unique set of physical and morphological properties that is reflected in a typical hydrological and mechanical behaviour. These three orders are representative of humid tropical regions with soils typically developed from highly weathered parent material. These soils consist mostly of iron and aluminium oxides (e.g. goethite and gibbsite) and 1:1 (Si:Al) layer silicate (basically kaolinite). According to IBGE [16], Argissolo and Latossolo are predominant in Brazil, as well as in other countries of South America. Nitossolo corresponds to approximately 1% of the Brazilian territory.

Owing to the very large number of variables in a LIBS spectrum, the use of appropriate feature extraction procedures is required. In this context, a possible approach consists of selecting spectral lines corresponding to specific elements [15]. However, in order to reduce the possibility of losing relevant information for the classification task, the present work employs statistical variable selection algorithms instead of a priori considerations.
More specifically, the successive projections algorithm (SPA) [18], the genetic algorithm (GA) [18], and a stepwise formulation (SW) [19] are adopted for this purpose. Linear discriminant analysis is then employed to obtain a classification model based on the selected spectral variables. In addition, the use of a data compression procedure in the wavelet domain is proposed to reduce the computational workload involved in the variable selection process. For comparison, the results obtained by using SIMCA models are also presented.

2. Theory

The linear discriminant analysis (LDA) classification method employs linear decision boundaries (hyperplanes), which are defined in order to maximize the ratio of between-class to within-class dispersion [20]. In order to have a well-posed problem, the number of calibration (training) objects must be larger than the number of variables to be included in the LDA model. Therefore, the use of LDA for classification of spectral data usually requires appropriate variable selection procedures [18,19,21]. In this section, the three algorithms adopted for this purpose in the present work (SPA, SW, and GA) will be described. Moreover, a wavelet compression (WC) method, which can be employed prior to variable selection, will also be presented.

2.1. Successive projections algorithm

The successive projections algorithm (SPA) was originally proposed by Araújo et al. [22] to minimize multi-collinearity effects and thus improve the conditioning of multiple linear regression (MLR) modelling for spectral data. In the original formulation, candidate subsets of variables were defined as the result of projection operations carried out on the matrix of instrumental response data. These subsets were then used to build MLR models, which were compared in terms of the prediction error in a set of validation samples. This validation set was not employed in either the projection operations or the calibration of the MLR models.
At the end, the subset of variables leading to the smallest root-mean-square error of validation (RMSEV) was adopted.

In a subsequent paper [18], SPA was adapted for use in classification problems. As in the original formulation, the candidate subsets of variables were formed as the result of projection operations intended to minimize multi-collinearity effects, which are a known cause of poor generalization performance in LDA [23]. However, the RMSEV metric was replaced with an average risk G of LDA misclassification. Such a cost function is calculated in the validation set as

$$G = \frac{1}{K_v} \sum_{k=1}^{K_v} g_k, \qquad (1)$$

where $g_k$ (risk of misclassification of the kth validation object $x_k$, $k = 1, \ldots, K_v$) is defined as

$$g_k = \frac{r^2(x_k, \mu_{I_k})}{\min_{I_j \neq I_k} r^2(x_k, \mu_{I_j})}. \qquad (2)$$

In this definition, the numerator $r^2(x_k, \mu_{I_k})$ is the squared Mahalanobis distance [24] between object $x_k$ (of class index $I_k$) and the sample mean $\mu_{I_k}$ of its true class. The denominator in Eq. (2) corresponds to the squared Mahalanobis distance between object $x_k$ and the centre of the closest wrong class. In the Mahalanobis distance calculations, the sample mean for each class and the pooled covariance matrix for each variable subset under consideration are computed by using the training data.

2.2. Stepwise algorithm

The stepwise (SW) selection algorithm adopted in the present work was proposed by Caneca et al. [19] for classification of diesel-engine lubricating oils on the basis of near and mid-infrared spectra. Initially, the algorithm calculates the discriminability of each spectral variable with respect to the classes under consideration [20].
The variable with the largest discriminability value is selected and a leave-one-out cross-validation procedure is carried out by using LDA. Among the remaining variables, those having a large correlation with the selected one are then discarded to avoid collinearity problems. This process is repeated at each subsequent iteration by successively adding variables to the LDA model until no more variables are available for selection. The subset of variables leading to the smallest number of cross-validation errors is then adopted. If different subsets lead to the same number of cross-validation errors, the subset with the smallest number of variables is chosen.

It is worth noting that, after the second iteration, the discarding of variables is based on the coefficient of multiple correlation, which is defined, for each variable $x_i$ still available for selection, as

$$r_i = \frac{\sigma(\hat{x}_i)}{\sigma(x_i)}, \qquad (3)$$

where $\sigma(\cdot)$ denotes the standard deviation calculated in the training set and $\hat{x}_i$ is an estimate of $x_i$ obtained by multiple linear regression from the variables already selected. If $r_i$ is close to one, variable $x_i$ is redundant because its values can be predicted, with good accuracy, from the variables already included in the LDA model. An inconvenience of this algorithm is the need to set a threshold for $r_i$ in order to decide which variables are to be discarded. However, it is possible to test different threshold values and then compare the resulting LDA models on the basis of the classification errors obtained in a separate validation set.

Fig. 1. Filter bank implementation of the wavelet transform. In this diagram, H and G represent a low-pass and a high-pass digital filter, respectively, and ↓2 denotes the dyadic downsampling operation.

2.3. Genetic algorithm

The GA is a versatile search technique inspired by the biological mechanisms of evolution by natural selection [25–27].
In variable selection problems, the algorithm typically encodes subsets of variables in the form of strings of binary (0/1) values termed “chromosomes”. Each position (or “gene”) in the chromosome is associated with one of the variables available for selection. Genes with a “1” value indicate that the corresponding variables are to be included in the model. The algorithm starts with a population of randomly generated chromosomes, which are then combined according to certain rules in order to produce a new generation of chromosomes (offspring). This process is repeated until a given stopping criterion is satisfied.

The present work adopts the GA formulation presented in Ref. [18], which has the following features. A fitness value is defined for each chromosome as the inverse of the validation cost defined in Eq. (1), calculated for the subset of variables encoded in the chromosome (“1” genes). The probability of a given chromosome being selected for offspring generation is proportional to its fitness (“roulette” method) [25]. By using this probabilistic method, pairs of chromosomes are formed and then combined to generate pairs of descendants by one-point crossover and mutation operators. The population size is kept constant, each generation being completely replaced by its descendants. However, the best individual is automatically transferred to the next generation (elitism) to avoid the loss of good solutions. This evolutionary process is repeated until a pre-specified number of cycles is completed.

Fig. 2. Diagram of the LIBS instrument. (a) Laser source and cooler, (b) Nd:YAG laser head, (c) dichroic mirror, (d) focusing lens, (e) soil sample, (f) sample cell, (g) collecting lens, (h) optical fibre, (i) detector trigger signal, (j) echelle polychromator, (k) ICCD detector and (l) computer.

2.4. Wavelet compression

The SPA, SW and GA algorithms described above may involve considerable computational workload if the number of variables is large, as in the case of LIBS spectra.
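As an aside on where this workload comes from, SPA–LDA and GA–LDA both score each candidate subset with the validation cost G of Eqs. (1) and (2). The following is a minimal sketch of that cost in plain Python, not the authors' Matlab code. The function names (`pooled_covariance`, `sq_mahalanobis`, `average_risk`) and the small Gaussian-elimination solver are illustrative assumptions, and the data are assumed to be already restricted to the variables of the subset under test.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Sample mean of a list of objects (one row per object)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(train: Dict[int, List[Vector]]) -> List[List[float]]:
    """Pooled within-class covariance computed from the training objects."""
    dim = len(next(iter(train.values()))[0])
    scatter = [[0.0] * dim for _ in range(dim)]
    n_total = 0
    for rows in train.values():
        mu = mean_vector(rows)
        n_total += len(rows)
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for a in range(dim):
                for b in range(dim):
                    scatter[a][b] += d[a] * d[b]
    dof = n_total - len(train)  # objects minus number of classes
    return [[s / dof for s in row] for row in scatter]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def sq_mahalanobis(x: Vector, mu: Vector, cov: List[List[float]]) -> float:
    """Squared Mahalanobis distance r^2(x, mu) = d^T cov^{-1} d."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(di * zi for di, zi in zip(d, solve(cov, d)))


def average_risk(train: Dict[int, List[Vector]],
                 validation: List[Tuple[Vector, int]]) -> float:
    """Average risk G of Eq. (1); each g_k follows Eq. (2)."""
    cov = pooled_covariance(train)
    means = {c: mean_vector(rows) for c, rows in train.items()}
    risks = []
    for x, true_class in validation:
        num = sq_mahalanobis(x, means[true_class], cov)
        den = min(sq_mahalanobis(x, mu, cov)
                  for c, mu in means.items() if c != true_class)
        risks.append(num / den)
    return sum(risks) / len(risks)
```

A value of G well below one means that validation objects lie much closer to their own class mean than to any other. Each call solves one linear system per distance, and a search over thousands of candidate variables repeats the evaluation for many subsets, so the total cost grows quickly with the number of variables.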
This problem can be alleviated by using a compression technique to reduce the dimensionality of the data prior to the variable selection procedures. In the present work, a wavelet compression method is adopted for this purpose.

The wavelet transform (WT) is a multiresolution signal processing tool [28] that has found several applications in denoising, feature extraction and compression of instrumental signals [29–34]. The WT of a spectrum $x = [x(\lambda_1)\ x(\lambda_2)\ \cdots\ x(\lambda_J)]$, where $\lambda_j$ is the jth wavelength, can be obtained by using a digital filter bank structure [28,31,35] of the form depicted in Fig. 1.

The basic structure of the filter bank consists of a pair of low-pass (H) and high-pass (G) filters, followed by a downsampling operation, which discards one in every two points of the filtering outcome. The downsampled output of the low-pass filter, termed “approximation coefficients”, is a smoothed version of the spectrum at a coarser resolution. The downsampled output of the high-pass filter, termed “detail coefficients”, corresponds to high-frequency noise, as well as sharp features of the spectrum, such as narrow peaks. This operation can be reapplied to the approximation coefficients up to the number of decomposition levels specified by the analyst. The result of the transform comprises the final approximation coefficients, as well as the detail coefficients obtained along the entire filter bank. With a slight abuse of language, this result will be henceforth termed “wavelet coefficients”.

The H and G filters employed in the filter bank are typically of finite length, which implies that each approximation or detail coefficient corresponds to a reduced range of wavelengths within the spectrum. This spatial localization feature is often invoked as one of the main advantages of the WT over the Fourier transform [28,35]. However, the choice of appropriate H and G filters for a specific application may not be straightforward [29,31]. In the present work, different wavelet filters were tested and compared in terms of compression ability for the LIBS data set under consideration. The number of decomposition levels was set to the maximum for which the spatial localization features of the WT are not lost [32]. This limit situation occurs when the H and G filters span the entire length of the downsampled approximation coefficients [36].

Table 1. Number of training and validation samples in each class.

| Class     | Training | Validation |
|-----------|----------|------------|
| Argissolo | 31       | 15         |
| Latossolo | 56       | 28         |
| Nitossolo | 12       | 7          |
| Total     | 99       | 50         |

Fig. 3. Mean LIBS spectrum of each soil order.

3. Experimental

3.1. Brazilian soil data set

A total of 149 Brazilian soil samples of three different orders (Argissolo: 46, Latossolo: 84 and Nitossolo: 19) collected at the B horizon (subsurface layer) were employed in the study.
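Before LIBS spectral recording, these samples were dried in an oven at 105 °C for 2.5 h, ground and sieved to a particle size smaller than 350 μm.

Table 2. Classification rates obtained with GA–LDA, SW–LDA, SPA–LDA and SIMCA for (1) Argissolo, (2) Latossolo and (3) Nitossolo. The number of spectral variables employed in each model is indicated in parentheses. N indicates the number of samples employed in the calculation of the classification rates. Entries are the percentages of samples assigned to each predicted class index (1, 2, 3). GA = GA–LDA (17), SW = SW–LDA (7), SPA = SPA–LDA (5).

| Set              | True class index | N  | GA 1 | GA 2 | GA 3 | SW 1 | SW 2 | SW 3 | SPA 1 | SPA 2 | SPA 3 | SIMCA 1 | SIMCA 2 | SIMCA 3 |
|------------------|------------------|----|------|------|------|------|------|------|-------|-------|-------|---------|---------|---------|
| Validation set   | 1                | 15 | 73   | 13   | 13   | 73   | 27   | 0    | 80    | 20    | 0     | 100     | 80      | 80      |
| Validation set   | 2                | 28 | 0    | 89   | 11   | 0    | 79   | 21   | 4     | 89    | 7     | 93      | 100     | 79      |
| Validation set   | 3                | 7  | 0    | 29   | 71   | 0    | 29   | 71   | 0     | 0     | 100   | 100     | 100     | 100     |
| Cross-validation | 1                | 46 | 72   | 15   | 13   | 74   | 20   | 7    | 70    | 20    | 11    | 98      | 72      | 67      |
| Cross-validation | 2                | 84 | 11   | 69   | 20   | 10   | 75   | 16   | 11    | 73    | 17    | 79      | 98      | 60      |
| Cross-validation | 3                | 19 | 16   | 32   | 53   | 16   | 16   | 68   | 11    | 16    | 74    | 90      | 95      | 100     |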
Fig. 4. (a) PC2 × PC1 and (b) PC3 × PC1 score plots for the overall set of 149 soil samples (Argissolo, Latossolo and Nitossolo samples are plotted with distinct symbols).

3.2. LIBS instrument

The measurements were carried out with a lab-made LIBS instrument consisting of a Nd:YAG laser (Quantel, 1064 nm, 360 mJ/pulse and pulse duration of 5 ns), an echelle polychromator (52.13 lines/mm, Mechelle 5000, Andor Technology), an intensified charge-coupled device (ICCD) detector with an array of 1024 × 1024 pixels (Model DH734, Andor Technology) and an adjustable-position plate for the sample. Fig. 2 presents a diagram of the LIBS instrument.

3.3. Spectra acquisition

Thirty spectra were acquired for each sample by applying the laser pulse to different points of the sample surface. Prior to the measurement process, the sample cell was filled and the soil surface was levelled. After every five measurements, on different points, the sample surface was re-levelled to eliminate the small craters produced by the laser beam.

The laser energy, delay time and integration time gate were 110 mJ/pulse, 500 ns and 10 μs, respectively. The focal point was situated 0.5 cm below the sample surface. The spectra were acquired in the range 203.13–987.64 nm. Each resulting spectrum had 26,624 points.

3.4. Software

Each individual spectrum was pre-treated by Standard Normal Variate (SNV) [37]. Afterwards, the average spectrum for each sample was calculated. The average spectra were then divided into training and validation sets by using the classic Kennard-Stone (KS) algorithm [38]. The KS algorithm was applied to each class separately, as described in Ref. [18]. The number of samples in each set is presented in Table 1.

For the purpose of WC, 22 different wavelets were tested (Symlet 4–10, Daubechies 1–10 and Coiflet 1–5).
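To illustrate the WC step, the sketch below implements the filter bank of Fig. 1 for the shortest wavelet in this list, Daubechies 1 (the Haar wavelet), in plain Python. It is an illustration under stated assumptions, not the authors' Matlab routine: the function names are hypothetical, the signal length is assumed to be divisible by 2 at every level, and `n_coeffs_for_energy` uses the energy of a single signal as a stand-in for a retained-variance criterion computed over the whole data set.

```python
import math
from typing import List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One filter-bank stage: low-pass (H) and high-pass (G) filtering
    followed by dyadic downsampling, for the two-tap Haar filters."""
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [s * (signal[k] + signal[k + 1]) for k in pairs]
    detail = [s * (signal[k] - signal[k + 1]) for k in pairs]
    return approx, detail


def haar_transform(signal: Sequence[float], levels: int) -> List[float]:
    """Reapply the stage to the approximation coefficients `levels` times.

    Returns the final approximation coefficients followed by the detail
    coefficients, ordered from the coarsest to the finest level.
    """
    approx = list(signal)
    details: List[List[float]] = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    coeffs = list(approx)
    for detail in details:
        coeffs.extend(detail)
    return coeffs


def n_coeffs_for_energy(coeffs: Sequence[float], fraction: float) -> int:
    """Smallest number of largest-magnitude coefficients whose squared
    sum reaches `fraction` of the total squared sum (assumed criterion)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    total = sum(energies)
    running = 0.0
    for count, energy in enumerate(energies, start=1):
        running += energy
        if running >= fraction * total:
            return count
    return len(energies)


# A step-like toy signal of 8 points, decomposed over 3 levels.
toy = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
toy_coeffs = haar_transform(toy, levels=3)
print(len(toy_coeffs), n_coeffs_for_energy(toy_coeffs, 0.95))
```

Because the Haar filters are orthonormal, the transform preserves the sum of squares of the signal, so compression amounts to discarding coefficients that carry little of it. Longer filters follow the same structure with more taps per output point.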
The low-pass and high-pass filters for dbN, symN and coifN have length 2N, 2N, and 6N, respectively (i.e., small values of N are associated with wavelets of small width). These wavelets were selected in view of previous works concerning FT-IR [36] and UV–VIS [39] spectrometry. The maximum number of decomposition levels for each wavelet was employed, as discussed in Section 2.4. The percentage of data variance retained in the compression process was set to 95%.

SNV, PCA and SIMCA were performed with the default settings of the Unscrambler® 9.6 software (CAMO A/S). The optimal number of PCs was determined from the residual variance curve: the first local minimum was adopted unless later PCs gave significantly lower residual variance. The significance level of the F-test for SIMCA classification was set to the default value (5%). The WC, KS, GA–LDA, SW–LDA and SPA–LDA classification routines were implemented in Matlab® 6.5. The GA routine was run for 200 generations with 400 chromosomes each. Crossover and mutation probabilities were set to 60% and 10%, respectively, as in [18]. Moreover, the algorithm was repeated three times, starting from different random initial populations. The best solution (in terms of the fitness value) resulting from the three realizations of the GA was employed. Seven threshold values (0.1, 0.2, 0.5, 0.7, 0.8, 0.9, and 0.95) for the coefficient of multiple correlation were tested in the SW–LDA algorithm. The best threshold was selected on the basis of the classification errors in the validation set. If two threshold values provided the same number of classification errors, the threshold providing the simplest model (smallest number of selected variables) was favoured.

The results were expressed in terms of classification rates for the validation set. In addition, cross-validation results were obtained by applying the leave-one-out approach to the entire data set of 149 samples.

4. Results and discussion

Fig. 3 presents the mean LIBS spectrum of each soil order in the range of approximately 203–1000 nm. As can be seen, discriminating the three soil orders on the basis of LIBS measurements is not straightforward, owing to the complexity of the spectra. The difficulty involved in the classification task is also apparent in the PC score plots presented in Fig. 4. As can be seen, the dispersion within each class is considerable. Such a dispersion can be ascribed to the poor repeatability of the LIBS measurements, as well as the large chemical and mineralogical variability within each soil type. In Fig. 4, the best discrimination is found between Latossolo and Argissolo samples, which are reasonably well separated along PC1. In fact, these two orders are the most distinct in terms of mineralogical constitution. However, they are considerably overlapped by Nitossolo. It may be argued that distinctive features of Nitossolo are not adequately captured by the LIBS spectra.

4.1. Classification in the original spectral domain

Table 2 presents the classification results (validation set and cross-validation) obtained in the original spectral domain. This table also indicates the number of spectral variables (wavelengths) employed in each model. In the case of SW–LDA, the threshold value selected according to the criteria described in Section 3.4 was 0.2. The number of variables for SPA–LDA was determined from the minimum of the cost function displayed in Fig. 5.

Fig. 5. Determination of the optimum number of variables in SPA–LDA.

The rates in Table 2 express both correct classifications (predicted class index equal to correct class index) and incorrect classifications (predicted class index different from correct class index). In each LDA model, the three rates in a row add up to 100% (up to rounding), because every sample is included in one and only one class. For example, the 15 validation samples of class 1 (Argissolo) were classified by GA–LDA in the following manner: 11 samples (73%) were correctly included in class 1, two samples (13%) were incorrectly included in class 2 (Latossolo), and two samples (13%) were incorrectly included in class 3 (Nitossolo). In contrast, SIMCA may include a given sample in more than one class. Therefore, the sum of the three rates in a row may be larger than 100% for SIMCA.

Among the LDA models, the worst overall results in terms of validation and cross-validation were obtained with GA–LDA. This finding may be ascribed to the fact that GA–LDA does not take into account multicollinearity effects in the variable selection process, whereas SPA–LDA and SW–LDA were designed to minimize such effects. In fact, it is worth noting that GA–LDA selected a larger number of spectral variables (17), as compared to SW–LDA (7) and SPA–LDA (5). As regards the comparison between SW–LDA and SPA–LDA, it can be seen that SPA–LDA provides better results in the validation set for all three soil types (average correct classification rate of 90%).
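The percentages discussed above can be checked with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, the counts for GA–LDA class 1 are those quoted in the text, and the remaining rows are copied from the validation part of Table 2. Treating the average correct classification rate as the unweighted mean of the three per-class correct rates is an inference from the tabulated values, not a definition given by the authors.

```python
from typing import List, Sequence


def row_rates(counts: Sequence[int]) -> List[int]:
    """Convert the counts of one true class into rounded percentages."""
    total = sum(counts)
    return [round(100.0 * c / total) for c in counts]


def average_correct_rate(rates: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of the diagonal (correct) rates, in percent.

    Assumption: this is how the quoted averages were obtained.
    """
    diagonal = [rates[i][i] for i in range(len(rates))]
    return sum(diagonal) / len(diagonal)


# GA-LDA, validation set, true class 1: 11, 2 and 2 of 15 samples.
print(row_rates([11, 2, 2]))

# Validation rows of Table 2 (true class by predicted class, in %).
ga_validation = [[73, 13, 13], [0, 89, 11], [0, 29, 71]]
sw_validation = [[73, 27, 0], [0, 79, 21], [0, 29, 71]]
spa_validation = [[80, 20, 0], [4, 89, 7], [0, 0, 100]]

for name, table in [("GA-LDA", ga_validation),
                    ("SW-LDA", sw_validation),
                    ("SPA-LDA", spa_validation)]:
    print(name, round(average_correct_rate(table)))
```

Under this reading, the SPA–LDA validation average is (80 + 89 + 100)/3 ≈ 90%, in agreement with the value quoted in the text. Because the mean is unweighted, the seven Nitossolo samples count as much as the 28 Latossolo samples.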
In terms of overall cross-validation performance, SW–LDA and SPA–LDA are similar, as the average correct classification rate was 72% for both models.

SIMCA provided good validation and cross-validation results in terms of correctly including the samples in their true class. However, almost all samples were also included in an incorrect class. This problem may be ascribed to the dispersion and overlapping of the soil classes, as seen in the score plots presented in Fig. 4.

4.2. Use of wavelet compression

As discussed in Section 3.4, 22 wavelets were tested for compression of the LIBS spectra. Table 3 presents the results, which are expressed in terms of the number of coefficients required to explain 95% of the data variance. Overall, the best performances (i.e., the smallest number of required coefficients) were obtained with the smaller wavelets within each family, Db1 being a notable exception. In fact, small wavelets may be a better match to the narrow emission peaks found in LIBS spectra.

Table 3. Number of wavelet coefficients required to explain 95% of the data variance.

| Wavelet | Number of retained coefficients |
|---------|---------------------------------|
| Sym4    | 663                             |
| Sym5    | 684                             |
| Sym6    | 696                             |
| Sym7    | 701                             |
| Sym8    | 723                             |
| Sym9    | 729                             |
| Sym10   | 751                             |
| Db1     | 785                             |
| Db2     | 692                             |
| Db3     | 690                             |
| Db4     | 738                             |
| Db5     | 753                             |
| Db6     | 781                             |
| Db7     | 818                             |
| Db8     | 858                             |
| Db9     | 865                             |
| Db10    | 896                             |
| Coif1   | 678                             |
| Coif2   | 677                             |
| Coif3   | 700                             |
| Coif4   | 719                             |
| Coif5   | 751                             |

Classification tests were carried out by using the five best wavelets in terms of compression performance (sym4, db2, db3, coif1 and coif2). Table 4 presents the validation results obtained by applying GA–LDA, SW–LDA and SPA–LDA to the compressed data set. The best wavelets for GA–LDA, SW–LDA and SPA–LDA were sym4 (considering compression performance in addition to the classification rate), coif1 and coif2, respectively. By using these wavelets, a classification rate of 84% was obtained with the three LDA models. For GA–LDA and SW–LDA, this rate is an improvement in comparison with the results obtained in the original spectral domain.
In the case of SPA–LDA, the result became slightly worse, as the classification rate in the original domain was 90%. However, the computational workload involved in the modelling process was substantially reduced by the use of WC, as the number of variables was reduced by a factor of approximately 40 (from 26,624 to 677 with coif2, for example). By using a computer with a Celeron 2.66 GHz processor and 2 GB RAM, the time required for variable selection by SPA was reduced from approximately 1000 min to 8 min. It is worth noting that the time spent in the WC process itself is relatively small (approximately 28 s for the coif2 wavelet).

By using the best wavelet for each model, the cross-validation rates for GA–LDA, SW–LDA and SPA–LDA were 69%, 70% and 71%, respectively. For SW–LDA and SPA–LDA, these results are slightly worse than the rate obtained in the original domain (72%). For GA–LDA, the result is actually better, as the rate obtained in the original domain was 65%. In view of the overall validation and cross-validation results, it can be concluded that the WC process does not significantly compromise the classification performance of the resulting models.

Table 4. Average classification rate (%) in the validation set (original spectral domain and wavelet-compressed data).

| Data            | GA–LDA | SW–LDA | SPA–LDA |
|-----------------|--------|--------|---------|
| Original domain | 78     | 74     | 90      |
| Sym4            | 84     | 81     | 79      |
| Db2             | 84     | 77     | 68      |
| Db3             | 84     | 83     | 80      |
| Coif1           | 77     | 84     | 79      |
| Coif2           | 83     | 75     | 84      |
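The reductions quoted in this section can be compared directly. The short Python sketch below (all names are hypothetical) uses only figures stated in the text: 26,624 and 677 variables, approximately 1000 min and 8 min for SPA variable selection, and approximately 28 s for the coif2 compression itself.

```python
def reduction_factor(before: float, after: float) -> float:
    """Ratio between a quantity before and after compression."""
    return before / after


# Number of variables: full spectrum versus coif2 coefficients.
variable_factor = reduction_factor(26_624, 677)

# SPA variable selection time in minutes (approximate values from the text).
time_factor = reduction_factor(1000.0, 8.0)

# Same ratio when the compression time (about 28 s) is charged to the
# compressed run as well.
time_factor_with_wc = reduction_factor(1000.0, 8.0 + 28.0 / 60.0)

print(round(variable_factor, 1))
print(round(time_factor))
print(round(time_factor_with_wc))
```

The number of variables falls by a factor of about 39, whereas the SPA run time falls by a factor of roughly 120, which is the order of magnitude of the 100-fold workload reduction stated in the abstract. Since both times are given as approximate, the ratio should be read as indicative only.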
For comparison purposes, Table 5 presents the classification results of SPA–LDA and SIMCA for the coif2-compressed data set. SIMCA models were constructed with the 677 coefficients resulting from the WC process and also with the six coefficients selected by SPA–LDA. Overall, the SIMCA classification rates were similar to those obtained in the original domain with the full spectrum (Table 2). This result corroborates the conclusion that the wavelet compression retains discriminatory information concerning the soil classes under study.

Table 5. Classification rates obtained with SPA–LDA and SIMCA for (1) Argissolo, (2) Latossolo and (3) Nitossolo. The number of wavelet coefficients employed in each model is indicated in parentheses. N indicates the number of samples employed in the calculation of the classification rates. Entries are the percentages of samples assigned to each predicted class index (1, 2, 3). SPA = SPA–LDA (6), S677 = SIMCA (677), S6 = SIMCA (6).

| Set              | True class index | N  | SPA 1 | SPA 2 | SPA 3 | S677 1 | S677 2 | S677 3 | S6 1 | S6 2 | S6 3 |
|------------------|------------------|----|-------|-------|-------|--------|--------|--------|------|------|------|
| Validation set   | 1                | 15 | 67    | 20    | 13    | 100    | 80     | 80     | 100  | 80   | 67   |
| Validation set   | 2                | 28 | 4     | 86    | 11    | 93     | 100    | 68     | 86   | 96   | 54   |
| Validation set   | 3                | 7  | 0     | 0     | 100   | 100    | 100    | 100    | 71   | 43   | 100  |
| Cross-validation | 1                | 46 | 67    | 17    | 15    | 96     | 70     | 65     | 93   | 65   | 63   |
| Cross-validation | 2                | 84 | 10    | 71    | 19    | 79     | 98     | 52     | 80   | 94   | 60   |
| Cross-validation | 3                | 19 | 16    | 11    | 74    | 95     | 95     | 100    | 74   | 90   | 100  |

5. Conclusions

This paper presented a novel methodology for soil classification based on the use of LIBS data and chemometric methods. The methodology was validated in a case study involving three Brazilian soil types (Argissolo, Latossolo and Nitossolo). Better discrimination of the soil types was attained by employing a subset of selected spectral variables for LDA, as compared to the use of full-spectrum SIMCA modelling.
More specifically, the best results were obtained with SPA–LDA, which achieved an average classification rate of 90% in the validation set and 72% in cross-validation. The proposed wavelet compression procedure was useful to reduce the computational workload (by a factor of approximately 100) without significantly compromising the classification accuracy. It is worth noting that, after the classification models have been obtained, the proposed methodology can be applied to new samples in a fast and straightforward manner.

Future work could investigate the combination of LIBS with other techniques, such as VIS–NIR spectroscopy, for the purpose of improving the classification outcome.

Acknowledgments

The authors thank PROCAD/CAPES (Grant 0081/05-1) and FAPESP (Grant 03/07419-5) for partial financial support. The research fellowships and scholarships granted by CNPq and CAPES are also gratefully acknowledged.
