SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

By
SOUMYADIP CHANDRA
(Y8103044)

Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Technology

DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2010

ABSTRACT
Hyperspectral data (HD) can provide a much larger amount of spectral
information than multispectral data. However, it suffers from problems such as the
curse of dimensionality and data redundancy, and the data sets are very large.
Consequently, it is difficult to process these datasets and obtain satisfactory
classification results.
The objectives of this thesis are to find the best feature extraction (FE)
techniques and to improve the accuracy and time of classification of HD by
using parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest
neighbor (KNN)) and support vector machine (SVM) algorithms. In
order to achieve these objectives, experiments were performed with different FE
techniques like segmented principal component analysis (SPCA), kernel principal
component analysis (KPCA), orthogonal subspace projection (OSP) and projection
pursuit (PP). DAIS-7915 hyperspectral sensor data set was used for investigations
in this thesis work.
From the experiments performed with the parametric and non-parametric
classifiers, the GML classifier was found to give the best results, with an overall
kappa value (k-value) of 95.89%. This was achieved by using 300 training pixels (TP)
per class and 45 bands of the SPCA feature extracted data set.
The SVM algorithm with the quadratic programming (QP) optimizer gave the best results
amongst all optimizers and approaches. An overall k-value of 96.91% was
achieved by using 300 TP per class and 20 bands of the SPCA feature extracted data
set. However, supervised FE techniques such as KPCA and OSP failed to improve the
results obtained by SVM significantly.
The best results obtained for GML, KNN and SVM were compared by
one-tailed hypothesis testing. It was found that the SVM classifier performed
significantly better than the GML classifier for a statistically large set of TP (300).
For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM
on the SPCA extracted data set is statistically not better than the performance of the
GML classifier.
 
ACKNOWLEDGEMENTS
I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit for
his involvement, motivation and encouragement throughout and beyond the thesis
work. His expert direction has inculcated in me qualities which I will treasure
throughout my life. His patient hearing and critical comments on the research
problem made me do better every time. His valuable suggestions at all stages of the
thesis work helped me overcome various shortcomings in my work.
I also express my sincere thanks for his effort in going through the
manuscript carefully and making it more readable. It has been a great learning
and life changing experience working with him.
I would like to express my sincere tribute to Dr. Bharat Lohani for his
friendly nature, excellent guidance and teaching during my stay at IITK.
I would especially like to thank Sumanta Pasari for his valuable
comments and corrections to the manuscript of my thesis.
I would like to thank all of my friends, especially Shalabh, Pankaj, Amar,
Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all other GI
people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous,
pleasant and memorable one.
In closing, I express my heartfelt gratitude to my parents and my best friend
for their unwavering support and encouragement to complete my study at IITK.
SOUMYADIP CHANDRA
July 2010
 
 
 
 
 
 
 
 
CONTENTS
CERTIFICATE………………………………………………………………………….. ii 
ABSTRACT........................................................................................................... iii 
ACKNOWLEDGEMENTS……………………………………………………………. iv 
CONTENTS………………………………………………………………………………...v
LIST OF TABLES………………………………………………………………………..ix
LIST OF FIGURES..................................................................................................x
LIST OF ABBREVIATIONS…………………………………………………………xiii
CHAPTER 1 - Introduction.........................................................................1
1.1 High dimensional space.......................................................................................2
1.1.1 What is hyperspectral data?.........................................................................2
1.1.2 Characteristics of high dimensional space..................................................3
1.1.3 Hyperspectral imaging .................................................................................4
1.2 What is classification? .........................................................................................5
1.2.1 Difficulties in hyperspectral data classification..........................................5
1.3 Background of work.............................................................................................6
1.4 Objectives .............................................................................................................7
1.5 Study area and data set used..............................................................................7
1.6 Software details ...................................................................................................9
 
1.7 Structure of thesis ...............................................................................................9
CHAPTER 2 – Literature Review........................................................10
2.1 Dimensionality reduction by feature extraction..................................................10
2.1.1 Segmented principal component analysis (SPCA)........................................11
2.1.2 Projection pursuit (PP) ...............................................................................11
2.1.3 Orthogonal subspace projection (OSP) .....................................................12
2.1.4 Kernel principal component analysis (KPCA) .........................................12
2.2 Parametric classifiers........................................................................................13
2.2.1 Gaussian maximum likelihood (GML).......................................................13
2.3 Non–parametric classifiers ..............................................................................14
2.3.1 KNN .............................................................................................................14
2.3.2 SVM..............................................................................................................15
2.4 Conclusions from literature review ..................................................................19
CHAPTER 3 – Mathematical Background...................................21
3.1 What is kernel? ..................................................................................................21
3.2 Feature extraction techniques ..........................................................................24
3.2.1 Segmented principal component analysis (SPCA)....................................25
3.2.2 Projection pursuit (PP) ...............................................................................27
3.2.3 Kernel principal component analysis (KPCA) ..........................................34
3.2.4 Orthogonal subspace projection (OSP) ......................................................38
 
3.3 Supervised classifier..........................................................................................43
3.3.1 Bayesian decision rule ................................................................................43
3.3.2 Gaussian maximum likelihood classification (GML):...............................44
3.3.3 k – nearest neighbor classification.............................................................44
3.3.4 Support vector machine (SVM): .................................................................46
3.4 Analysis of classification results.......................................................................58
3.4.1 One tailed hypothesis testing.....................................................................59
CHAPTER 4 - Experimental Design..................................................61
4.1 Feature extraction technique............................................................................62
4.1.1 SPCA ............................................................................................................62
4.1.2 PP .................................................................................................................62
4.1.3 KPCA............................................................................................................63
4.1.4 OSP...............................................................................................................64
4.2 Experimental design..........................................................................................64
4.3 First set of experiment (SET-I) using parametric and non-parametric
classifier........................................................................................................................66
4.4 Second set of experiment (SET-II) using advance classifier...............................67
4.5 Parameters......................................................................................................68
CHAPTER 5 - Results ....................................................................................69
5.1 Visual inspection of feature extraction techniques .........................................69
 
5.2 Results for parametric and non-parametric classifiers...................................75
5.2.1 Results of classification using GML classifier (GMLC) ...........................75
5.2.2 Class-wise comparison of result for GMLC...............................................81
5.2.3 Classification results using KNN classifier (KNNC) ................................82
5.2.4 Class wise comparison of results for KNNC .............................................91
5.3 Experiment results for SVM based classifiers.................................................92
5.3.1 Experiment results for SVM_QP algorithm..............................................93
5.3.2 Experiment results for SVM_SMO algorithm...........................................97
5.3.3 Experiment results for KPCA_SVM algorithm.......................................100
5.3.4 Class wise comparison of the best result of SVM ...................................103
5.3.5 Comparison of results for different SVM algorithms .............................104
5.4 Comparison of best results of different classifiers.........................................105
5.5 Ramifications of results...................................................................................107
CHAPTER 6 - Summary of Results and Conclusions .......109
6.1 Summary of results..........................................................................................109
6.2 Conclusions.......................................................................................................112
6.3 Recommendations for future work .................................................................112
REFERENCES………………………………………………….……………….115
APPENDIX A……………………………………………………………………..120 
 
 
LIST OF TABLES
Table Title Page
2.1 Summary of literature review 18
3.1 Examples of common kernel functions 23
4.1 List of parameters 68
5.1 The time taken for each FE technique 71
5.2 The best kappa values and z-statistic (at 5% significance level) for GML 80
5.3 Ranking of FE techniques and time required to obtain the best k-value 80
5.4 Classification with KNNC on OD and feature extracted data set 84
5.5 The best k-values and z-statistic for KNNC 89
5.6 Rank of FE techniques and time required to obtain the best k-value 90
5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data sets 95
5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data sets 100
5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets 104
5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms 106
5.11 Statistical comparison of different classifiers' results obtained for different data sets 107
5.12 Ranking of different classification algorithms depending on classification accuracy and time (Rank 1 indicates the best) 109
 
LIST OF FIGURES
Figure Title Page
1.1 Hyperspectral image cube 2
1.2 Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases 4
1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002) 8
1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area 8
1.5 Google earth image of study area 9
3.1 Overview of FE methods 24
3.2 Formation of blocks for SPCA 26
3.2a Chart of multilayered segmented PCA 27
3.3 Layout of the regions for the chi-square projection index 30
3.4 (a) Input points before kernel PCA (b) Output after kernel PCA; the three groups are distinguishable using the first component only 37
3.5 Outline of KPCA algorithm 38
3.6 KNN classification scheme 45
3.7 Outline of KNN algorithm 46
3.8 Linear separating hyperplane for linearly separable data 49
3.9 Non-linear mapping scheme 52
3.10 Brief description of SVM_QP algorithm 54
3.11 Overview of KPCA_SVM algorithm 58
3.12 Definitions and values used in applying one-tailed hypothesis testing 60
4.1 SPCA feature extraction method 62
4.2 Projection pursuit feature extraction method 63
4.3 KPCA feature extraction method 63
4.4 OSP feature extraction method 64
4.5 Overview of classification procedure 66
4.6 Experimental scheme for Set-I experiments 67
4.7 The experimental scheme for advanced classifier (Set-II) 68
5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively 70
5.2 Projection of the data points: (a) most interesting projection direction (b) second most interesting projection direction 71
5.3 First six Segmented Principal Components (SPCs); (b) shows water body and salt lake 72
5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP 72
5.5 First six features obtained by using eight end-members 73
5.6 Two components of most interesting projections 73
5.7 Correlation images after applying various feature extraction techniques 74
5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands 78
5.9 Comparison of kappa values and classification times for GML classification method 81
5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data sets with respect to different sets of TP 82
5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP 85
5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP 86
5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP 87
5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP 88
5.15 Time comparison for KNN classification: time for different bands at different neighbors for (a) 300 TP (b) 200 TP per class 91
5.16 Comparison of best k-value and classification time for original and feature extracted data sets 91
5.17 Class-wise accuracy comparison of OD and different feature extracted data for KNNC 92
5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer 94
5.19 Classification time comparison using 200 and 300 TP per class 97
5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer 100
5.21 Comparison of classification time for different sets of TP with respect to number of bands for SVM_SMO classification algorithm 101
5.22 Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm 103
5.23 Comparison of classification accuracy of individual classes for different SVM algorithms 105
 
LIST OF ABBREVIATIONS
AC          Advanced classifier
DAFE        Discriminant analysis feature extraction
DAIS        Digital airborne imaging spectrometer
DBFE        Decision boundary feature extraction
FE          Feature extraction
GML         Gaussian maximum likelihood
HD          Hyperspectral data
ICA         Independent component analysis
KNN         k-nearest neighbors
k-value     Kappa value
KPCA        Kernel principal component analysis
KPCA_SVM    Support vector machine with kernel principal component analysis
MS          Multispectral data
NWFE        Nonparametric weighted feature extraction
Ncri        Critical value
OD          Original data
OSP         Orthogonal subspace projection
PCA         Principal component analysis
PCT         Principal component transform
PP          Projection pursuit
rbf         Radial basis function
SPCA        Segmented principal component analysis
SV          Support vectors
SVM         Support vector machine
SVM_QP      Support vector machine with quadratic programming optimizer
SVM_SMO     Support vector machine with sequential minimal optimization
TP          Training pixels
Dedicated to my family & guide
 
CHAPTER 1
INTRODUCTION
Remote sensing technology has brought a new dimension to earth observation,
mapping and many other fields. At the beginning of this technology, multispectral
sensors were used for capturing data. Multispectral sensors capture data in a small
number of bands with broad wavelength intervals. Due to the small number of spectral
bands, their spectral resolution is insufficient to discriminate amongst many earth
objects. If, however, the spectral measurement is performed using hundreds of narrow
wavelength bands, then many earth objects can be characterized precisely. This is the
key concept of hyperspectral imagery.
Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger
information content, is voluminous and also differs in its characteristics. The extraction
of this huge amount of information from HD therefore remains a challenge, and cost
effective, computationally efficient procedures are required to classify HD. Data
classification is the categorization of data for its most effective and efficient use; the
desired outcome of classification is a high accuracy thematic map, and HD has the
potential to provide one.
This chapter introduces the concept of high dimensional space, HD and the
difficulties in classifying HD. The next part focuses on the objectives of the thesis,
followed by an overview of the data set used. Details of the software used are
mentioned in the next part of this chapter, followed by the structure of the thesis.
1.1 High dimensional space
In mathematics, an n-dimensional space is a topological space whose
dimension is n (where n is a fixed natural number). A typical example is n-dimensional
Euclidean space, which describes Euclidean geometry in n dimensions.
 
n-dimensional spaces with large values of n are sometimes called high-dimensional
spaces (Werke, 1876). Many familiar geometric objects can be generalized to an
arbitrary number of dimensions. For example, the two-dimensional triangle and the
three-dimensional tetrahedron can be seen as specific instances of more general
n-dimensional objects. In addition, the circle and the sphere are particular forms of the
n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).
1.1.1 What is hyperspectral data?
When the spectral measurement is made using hundreds of narrow contiguous
wavelength intervals, the captured image is called a hyperspectral image. A
hyperspectral image is usually represented by a hyperspectral image cube (Figure 1.1).
In this cube, the x and y axes specify the spatial size of the image and the λ axis
specifies the dimension, or number of bands. A hyperspectral sensor collects
information as a set of images, one for each band, with each image representing a
narrow range of the electromagnetic spectrum.
Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006)
These images are then combined to form a three dimensional hyperspectral
cube. As the dimensionality of HD is very high, it is comparable to a high
dimensional space, and HD exhibits the same characteristics as high dimensional
spaces, which are described in the following section.
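As a simple illustration of this structure, consider the following minimal Python sketch (illustrative only, not code from the thesis, which used Matlab; the 512 x 512 x 65 dimensions are taken from the data set described in Section 1.5):

import numpy as np

# A hyperspectral cube as a 3-D array: x, y (spatial) and lambda (spectral) axes.
rows, cols, bands = 512, 512, 65
cube = np.zeros((rows, cols, bands), dtype=np.float32)

# The spectral signature of one pixel is a vector with one value per band.
signature = cube[100, 200, :]          # shape: (65,)

# For classification the cube is usually flattened to a (pixels x bands) matrix.
X = cube.reshape(-1, bands)            # shape: (262144, 65)
print(signature.shape, X.shape)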
 
1.1.2 Characteristics of high dimensional space
High dimensional spaces, i.e. spaces with a dimensionality greater than three,
have properties that differ substantially from our normal sense of distance,
volume and shape. In particular, in a high-dimensional Euclidean space, volume
expands far more rapidly with increasing diameter than in lower-dimensional
spaces, so that, for example:
(i). Almost all of the volume within a high-dimensional hypersphere lies in a thin
shell near its outer "surface"
(ii). The volume within a high-dimensional hypersphere relative to a hypercube of
the same width tends to zero as dimensionality tends to infinity, and almost all
of the volume of the hypercube is concentrated in its "corners".
The above mentioned characteristics have two important consequences for high
dimensional data. The first is that high dimensional space is mostly empty. As a
consequence, high dimensional data can be projected onto a lower dimensional
subspace without losing significant information in terms of separability among the
different statistical classes (Jimenez and Landgrebe, 1995). The second is that
normally distributed data will have a tendency to concentrate in the tails; similarly,
uniformly distributed data will be more likely to collect in the corners, making density
estimation more difficult. Local neighborhoods are almost empty, requiring the
bandwidth of estimation to be large and producing the effect of losing detailed density
estimation (Abhinav, 2009).
 
Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases (Modified after Jimenez and Landgrebe, 1995)
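The trend shown in Figure 1.2 can be checked with a short sketch (illustrative Python, not from the thesis), using the closed-form ratio of the volume of an inscribed hypersphere to that of the enclosing hypercube:

import math

# Fraction of a hypercube's volume occupied by the inscribed hypersphere:
# f(n) = pi^(n/2) / (2^n * Gamma(n/2 + 1)), the quantity plotted in Figure 1.2.
for n in (1, 2, 3, 5, 10, 20):
    frac = math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))
    print(f"dimension {n:2d}: volume fraction = {frac:.6f}")

# The fraction tends to zero as n grows, i.e. almost all of the hypercube's
# volume lies in its corners, outside the hypersphere.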
1.1.3 Hyperspectral imaging
  Hyperspectral imaging collects and processes information from across the
electromagnetic spectrum. Hyperspectral imagery can differentiate between many
types of earth objects which may appear to be the same color to the human eye.
Hyperspectral sensors look at objects using a vast portion of the electromagnetic
spectrum. The whole process of hyperspectral imaging can be divided into three steps:
preprocessing, radiance to reflectance transformation and data analysis (Varshney
and Arora, 2004).
In particular, preprocessing is required to convert the raw radiance to sensor
radiance. The preprocessing steps include operations such as spectral calibration,
geometric correction, geo-coding and signal-to-noise adjustment. The radiometric and
geometric accuracy of hyperspectral data differs significantly from one band to
another (Varshney and Arora, 2004).
 
1.2 What is classification?
Classification means putting data into groups according to their characteristics.
In the case of spectral classification, the areas of the image that have similar spectral
reflectance are put into the same group or class (Abhinav, 2009). Classification can also
be seen as a means of compressing image data by reducing the large range of digital
numbers (DN) in several spectral bands to a few classes in a single image.
Classification reduces this large spectral space to relatively few regions and
obviously results in a loss of numerical information from the original image. Depending
on the availability of information about the imaged region, supervised or
unsupervised classification methods are used.
1.2.1 Difficulties in hyperspectral data classification
Although HD can provide a higher accuracy thematic map than
MS data, there are some difficulties in classifying high dimensional data,
as listed below:
1. Curse of dimensionality and Hughes phenomenon: As the
dimensionality of the data set increases with the number of bands, the
number of training pixels (TP) required to train a specific classifier
must also increase to achieve the desired classification accuracy (a small
illustration follows this list). It becomes very difficult and expensive to obtain a large
number of TP for each sub-class. This was termed the "curse of
dimensionality" by Bellman (1960), and leads to the concept of the "Hughes
phenomenon" (Hughes, 1968).
2. Characteristics of high dimensional space: The characteristics of high
dimensional space have been discussed in the section above (Sec. 1.1.2). For
these reasons, the algorithms that are used to classify multispectral
data often fail for hyperspectral data.
3. Large number of highly correlated bands: Hyperspectral sensors use
a large number of contiguous spectral bands, so some of these bands are
highly correlated. Correlated bands do not improve the classification result,
so an important task is to select the uncorrelated bands or to make the bands
uncorrelated by applying feature reduction algorithms (Varshney and Arora, 2004).
4. Optimum number of features: It is critical to select the optimum
number of bands, out of the large number available (e.g. 224 bands for an AVIRIS
image), to use in classification. To date there is no suitable algorithm or
rule for selecting the optimal number of features.
5. Large data size and high processing time due to complexity of
classifier: A hyperspectral imaging system provides a large amount of data, so
a large memory and a powerful system, which are generally very expensive, are
necessary to store and handle the data.
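To illustrate the point made in item 1 above, the following minimal sketch (illustrative Python, assuming normally distributed classes; not code from the thesis) shows that when the number of TP per class is smaller than the number of bands, the sample covariance matrix needed by a parametric classifier such as GML becomes singular and cannot be inverted:

import numpy as np

rng = np.random.default_rng(0)
n_bands = 65                       # e.g. the 65 retained DAIS-7915 bands
for n_tp in (25, 100, 300):        # training pixels per class
    samples = rng.normal(size=(n_tp, n_bands))
    cov = np.cov(samples, rowvar=False)          # sample covariance matrix
    rank = np.linalg.matrix_rank(cov)
    status = "singular" if rank < n_bands else "invertible"
    print(f"{n_tp:3d} TP -> covariance rank {rank} of {n_bands} ({status})")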
1.3 Background of work
This thesis work is an extension of the work done by Abhinav Garg (2009) in his
M.Tech thesis. In his thesis, he showed that among the conventional classifiers
(Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER),
GML provides the best results. The performance of GML improved significantly
after applying feature extraction (FE) techniques. Principal component analysis
(PCA) was found to work best among all the FE techniques considered (discriminant
analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE)
and independent component analysis (ICA)) in improving the classification accuracy of GML.
For the advanced classifiers, SVM's results do not depend on the choice of
parameters but ANN's do. He also showed that SVM's results were improved by using
PCA and ICA, while supervised FE techniques such as NWFE and DBFE
failed to improve them significantly.
He also pointed out some drawbacks of advanced classifiers (AC) like SVM and
suggested some FE techniques which may improve the results for conventional
classifiers (CC) as well as AC. In particular, for a large number of TP (e.g. 300 per class)
SVM takes much more processing time than for a small number of TP. The objectives of
this thesis work are to address these problems and to find the best FE technique for
improving the classification results for HD. The objectives are described in the next
section.
 
1.4 Objectives
This thesis has investigated the following two objectives pertaining to
classification with hyperspectral data:
Objective-1:
To evaluate various FE techniques for classification of hyperspectral data.
Objective-2:
To study the extent to which advanced classifiers can reduce the problems related to
the classification of hyperspectral data.
1.5 Study area and data set used
The study area for this research is located within an area known as 'La
Mancha Alta', covering approximately 8000 sq. km to the south of Madrid, Spain (Fig.
1.3). The area is mainly used for the cultivation of wheat, barley and other crops such as
vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer
on 29 June 2000 at 5 m spatial resolution.
Data were collected over 79 wavebands ranging from 0.4 μm to 12.5 μm, with the
exception of 1.1 μm to 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to
2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were
observed between bands 41 and 72. All 72 bands were visually examined and 7
bands (41, 42 and 68 to 72) were found to be useless due to very severe striping and were
removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels
covering the area of interest was extracted (Abhinav, 2009).
The data set available for this research work includes the 65 bands retained after
pre-processing and a reference image generated with the help of field data collected
from local farmers, as described in Pal (2002). The imaged area is divided into eight
different land cover types, namely wheat, water body, salt lake, hydrophytic
vegetation, vineyards, bare soil, pasture land and built-up area.
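The band selection just described can be written as a one-line sketch (illustrative Python, not the original pre-processing code):

# 72 candidate bands; bands 41, 42 and 68-72 dropped due to severe striping.
bad_bands = {41, 42, 68, 69, 70, 71, 72}
retained = [b for b in range(1, 73) if b not in bad_bands]
print(len(retained))               # 65 bands retained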
 
 
Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)
Figure 1.4: FCC obtained by first 3 principal components and superimposed
reference image showing training data available for classes
identified for study area (Pal, 2002).
 
Figure 1.5: Google earth image of study area (Google earth, 2007)
1.6 Software details
Processing HD requires a very powerful system due to the size of the
data set and the complexity of the algorithms. The machine used for this thesis work
contains a 2.16 GHz Intel processor with 2 GB RAM, running the Windows 7 operating
system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All results
were obtained on the same machine so that the different algorithms could be compared.
1.7 Structure of thesis
The present thesis is organized into six chapters. Chapter 1 focuses on the
characteristics of high dimensional space, the challenges of HD classification and an
outline of the experiments of this thesis work. It also discusses the study region, the
data set and the software used in this thesis work. Chapter 2 presents a detailed
description of HD classification and the previous research work related to this domain.
Chapter 3 describes the detailed mathematical background of the different processes
used in this work. Chapter 4 outlines the detailed methodology carried out for this
thesis work. Chapter 5 presents the experiments conducted for this thesis, followed by
their interpretation. Chapter 6 provides the conclusions of the present work and the
scope for future work.
 
CHAPTER 2
LITERATURE REVIEW 
This chapter outlines the important research work and major achievements in
the field of high dimensional data analysis and data classification. The chapter begins
with some of the FE techniques and classification approaches suggested by various
researchers for solving problems related to HD classification. The results of relevant
experiments with HD are also included to highlight the usefulness and reliability of
these approaches; these results are presented in tabulated form. Some other issues
related to the classification of HD are discussed at the end of this chapter.
2.1 Dimensionality reduction by feature extraction
Swain and Davis (1978) described various separability measures for
multivariate normal class models. Statistical classes are often found to overlap, which
causes misclassification errors, as most classifiers use a decision boundary approach
for classification. The idea was to obtain a separability measure which could give an
overall estimate of the range of classification accuracies that can be achieved by using
a subset of selected features, so that the subset of features corresponding to the
highest classification accuracy can be selected for classification (Abhinav, 2009).
FE is the process of transforming the given data from a higher dimensional
space to a lower dimensional space while conserving the underlying information
(Fukunaga, 1990). The philosophy behind such a transformation is to re-distribute the
underlying information spread in the high dimensional space by containing it in a
comparatively smaller number of dimensions without losing a significant amount of
useful information. In the case of classification, FE techniques try to enhance class
separability while reducing data dimensionality (Abhinav, 2009).
 
2.1.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been applied successfully to
multispectral data for feature reduction. It can also be used as a tool for image
enhancement and digital change detection (Lodwick, 1979). For the dimension
reduction of HD, PCA outperforms those FE techniques which are based on class
statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited
and its ratio to the number of dimensions is low for HD, the class covariance matrix
cannot be estimated properly. To overcome these problems, Jia (1996) proposed the
scheme of segmented principal component analysis (SPCA), which applies the PCT to
each of several highly correlated blocks of bands. This approach also reduces the
processing time by partitioning the complete set of bands into several highly correlated
blocks. Jensen and James (1999) showed that SPCA-based compression generally
outperforms PCA-based compression in terms of detection and classification accuracy
on decompressed HD. PCA works efficiently for highly correlated data sets, but SPCA
works efficiently for both highly and weakly correlated data sets (Jia, 1996).
Jia (1996) compared SPCA and PCA extracted features for target detection and
concluded that SPCA is a better FE technique than PCA. She also showed that both
feature extracted data sets are identical and that there is no loss of variance in the
intermediate stages, as long as no components are removed.
2.1.2 Projection pursuit (PP)
  Projection pursuit (PP) methods were originally posed and experimented with by
Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman
and Tukey (1974). They described PP as a way of searching for and exploring
nonlinear structure in multi-dimensional data by examining many 2-D projections.
Their goal was to find interesting views of a high dimensional data set. The next stages
in the development of the technique were presented by Jones (1983) who, amongst
other things, developed a projection index based on polynomial moments of the data.
Huber (1985) presented several aspects of PP, including the design of projection
indices. Friedman (1987) derived a transformed projection index. Hall (1989)
developed an index using methods similar to Friedman's, and also developed
theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b)
introduced a projection index called the chi-square projection pursuit index. Posse
(1995a, 1995b) used a random search method to locate a plane with an optimal value
of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. Each projection found in this
manner shows a structure that is less important (in terms of the projection index)
than the previous one. More recently, the PP technique has also been used to obtain 1-D
projections (Martinez, 2005). In this research work, Posse's method, which reduces an
n-dimensional data set to 2-dimensional data, is followed.
2.1.3 Orthogonal subspace projection (OSP)
Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP)
method, which simultaneously reduces the data dimensionality, suppresses undesired
or interfering spectral signatures, and detects the presence of a spectral signature of
interest. The concept is to project each pixel vector onto a subspace which is
orthogonal to the undesired signatures. For OSP to be effective, the number of
bands must not be less than the number of signatures, which is a significant limitation
for multispectral images. To overcome this, Ren and Chang (2000)
presented the generalized OSP (GOSP) method, which relaxes this constraint in such a
manner that OSP can be extended to multispectral image processing in an
unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci,
2001) and also for magnetic resonance image classification (Wang et al., 2001).
2.1.4 Kernel principal component analysis (KPCA)
Linear PCA cannot always detect all the structure in a given data set. By the use of
a suitable nonlinear feature extractor, more information can be extracted from the data
set. Kernel principal component analysis (KPCA) can be used as a strong
nonlinear FE method (Scholkopf and Smola, 2002), which maps the input vectors to a
feature space and then applies PCA to the mapped vectors. KPCA is also a
powerful preprocessing step for classification algorithms (Mika et al., 1998).
Rosipal et al. (2001) proposed the application of the KPCA technique for feature
selection in a high-dimensional feature space where the input variables were mapped by
a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of
the higher-order statistics. To capture these higher-order statistics, a large number of
TP is required. This causes problems for KPCA, since KPCA requires storing and
manipulating the kernel matrix, whose size is the square of the number of TP. To
overcome this problem, a new iterative algorithm for KPCA, the kernel Hebbian
algorithm (KHA), was introduced by Scholkopf et al. (2005).
2.2 Parametric classifiers
Parametric classifiers (Fukunaga, 1990) require some parameters to develop
the assumed density function model for the given data. These parameters are
computed with the help of a set of already classified or labeled data points called
training data, which is a subset of the given data for which the class labels are known,
chosen by sampling techniques (Abhinav, 2009). The training data are used to compute
class statistics to obtain the assumed density function for each class. Such classes are
referred to as statistical classes (Richards and Jia, 2006), as they depend on the
training data and may differ from the actual classes.
2.2.1 Gaussian maximum likelihood (GML)
The maximum likelihood method is based on the assumption that the frequency
distribution of class membership can be approximated by the multivariate normal
probability distribution (Mather, 1987). Gaussian maximum likelihood (GML) is one
of the most popular parametric classifiers and has been used conventionally for the
classification of remotely sensed data (Landgrebe, 2003). The advantages of the GML
classification method are that it can achieve the minimum classification error under
the assumption that the spectral data of each class are normally distributed, and that
it considers not only the class centre but also its shape, size and orientation, by
calculating a statistical distance based on the mean values and covariance matrices of
the clusters (Lillesand et al., 2002).
Lee and Landgrebe (1993) compared the results of the GML classifier on PCA and
DBFE feature extracted data sets and concluded that the DBFE feature extracted data
set provides better accuracy than the PCA feature extracted data set. Kuo and
Landgrebe (2004) compared the NWFE and DAFE FE techniques in terms of the
classification accuracy achieved by nearest neighbor and GML classifiers, and concluded
that NWFE is a better FE technique than DAFE. Abhinav (2009) investigated the effect
of PCA, ICA, DAFE, DBFE and NWFE feature extracted data sets on the GML
classifier. He showed that, among these feature extractors, PCA is the best FE
technique for HD with the GML classifier. He also suggested that FE techniques such
as KPCA, OSP, SPCA and PP may improve the classification results obtained with the
GML classifier.
2.3 Non–parametric classifiers
Non-parametric classifiers (Fukunaga, 1990) use some control
parameters, carefully chosen by the user, to estimate the best fitting function by
using an iterative or learning algorithm. They may or may not require training
data for estimating the PDF. The Parzen window (Parzen, 1962) and k-nearest neighbor
(KNN) (Cover and Hart, 1967) are two popular classifiers in this
category. Edward (1972) gave brief descriptions of many non-parametric approaches
for the estimation of data density functions.
2.3.1 KNN
The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern
recognition. The technique can achieve high classification accuracy in problems with
unknown and non-normal distributions. However, it has the major drawback that
a large number of TP is required by the classifier, resulting in high computational
complexity for classification (Hwang and Wen, 1998).
Pechenizkiy (2005) compared the performance of the KNN classifier on PCA
and random projection (RP) feature extracted data sets and concluded that KNN
performs better on the PCA feature extracted data set. Zhu et al. (2007) showed that
KNN works better on an ICA feature extracted data set than on the original data set
(OD), captured by a hyperspectral imaging system developed by the ISL; the ICA-KNN
method with a few wavelengths had the same performance as the KNN classifier alone
using information from all wavelengths.
Some more non-parametric classifiers based on geometrical approaches to data
classification were found during the literature survey. These approaches consider the
data points to be located in Euclidean space and exploit the geometrical patterns
of the data points for classification. Such approaches are grouped into a class of
classifiers known as machine learning techniques. Support vector machines (SVM)
(Boser et al., 1992) and k-nearest neighbor (KNN) (Fix and Hodges, 1951) are among
the popular classifiers of this kind. These make no assumptions regarding the data
density functions or the discriminating functions and hence are purely non-parametric
classifiers. However, these classifiers also need to be trained using the training data.
2.3.2 SVM
SVM is considered an advanced classifier. SVM is a new generation
classification technique based on statistical learning theory, with its origins in
machine learning, introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995,
1998) discussed SVM based classification in detail. SVM improves learning by
minimizing the learning error through empirical risk minimization (ERM) and by
minimizing the upper bound on the overall expected classification error through
structural risk minimization (SRM). SVM makes use of the principle of optimal
separation of classes to find a separating hyperplane that separates the classes of
interest by maximizing the margin between the classes (Vapnik, 1992). This technique
differs from the estimation of effective decision boundaries used by Bayesian
classifiers, as only data vectors near the decision boundary (known as support
vectors) are required to find the optimal hyperplane. A linear hyperplane may not be
enough to classify the given data set without error. In such cases, the data are
transformed to a higher dimensional space using a non-linear transformation that
spreads the data apart such that a linear separating hyperplane may be found. Kernel
functions are used to reduce the computational complexity that arises due to the
increased dimensionality (Varshney and Arora, 2004).
The advantages of SVM (Varshney and Arora, 2004) lie in its high generalization
capability and its ability to adapt its learning characteristics by using kernel functions,
due to which it can adequately classify data in a high-dimensional feature space
with a limited number of training samples and is not affected by the Hughes
phenomenon and other effects of dimensionality. The ability to classify using even a
limited number of training samples makes SVM a very powerful classification tool
for remotely sensed data. Thus, SVM has the potential to produce accurate
classifications from HD with a limited number of training samples. SVMs are believed
to be better learning machines than neural networks, which tend to overfit the classes
and cause misclassification (Abhinav, 2009), because they rely on margin maximization
rather than finding a decision boundary directly from the training samples.
For conventional SVM, an optimizer based on quadratic programming (QP) or
linear programming (LP) methods is used to solve the optimization problem. The
major disadvantage of the QP algorithm is the storage requirement of the kernel matrix
in memory: when the kernel matrix is large enough, it requires a huge amount of
memory that may not always be available. To overcome this, Bennett and Campbell
(2000) suggested an optimization method which sequentially updates the Lagrange
multipliers, called the kernel adatron (KA) algorithm. Another approach is the
decomposition method, which updates the Lagrange multipliers in parallel, updating
many parameters in each iteration unlike other methods that update one parameter at
a time (Varshney and Arora, 2004). The QP optimizer used here updates the Lagrange
multipliers on a fixed-size working data set. The decomposition method uses a QP or
LP optimizer to solve the problem of a huge data set by considering many small data
sets rather than a single huge data set (Varshney, 2001). The sequential minimal
optimization (SMO) algorithm (Platt, 1999) is a special case of the decomposition
method in which the size of the working data set is fixed such that an analytical
solution can be derived in very few numerical operations, without using QP or LP
optimization. This method needs a larger number of iterations but requires a small
number of operations per iteration, thus resulting in an increase in optimization
speed for very large data sets.
The speed of SVM classification decreases as the number of support vectors
(SV) increases. By using kernel mapping, different SVM algorithms have successfully
incorporated effective and flexible nonlinear models. However, there are major
difficulties for large data sets due to the calculation of the nonlinear kernel matrix. To
overcome these computational difficulties, some authors have proposed low rank
approximations to the full kernel matrix (Wiens, 92). As an alternative, Lee and
Mangasarian (2002) proposed the reduced support vector machine (RSVM), which
reduces the size of the kernel matrix, but this leaves the problem of selecting the
number of support vectors (SV). In 2009, Sundaram proposed a method which reduces
the number of SV through the application of KPCA. This method differs from other
proposed methods in that the exact choice of support vectors is not important as long
as the vectors span a fixed subspace.
Benediktsson et al. (2000) applied KPCA to the ROSIS-03 data set, then used a
linear SVM on the feature extracted data set and showed that KPCA features
are more linearly separable than the features extracted by conventional PCA. Shah et
al. (2003) compared SVM, GML and ANN classifiers for accuracies at full
dimensionality and using DAFE and DBFE FE techniques on an AVIRIS data set, and
concluded that SVM gives higher accuracies than GML and ANN at full
dimensionality but poorer accuracies for features extracted by DAFE and DBFE.
Abhinav (2009) compared SVM, GML and ANN on the OD and on PCA, ICA, NWFE,
DBFE and DAFE feature extracted data sets. He concluded that SVM provides better
results on the OD than GML; SVM works best with PCA and ICA feature extracted data
sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.
The work done by various researchers with different hyperspectral data sets,
using different classifiers and FE methods, and the results obtained by them are
summarized in Table 2.1.
 
 
Table 2.1: Summary of literature review
Author | Dataset used | Method used | Results obtained
Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier used to compare classification accuracies obtained by DBFE and PCA FE | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods
Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics studied with respect to the effects of dimensionality and the order of data statistics used in supervised classification techniques | The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy increased with the use of higher order statistics; lower order statistics were less affected by the Hughes phenomenon
Benediktsson et al. (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets used for classification with a linear SVM | KPCA features are more linearly separable than features extracted by conventional PCA
Shah et al. (2003) | AVIRIS | SVM, GML and ANN classifiers compared for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques | SVM gave higher accuracies than GML and ANN at full dimensionality but poor accuracies for features extracted by DAFE and DBFE
Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC Mall, Washington, US) | NWFE and DAFE FE techniques compared for classification accuracy achieved by nearest neighbor and GML classifiers | NWFE produced better classification accuracies than DAFE
Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier used to compare classification accuracies obtained by PCA and random projection FE | PCA gave better results than random projection
Zhu et al. (2007) | Hyperspectral imaging system developed by the ISL | ICA ranking methods used to select the optimal wavelengths for KNN; KNN alone was then used for comparison | The ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands
Sundaram (2009) | The adult dataset, part of the UCI Machine Learning Repository | KPCA applied to the support vectors, then the usual SVM algorithm used | Significantly reduced processing time without affecting classification accuracy
Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques used on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets | GML was the best among these techniques and performs best on the PCA extracted data set
Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques used on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare accuracy | GML performed much worse than SVM on the OD; SVM provides better accuracy than GML and performs better on PCA and ICA extracted data sets
2.4 Conclusions from literature review
 
1. From Table 2.1, it can easily be concluded that FE techniques like PCA,
ICA, DAFE, DBFE and NWFE perform well in improving classification
accuracy when used with GML. However, the features extracted by DBFE and
DAFE failed to improve the results obtained by SVM, implying a limitation of these
techniques for advanced classifiers. KNN works best with PCA and ICA
feature extracted data sets. However, in the surveyed literature the effects of
PP, SPCA, KPCA and OSP extracted features on the classification accuracy
obtained from advanced classifiers like SVM, parametric classifiers like GML
and the non-parametric classifier KNN have not been studied.
2. Another important aspect found missing in the literature is a comparison of
classification times for SVM classifiers, since SVM takes a long time to train
with a large number of TP. Many SVM approaches have been proposed to reduce
the classification time, but there is no conclusion on the best SVM algorithm in
terms of classification accuracy and processing time.
3. Although KNN is an effective classification technique for HD, there is no guideline
on classification time or suggestion of the best FE technique for the KNN classifier.
The effects of parameters such as the number of nearest neighbors, the number of
TP and the number of bands are also not reported for KNN.
 
4. During the literature survey, it was further found that there is no suggestion of
the best FE technique for the different SVM algorithms, GML and KNN.
These missing aspects will be investigated in this thesis work, and guidelines
for choosing an efficient and less time consuming classification technique will be
presented as the result of this research.
This chapter presented the FE and classification techniques for mitigating the
effects of dimensionality. These techniques resulted from different approaches used
to deal with the problem of high dimensionality and to improve the performance of
advanced, parametric and non-parametric classifiers. The approaches were applied to
real-life HD, and the comparative results reported in the literature were compiled and
presented here. In addition, the important aspects found missing in the literature
survey, which this thesis work will try to investigate, were highlighted. The
mathematical rationale and algorithms used to apply these techniques are discussed
in detail in the next chapter.
 
 
CHAPTER 3
MATHEMATICAL BACKGROUND
This chapter provides the detailed mathematical background of each of the
techniques used in this thesis. Starting with some basic concepts of kernels and
kernel space, it describes the unsupervised and supervised FE techniques, followed by
the classification and optimization rules for the supervised classifiers. Finally, the
scheme for statistical analysis used for comparing the results of the different
classification techniques is discussed.
The notation followed in this chapter for matrices and vectors is given below:
X          A two dimensional matrix whose columns represent the data points (m) and whose rows represent the number of bands (n), i.e. X is an n × m matrix.
xi         An n-dimensional single-pixel column vector, where X = [x1, x2, ..., xm] and xi = [x1i, x2i, ..., xni]^T.
cj         Represents the jth class.
Φ(z)       Mapping of the input vector z into kernel space, using some kernel function.
⟨a, b⟩     Inner product of the vectors a and b.
∈          Belongs to.
R^n        Set of n-dimensional real vectors.
N          Set of natural numbers.
[·]^T      Transpose of a matrix.
∀          For all.
3.1 What is kernel?
  Before defining a kernel, consider the following two definitions:
• Input space: the space where the original data points lie.
• Feature space: the space spanned by the transformed data points (from the
original space) which were mapped by some function.
A kernel is the dot product in a feature space H reached via a map Φ from the input
space, such that Φ: X → H. A kernel can be defined as k(x, x') = ⟨Φ(x), Φ(x')⟩, where
x, x' are elements of the input space and Φ(x), Φ(x') are elements of the feature space;
k is called the kernel and Φ is called the feature map associated with k (Φ can also be
called the kernel function). The space containing these dot products is called the
kernel space. This is a nonlinear mapping from the input space to the feature space
which increases the internal distance between two points in a data set. This means
that a data set which is nonlinearly separable in the input space becomes linearly
separable in the kernel space. A few definitions related to kernels are given below:
Gram matrix: Given a kernel k and inputs x1, x2, ..., xn ∈ X, the n × n matrix
K := (k(xi, xj))ij is called the Gram matrix of k with respect to x1, x2, ..., xn.
Positive definite matrix: A real n × n symmetric matrix K satisfying c^T K c ≥ 0 for
all column vectors c = (c1, c2, ..., cn)^T ∈ R^n is called positive definite. If equality
occurs only for c1 = c2 = ... = cn = 0, then the matrix is called strictly positive definite.
Positive definite kernel: Let X be a nonempty set. A function k: X × X → R which,
for all n ∈ N and all xi ∈ X, i = 1, ..., n, gives rise to a positive definite Gram matrix
is called a positive definite kernel. A function k: X × X → R which, for all n ∈ N and
all distinct xi ∈ X, gives rise to a strictly positive definite Gram matrix is called a
strictly positive definite kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
 
Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)
Kernel function type | Definition K(x, xi) | Parameters | Performance depends on
Linear | x · xi | none | Decision boundary, either linear or non-linear
Polynomial with degree n | (x · xi + 1)^n | n is a positive integer | User defined parameters
Radial basis function | exp(-||x - xi||^2 / (2σ^2)) | σ is a user defined value | User defined parameters
Sigmoid | tanh(k(x · xi) + Θ) | k and Θ are user defined parameters | User defined parameters
All of the above definitions are illustrated with the following simple example.
Let
$$X = [x_1 \; x_2 \; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$$
be a matrix in input space whose columns $x_i$ ($i = 1, 2, 3$) denote the data points and whose rows denote the dimensions of the data points. Let this matrix be mapped into the feature space using a Gaussian kernel function, and let $\langle x_i, x_j \rangle$ denote the inner product of the columns of $X$ in that feature space.
Then the Gram matrix (kernel matrix) $K$ takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
and the numerical value of the matrix $K$ is
$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$
$K$ is a symmetric matrix. If $K$ turns out to be positive definite, the kernel is called a positive definite kernel, and if $K$ is strictly positive definite, the kernel is called a strictly positive definite kernel.
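As an illustration of the above example, a minimal Python sketch of how such a Gram matrix can be computed and its positive definiteness checked is given below. The bandwidth $\sigma$ of the Gaussian kernel is not stated in the example above, so the value used here ($\sigma = 1$) is only an assumption, and the resulting numbers need not match those quoted.

```python
import numpy as np

# Columns of X are the data points x1, x2, x3 from the example above.
X = np.array([[1, 2, 1],
              [2, 1, 3],
              [1, 1, 3]], dtype=float)

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel k(xi, xj) = exp(-||xi - xj||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

m = X.shape[1]
K = np.array([[gaussian_kernel(X[:, i], X[:, j]) for j in range(m)]
              for i in range(m)])

print(np.round(K, 4))                  # symmetric Gram matrix
eigvals = np.linalg.eigvalsh(K)        # real eigenvalues of the symmetric K
# Non-negative eigenvalues indicate a positive definite kernel in the
# sense defined above; strictly positive eigenvalues indicate a strictly
# positive definite kernel.
print("positive definite:", np.all(eigvals >= 0))
```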
3.2 Feature extraction techniques
FE techniques are based on the simple assumption that a given data sample $x \in X$ ($X \subset R^n$), belonging to an unknown probability distribution in $n$-dimensional space, can be represented in some coordinate system in an $m$-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate system such that, when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. There are two main conditions for an optimal dimension reduction (Carreira-Perpinan, 1997):
(i) Eliminate dimensions with very low information content; features with low information content can be discarded as noise.
(ii) Remove redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors.
Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). For the unsupervised approach, segmented principal component analysis (SPCA) and projection pursuit (PP) are used, and for the supervised approach, kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) are used. The next sub-sections discuss the assumptions made by these FE techniques in detail.
Figure 3.1: Overview of FE methods
3.2.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied to multispectral data analysis and is a powerful tool for FE. For hyperspectral image data, PCT outperforms FE techniques that are based on class statistics. The main advantage of using PCT is that global statistics are used to determine the transform functions. However, implementing PCT on a high dimensional data set requires a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996).
The complete data set is first partitioned into K subgroups with respect to the correlation of bands. From the correlation image of HD, it can be seen that blocks are formed by highly correlated bands (Figure 3.2); these blocks are selected as the subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2 and $k$ respectively (Figure 3.2a). PCT is then applied to each subgroup of data, and significant features are selected using the variance information of each component. The PCs containing about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.
Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27
bands respectively, corresponding to highly correlated bands have been
formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all the variance, as with the conventional PCT. No information is lost whether the transformation is conducted on the complete vector at once or on a few sub-vectors separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the intermediate stages compared to the no-segmentation case. However, the difference in compression rate is relatively small if the segmented transformation is developed on subgroups that have poor correlation with each other.
Figure 3.2a: Chart of multilayered segmented PCA
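A minimal sketch of the two-stage SPCA procedure described above is given below. The block boundaries (taken from the 32, 6 and 27-band blocks of Figure 3.2), the 99% variance threshold, and the random data standing in for a hyperspectral image are illustrative assumptions only; in practice the blocks are chosen from the correlation image.

```python
import numpy as np

def pca_block(Y, var_keep=0.99):
    """PCA on one band subgroup Y (bands x pixels); keep the PCs explaining
    roughly var_keep of the total variance of that subgroup."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    C = np.cov(Yc)                                   # subgroup covariance (bands x bands)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    n_keep = np.searchsorted(np.cumsum(vals) / vals.sum(), var_keep) + 1
    return vecs[:, :n_keep].T @ Yc                   # selected PCs (features x pixels)

def segmented_pca(X, blocks, var_keep=0.99):
    """Segmented PCA: X is (bands x pixels); blocks is a list of slices of
    highly correlated contiguous bands, e.g. [slice(0, 32), slice(32, 38), ...]."""
    # First stage: PCA within each correlated block.
    stage1 = np.vstack([pca_block(X[b, :], var_keep) for b in blocks])
    # Second stage: regroup the selected features and transform again.
    return pca_block(stage1, var_keep)

# Illustrative use with random data standing in for a hyperspectral image.
X = np.random.rand(65, 1000)                         # 65 bands, 1000 pixels
features = segmented_pca(X, [slice(0, 32), slice(32, 38), slice(38, 65)])
print(features.shape)
```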
3.2.2 Projection pursuit (PP)
Projection pursuit (PP) refers to a technique first described by Friedman and
Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by
means of selected low dimensional linear projections. To reach this goal, an objective function called the projection index is assigned to every projection, characterizing the structure present in that projection. Interesting projections are then picked up automatically by optimizing the projection index numerically. Interesting projections have usually been defined as those exhibiting departure from normality (the normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).
Posse (1990) proposed an algorithm based on a random search and a chi-
squared projection index for finding the most interesting plane (two-dimensional
view). The optimization method was able to locate in general the global maximum of
the projection index over all two-dimensional projections (Posse, 1995). The chi-
squared index was efficient, being fast to compute and sensitive to departure from
normality in the core rather than in the tail of the distribution. In this investigation only the chi-squared projection index (Posse, 1995a, 1995b) has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:
(i) A projection pursuit index measures the degree of departure from normality.
(ii) A method for finding the projection that yields the highest value for the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal
value of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. The interesting projections are
found in decreasing order of the value of the PP index. This implies that each
projection found in this manner shows a structure that is less important (in terms of
the projection index) than the previous one. In the following discussion, first the chi-
squared PP index has been described followed by the structure finding procedure.
Finally, the structure removal procedure is illustrated.
3.2.2.1 Posse chi-square index
Posse proposed a projection index based on the chi-square statistic. The plane is first divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, distributed in the form of rings (Figure 3.3). The inner boxes have the same radial width $R/5$ and all boxes have the same angular width of $45^\circ$. $R$ is chosen so that the boxes have approximately the same weight under normally distributed data, which gives a radial width of $(2\log 6)^{1/2}/5$. The outer boxes have weight $1/48$ under normally distributed data. This choice of radial width provides regions with approximately the same probability under the standard bivariate normal distribution (Martinez, 2001). The projection index is given as:
$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\!\left(z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)}\right) - c_k \right]^2 \qquad (3.1)$$
where
$\phi$ — the standard bivariate normal density.
$c_k$ — probability evaluated over the $k$th region using the normal density function, given by $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$.
$B_k$ — box in the projection plane.
$\lambda_j$ — $\lambda_j = j\pi/36$, $j = 0, \ldots, 8$, the angle by which the data are rotated in the plane before being assigned to regions.
$\alpha, \beta$ — orthonormal $p$-dimensional vectors which span the projection plane (they can be the first two PCs or two randomly chosen pixels of the OD set).
$P(\alpha, \beta)$ — the plane consisting of the two orthonormal vectors $\alpha, \beta$.
$z_i^{\alpha}, z_i^{\beta}$ — sphered observations projected onto the vectors $\alpha$ and $\beta$ ($z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$).
$\alpha(\lambda_j)$ — $\alpha \cos \lambda_j - \beta \sin \lambda_j$.
$\beta(\lambda_j)$ — $\alpha \sin \lambda_j + \beta \cos \lambda_j$.
$I_{B_k}$ — the indicator function for region $B_k$.
$PI_{\chi^2}(\alpha, \beta)$ — the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.
The chi-square projection index is not affected by the presence of outliers.
However, it is sensitive to distributions that have a hole in the core, and it will also
yield projections that contain clusters. The chi-square projection pursuit index is fast
and easy to compute, making it appropriate for large sample sizes. Posse (1995a)
provides a formula to approximate the percentiles of the chi-square index.
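A minimal sketch of how the chi-square projection index of Eq. (3.1) can be evaluated is given below. It assumes the sphered data Z are stored with observations as rows, uses $R = (2 \log 6)^{1/2}$ as discussed above, and approximates the region probabilities $c_k$ from the fact that the squared radius of a standard bivariate normal has a chi-square distribution with two degrees of freedom; the function names are illustrative.

```python
import numpy as np
from math import pi

def box_index(z1, z2, R):
    """Region index (0..47) for a point (z1, z2): 40 inner boxes
    (5 radial rings of width R/5 x 8 sectors of 45 deg) plus 8 outer regions."""
    r = np.hypot(z1, z2)
    theta = np.arctan2(z2, z1) % (2 * pi)
    sector = int(theta // (pi / 4))            # 0..7
    ring = int(r // (R / 5))
    if ring >= 5:                              # beyond radius R: outer region
        return 40 + sector
    return ring * 8 + sector

def region_probs(R):
    """Approximate probability of each region under the standard bivariate
    normal, using P(radius < a) = 1 - exp(-a^2 / 2)."""
    c = np.zeros(48)
    edges = np.linspace(0, R, 6)
    for ring in range(5):
        p_ring = np.exp(-edges[ring] ** 2 / 2) - np.exp(-edges[ring + 1] ** 2 / 2)
        c[ring * 8:(ring + 1) * 8] = p_ring / 8
    c[40:] = np.exp(-R ** 2 / 2) / 8           # outer regions (= 1/48 for this R)
    return c

def posse_chi2_index(Z, alpha, beta, R=np.sqrt(2 * np.log(6))):
    """Chi-square projection index of Eq. (3.1) for sphered data Z (n x p)
    and an orthonormal plane (alpha, beta)."""
    n = Z.shape[0]
    c = region_probs(R)
    index = 0.0
    for j in range(9):                         # rotations lambda_j = j*pi/36
        lam = j * pi / 36
        a = np.cos(lam) * alpha - np.sin(lam) * beta
        b = np.sin(lam) * alpha + np.cos(lam) * beta
        za, zb = Z @ a, Z @ b
        counts = np.zeros(48)
        for z1, z2 in zip(za, zb):
            counts[box_index(z1, z2, R)] += 1
        index += np.sum((counts / n - c) ** 2 / c)
    return index / 9
```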
Figure 3.3: Layout of the regions for the chi-square projection index (Modified after Posse, 1995a).
3.2.2.2 Finding the structure (PPEDA algorithm)
For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by
$$a_1 = \frac{\alpha^* + c v}{\left\|\alpha^* + c v\right\|}, \qquad b_1 = \frac{\beta^* - (a_1^T \beta^*)\, a_1}{\left\|\beta^* - (a_1^T \beta^*)\, a_1\right\|}$$
$$a_2 = \frac{\alpha^* - c v}{\left\|\alpha^* - c v\right\|}, \qquad b_2 = \frac{\beta^* - (a_2^T \beta^*)\, a_2}{\left\|\beta^* - (a_2^T \beta^*)\, a_2\right\|} \qquad (3.2)$$
where $c$ is a scalar that determines the size of the neighborhood visited and $v$ is a unit $p$-vector uniformly distributed on the unit $p$-dimensional sphere. The idea is to start with a global search and then to concentrate on the region of the global maximum by decreasing the value of $c$. After a specified number of steps, called half, without an increase in the projection index, the value of $c$ is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global, to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of $m$ such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane.
The steps for PPEDA are given below (a code sketch of this random search follows the list):
1. Sphere the OD set; let $Z$ be the matrix of sphered data.
2. Generate a random starting plane $(\alpha^0, \beta^0)$, where $\alpha^0$ and $\beta^0$ are orthonormal. Consider this as the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane $(\alpha^*, \beta^*)$.
7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.
8. If the index does not improve for a certain number of steps, decrease the value of $c$ by half.
9. Repeat steps 4 to 8 until $c$ becomes some small number (say 0.01).
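The following minimal sketch implements the random search outlined above (Eq. (3.2) and steps 1 to 9). The projection-index argument can be, for example, the posse_chi2_index sketch given earlier; the parameter values and helper names are illustrative, and the data are assumed to be of full rank so that sphering is well defined.

```python
import numpy as np

def sphere(X):
    """Sphere the data: zero mean, identity covariance (X is n_obs x p)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def random_plane(p, rng):
    """Random orthonormal pair (alpha, beta) in R^p."""
    q, _ = np.linalg.qr(rng.standard_normal((p, 2)))
    return q[:, 0], q[:, 1]

def candidate(alpha, beta, c, v):
    """Candidate plane of Eq. (3.2) built from the current best (alpha, beta)."""
    a = alpha + c * v
    a /= np.linalg.norm(a)
    b = beta - (a @ beta) * a
    b /= np.linalg.norm(b)
    return a, b

def ppeda_search(Z, index, c=1.0, half=30, c_min=0.01, rng=None):
    """Random search (Posse, 1990) for the plane maximizing a projection index."""
    rng = rng or np.random.default_rng(0)
    p = Z.shape[1]
    alpha, beta = random_plane(p, rng)
    best = index(Z, alpha, beta)
    no_improve = 0
    while c > c_min:
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)
        improved = False
        for sign in (+1, -1):                 # the two candidates of Eq. (3.2)
            a, b = candidate(alpha, beta, sign * c, v)
            val = index(Z, a, b)
            if val > best:
                best, alpha, beta, improved = val, a, b, True
        no_improve = 0 if improved else no_improve + 1
        if no_improve >= half:                # halve the neighborhood size
            c, no_improve = c / 2.0, 0
    return alpha, beta, best
```

For hyperspectral data, the returned plane (alpha, beta) would be the first interesting structure; structure removal (next sub-section) would then be applied before searching again.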
3.2.2.3 Structure removal
There may be more than one interesting projection, and there may be other
views that reveal insights about the hyperspectral data. To locate other views,
Friedman (1987) proposed a method called structure removal. In this approach, first
we perform the PP algorithm on the data set to obtain the structure which means the
optimal projection plane. The approach then removes the structure found at that
projection, and repeats the projection pursuit process to find a projection that yields
another maximum value of the projection pursuit index. By proceeding in this
manner, it will give a sequence of projections providing informative views of the data.
The procedure repeatedly transforms the projected data to standard normal until
they stop becoming more normal as measured by the projection pursuit index. One
starts with a $p \times p$ matrix whose first two rows are the vectors of the projection obtained from PPEDA. The remaining rows have '1' on the diagonal and '0' elsewhere. For example, if $p = 4$, then
$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$
The Gram-Schmidt orthonormalization process (Strang, 1988) is used to make the rows of $U^*$ orthonormal; let $U$ be the resulting orthonormal matrix. The next step in the structure removal process is to transform the $Z$ matrix using
$$T = U Z \qquad (3.4)$$
where $T$ is a $p \times n$ matrix. With this transformation, the first two rows of $T$, for every transformed observation, are the projections onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then performed by applying a transformation $\Theta$ which transforms the first two rows of $T$ to a standard normal and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows:
$$\Theta(T_1) = \phi^{-1}\!\left[F(T_1)\right]$$
$$\Theta(T_2) = \phi^{-1}\!\left[F(T_2)\right]$$
$$\Theta(T_i) = T_i, \quad i = 3, 4, \ldots, p \qquad (3.5)$$
where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$ and $T_2$ are the first two rows of the matrix $T$, and $F$ is a function defined in Eq. (3.7).
From Eq. (3.5), it is seen that only the first two rows of $T$ change. $T_1$ and $T_2$ can be written as
$$T_1 = \left(z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*}\right)$$
$$T_2 = \left(z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*}\right) \qquad (3.6)$$
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the $j$th observation projected onto the plane spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as follows:
$$z_j^{1(t)} = z_j^{1(t)} \cos \gamma + z_j^{2(t)} \sin \gamma$$
$$z_j^{2(t)} = z_j^{2(t)} \cos \gamma - z_j^{1(t)} \sin \gamma \qquad (3.7)$$
where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_j^{1(t)}$ represents the $j$th element of $T_1$ at the $t$th iteration of the process. The following transformation, applied to the rotated points of Eq. (3.7), replaces each rotated observation by its normal score in the projection:
$$z_j^{1(t+1)} = \phi^{-1}\!\left[\frac{r\!\left(z_j^{1(t)}\right) - 0.5}{n}\right]$$
$$z_j^{2(t+1)} = \phi^{-1}\!\left[\frac{r\!\left(z_j^{2(t)}\right) - 0.5}{n}\right] \qquad (3.8)$$
where $r\!\left(z_j^{1(t)}\right)$ represents the rank of $z_j^{1(t)}$.
With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed using this process, the data are transformed back using
$$Z' = U^T\, \Theta(U Z) \qquad (3.9)$$
From matrix theory (Strang, 1988), it is known that all directions orthogonal to the structure (i.e., all rows of $T$ other than the first two) have not been changed, whereas the structure has been Gaussianized and then transformed back. The next section summarizes the steps of PP.
3.2.2.4 Steps of PP
1. Load the data and set the values of the parameters: the number of best projection planes (N), the number of random starts in the neighborhood (m), the value of c, and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) using the Posse chi-square index (Section 3.2.2.2).
4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the best projection planes (orthogonal to each other) are obtained.
3.2.3 Kernel principal component analysis (KPCA)
Kernel principal component analysis (KPCA) means conducting PCT in feature space (kernel space). KPCA is applied to variables which are nonlinearly related to the input variables. In this section, the KPCA algorithm is developed from the PCA algorithm.
First, $m$ TPs ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the covariance matrix
$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T \qquad (3.10)$$
The covariance matrix $C$ is positive semi-definite; hence its eigenvalues are non-negative:
$$\lambda v = C v \qquad (3.11)$$
For PCA, the eigenvalues are first sorted in decreasing order and the corresponding eigenvectors are found; the PCs are obtained by projecting a test point onto these eigenvectors. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) in Eq. (3.11),
$$C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T v = \lambda v$$
Thus
$$\lambda v = \frac{1}{m} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \qquad (3.12)$$
since $(x x^T)\, v = (x \cdot v)\, x$.
In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions $v$ with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.
$$v = \sum_{i=1}^{m} \alpha_i x_i \qquad (3.13)$$
Steps for KPCA
1. First transform the TPs into feature space $H$ using a kernel function $\Phi$. The data set $(\Phi(x_i), i = 1, \ldots, m)$ in feature space is assumed to be centered to reduce the complexity of the calculation. The covariance matrix of the data set in $H$ takes the form
$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^T \qquad (3.14)$$
2. Find the eigenvalues $\lambda \geq 0$ and corresponding non-zero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix $C$ from the equation
$$\lambda v = C v \qquad (3.15)$$
3. As shown previously (for PCA), all solutions $v$ (with $\lambda \neq 0$) lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.
$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.16)$$
Therefore,
$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.17)$$
Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_j) \Phi(x_j)^T \Phi(x_i) \qquad (3.18)$$
4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Substituting this in Eq. (3.18), the following equation is obtained:
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_j)\, K(x_j, x_i) \qquad (3.19)$$
5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^T$ for all $k = 1, \ldots, m$. Define the $m \times m$ matrix $K$, called the kernel matrix, whose $ij$th element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length $m$ whose $j$th element is the coefficient $\alpha_j$.
6. Eq. (3.19) can then be written as
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_k)^T \Phi(x_j) \Phi(x_j)^T \Phi(x_i), \quad \forall\, k = 1, 2, \ldots, m \qquad (3.20)$$
Using $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, Eq. (3.20) can be transformed to
$$m \lambda K \alpha = K^2 \alpha \qquad (3.21)$$
To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved:
$$m \lambda \alpha = K \alpha \qquad (3.22)$$
7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel matrix $K$. Let $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$ be the eigenvalues of $K$ and $\beta_1, \beta_2, \ldots, \beta_m$ the corresponding eigenvectors, with $\lambda_p$ being the last non-zero eigenvalue.
 
8. To extract principal components, the projections onto the eigenvectors $\beta_n$ in $H$ ($n = 1, \ldots, p$) need to be computed. Let $x$ be a test point, with image $\Phi(x)$ in $H$. Then
$$\beta_n^T \Phi(x) = \sum_{i=1}^{m} \beta_{ni} \langle \Phi(x_i), \Phi(x) \rangle \qquad (3.23)$$
9. In the above algorithm, it has been assumed that the data set is centered, but it is certainly difficult to obtain the mean of the mapped data in feature space $H$ (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in feature space. However, there is a way to do it by slightly modifying the equation for kernel PCA: instead of $K$, the kernel matrix
$$\widetilde{K}_{i,j} = \left(K - 1_m K - K 1_m + 1_m K 1_m\right)_{i,j}, \quad \text{where } (1_m)_{ij} := \frac{1}{m} \;\; \forall\, i, j \qquad (3.24)$$
is diagonalized.
Figure 3.4: (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).
Figure 3.5 provides the outline of the KPCA algorithm.
 
Figure 3.5: Outline of KPCA algorithm
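A minimal sketch of the KPCA steps above (kernel matrix, centering by Eq. (3.24), eigen-decomposition by Eq. (3.22), and projection by Eq. (3.23)) is given below. The Gaussian kernel, its bandwidth, and the data sizes are illustrative assumptions; the eigenvectors are rescaled so that the corresponding feature-space eigenvectors have unit norm.

```python
import numpy as np

def kpca(X_train, X_test, kernel, n_components):
    """Kernel PCA sketch following Eqs. (3.14)-(3.24).
    X_train: (m x n) training pixels, X_test: (q x n) pixels to project."""
    m = X_train.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    one_m = np.full((m, m), 1.0 / m)
    # Centering in feature space, Eq. (3.24)
    K_tilde = K - one_m @ K - K @ one_m + one_m @ K @ one_m
    vals, vecs = np.linalg.eigh(K_tilde)               # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Rescale so that the feature-space eigenvectors have unit norm
    # (only the leading, non-zero eigenvalues are used here).
    betas = np.array([vecs[:, i] / np.sqrt(vals[i]) for i in range(n_components)]).T
    # Projection of test points, Eq. (3.23), with the test kernel centered consistently.
    K_test = np.array([[kernel(x, xj) for xj in X_train] for x in X_test])
    one_q = np.full((K_test.shape[0], m), 1.0 / m)
    K_test_c = K_test - one_q @ K - K_test @ one_m + one_q @ K @ one_m
    return K_test_c @ betas                             # (q x n_components) features

# Illustrative use with a Gaussian kernel (sigma chosen arbitrarily here).
rbf = lambda a, b, s=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))
Xtr = np.random.rand(50, 5)                             # 50 training pixels, 5 bands
Xte = np.random.rand(10, 5)
print(kpca(Xtr, Xte, rbf, n_components=3).shape)
```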
3.2.4 Orthogonal subspace projection (OSP)
The idea of orthogonal subspace projection is to eliminate all unwanted or undesired spectral signatures (background) within a pixel and then use a matched filter to extract the desired spectral signature (endmember) present in that pixel.
 
3.2.4.1 Automated target generation process algorithm (ATGP)
In hyperspectral image analysis a pixel may encompass many different materials; such pixels are called mixed pixels and contain multiple spectral signatures. Let a column vector $r_i$ represent the $i$th mixed pixel by the linear model
$$r_i = M \alpha_i + n_i \qquad (3.25)$$
where $r_i$ is an $l \times 1$ column vector and $l$ is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember; assume that there are $p$ spectrally distinct endmembers in the $i$th mixed pixel. $M$ is a matrix of dimension $l \times p$ made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$. The system is considered overdetermined ($l > p$), and $m_j$ denotes the spectral signature of the $j$th distinct material or endmember. Let $\alpha_i$ be a $p \times 1$ column vector given by $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^T$, where the $j$th element represents the fraction of the $j$th signature present in the $i$th mixed pixel. $n_i$ is an $l \times 1$ column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where $I$ is an $l \times l$ identity matrix.
In Eq. (3.25), each $r_i$ is assumed to be a linear combination of the $p$ endmembers with weight coefficients given by the fraction vector $\alpha_i$. The term $M \alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures; in other words, targets are separated from the background. When searching for a single spectral signature this can be written as
$$M \alpha = d\, \alpha_p + U \gamma \qquad (3.26)$$
where $d$ is the $l \times 1$ desired signature of interest (the column vector $m_p$) and $\alpha_p$ is $1 \times 1$, the fraction of the desired signature. The matrix $U$ is composed of the remaining column vectors of $M$; these are the undesired spectral signatures or background information. It is given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components (fractions) of $\alpha$.
Suppose $P$ is an operator which eliminates the effects of $U$, the undesired signatures. To do this, an operator (the orthogonal subspace operator) is developed that projects $r$ onto a subspace orthogonal to the columns of $U$. This results in a vector that only contains energy associated with the target $d$ and the noise $n$. The operator used is the $l \times l$ matrix
$$P = I - U (U^T U)^{-1} U^T \qquad (3.27)$$
The operator $P$ maps $d$ into a space orthogonal to the space spanned by the uninteresting signatures in $U$. Applying the operator $P$ to the mixed pixel $r$ of Eq. (3.25),
$$P r = \alpha_p P d + P U \gamma + P n \qquad (3.28)$$
It should be noticed that $P$ operating on $U \gamma$ reduces the contribution of $U$ to zero (close to zero in real data applications). Therefore, from the above rearrangement,
$$P r = \alpha_p P d + P n \qquad (3.29)$$
3.2.4.2 Signal-to-noise ratio (SNR) maximization
The second step in deriving the pixel classification operator is to find the $1 \times l$ operator $X^T$ that maximizes the SNR. Operating on Eq. (3.28) gives
$$X^T P r = \alpha_p X^T P d + X^T P U \gamma + X^T P n \qquad (3.30)$$
The operator $X^T$ acting on $P r$ produces a scalar (Ientilucci, 2001). The SNR is given by
$$\lambda = \frac{\alpha_p^2\, X^T P d\, d^T P^T X}{X^T P\, E\!\left[n n^T\right] P^T X} \qquad (3.31)$$
$$\lambda = \left(\frac{\alpha_p^2}{\sigma^2}\right) \frac{X^T P d\, d^T P^T X}{X^T P P^T X} \qquad (3.32)$$
where $E[\;]$ denotes the expected value. Maximization of this quotient is the generalized eigenvector problem
$$P d\, d^T P^T X = \lambda' P P^T X \qquad (3.33)$$
where $\lambda' = \left(\dfrac{\sigma^2}{\alpha_p^2}\right) \lambda$. The value of $X^T$ which maximizes $\lambda$ can be determined using the techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^T$ which maximizes the SNR is
$$X^T = k\, d^T \qquad (3.34)$$
where $k$ is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$ vector
$$q^T = d^T P \qquad (3.35)$$
This operator first nulls the interfering signatures and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar which is a measure of the presence of the signature of interest. The ultimate aim is to reduce the $l$ images that make up the hyperspectral image cube into a single image where pixels with high intensity indicate the presence of the desired signature.
This operator can easily be extended to seek out $k$ signatures of interest. The vector operator simply becomes a $k \times l$ matrix operator given by
$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \qquad (3.36)$$
When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Ultimately, the $l$-dimensional hyperspectral image is reduced to a $k$-dimensional feature-extracted image, where each band corresponds to one desired signature and pixels with high intensity indicate its presence.
The above algorithm is discussed with the following example:
Let us start with three vectors or classes, each six elements or bands long. The
vectors are in reflectance units and can be seen below.
$$Concrete = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix} \qquad Tree = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix} \qquad Water = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$
Suppose the image consists of 100 pixels, ordered from left to right, and let the 40th pixel be
$$pixel_{40} = 0.08\,(concrete) + 0.75\,(tree) + 0.07\,(water) + noise \qquad (3.37)$$
Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image; the same procedure can be followed to extract the tree and water materials.
Assume that $pixel_{40}$ is made up of some weighted linear combination of the endmembers:
$$pixel_{40} = M \alpha + noise \qquad (3.38)$$
Now $M \alpha$ can be broken up into the desired, $d\alpha$, and undesired, $U\gamma$, signatures. Assign the spectra to the desired signature $d$ and the undesired signatures $U$: let concrete be the vector $d$, and let tree and water be the column vectors of the matrix $U$. The fractions of mixing are unknown to us, but it is known that $pixel_{40}$ is made up of some combination of $d$ and $U$:
$$d = [\,concrete\,], \qquad U = [\,tree \;\; water\,]$$
It is now required to reduce the effect of $U$. To do this, a projection operator $P$ is needed which, when operated on $U$, reduces its contribution to zero. To find concrete, $d$, $pixel_{40}$ is projected onto a subspace that is orthogonal to the columns of $U$ using the operator $P$. In other words, $P$ maps $d$ into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of $U$. If $P$ is operated on $U$, which contains tree and water, the effect of $U$ is minimized:
$$P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad (3.39)$$
Now let $r_1 = pixel_{40}$ and $n = noise$; then from Eq. (3.29),
$$P r_1 = \alpha_p P d + P n \qquad (3.40)$$
The operator $x^T$ which maximizes the signal-to-noise ratio (SNR) now needs to be found. The operator $x^T$ acting on $P r_1$ produces a scalar. As stated before, the value of $x^T$ which maximizes the SNR is $X^T = k\, d^T$. This leads to the overall OSP operator of Eq. (3.35). In this way the matrix $Q$ of Eq. (3.36) can be formed; the entire data vector can then be projected along the columns of $Q$, and the OSP feature-extracted image is formed.
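The OSP example above can be reproduced with the short sketch below, which builds the projection operator of Eq. (3.27), verifies Eq. (3.39), and applies the classification operator of Eq. (3.35) to the mixed pixel of Eq. (3.37) (noise assumed zero).

```python
import numpy as np

# Endmember spectra from the example above (6 bands each).
concrete = np.array([0.26, 0.30, 0.31, 0.31, 0.31, 0.31])
tree     = np.array([0.07, 0.07, 0.11, 0.54, 0.55, 0.54])
water    = np.array([0.07, 0.13, 0.19, 0.25, 0.30, 0.34])

d = concrete                        # desired signature
U = np.column_stack([tree, water])  # undesired signatures

# Orthogonal subspace projection operator, Eq. (3.27): P = I - U (U^T U)^{-1} U^T
l = d.shape[0]
P = np.eye(l) - U @ np.linalg.inv(U.T @ U) @ U.T

print(np.round(P @ U, 10))          # ~zero: the undesired signatures are nulled, Eq. (3.39)

# Overall OSP operator for the desired signature, Eq. (3.35): q^T = d^T P
q = d @ P

# The mixed pixel of Eq. (3.37), with the noise taken as zero.
pixel = 0.08 * concrete + 0.75 * tree + 0.07 * water
print(q @ pixel)                    # scalar proportional to the concrete abundance
```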
3.3 Supervised classifiers
This section describes the mathematical background of the supervised classifiers. It first describes the Bayesian decision rule, followed by the decision rule for the Gaussian maximum likelihood (GML) classifier. Afterwards it describes the k-nearest neighbor (KNN) and support vector machine (SVM) classification rules.
3.3.1 Bayesian decision rule
In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are $K$ classes, let $f_k(x)$ be the distribution function of the $k$th class, where $0 < k \leq K$, and let $P(c_k)$ be the prior probability of the $k$th class, such that $\sum_{k=1}^{K} P(c_k) = 1$.
For any class $k$, the posterior probability for a pixel vector $x$ is denoted by $p(c_k | x)$ and defined by (assuming all classes are mutually exclusive):
$$p(c_k | x) = \frac{P(x | c_k)\, P(c_k)}{\sum_{k=1}^{K} f_k(x)\, P(c_k)} \qquad (3.41)$$
Therefore, the Bayes decision rule is:
$$x \in c_i \quad \text{if} \quad p(c_i | x) = \max_k\, p(c_k | x) \qquad (3.41a)$$
 
3.3.2 Gaussian maximum likelihood classification (GML)
The Gaussian maximum likelihood classifier assumes that the distribution of the data points is Gaussian (normally distributed) and classifies an unknown pixel based on the variance and covariance of the spectral response patterns. The classification is based on the probability density function associated with the training data: pixels are assigned to the most likely class based on a comparison of the posterior probabilities that they belong to each of the signatures being considered. Under this assumption, the distribution of a category response pattern can be completely described by the mean vector and the covariance matrix. With these parameters, the statistical probability of a given pixel value being a member of a particular land cover class can be computed (Lillesand et al., 2002). GML classification can obtain the minimum classification error under the assumption that the spectral data of each class are normally distributed. It considers not only the cluster centre but also its shape, size and orientation, by calculating a statistical distance based on the mean values and covariance matrices of the clusters. The decision boundary function for GML classification is
$$g_k(x) = -\frac{1}{2}\left[\ln\left|\hat{\Sigma}_k\right| + (x - \hat{\mu}_k)^T\, \hat{\Sigma}_k^{-1}\, (x - \hat{\mu}_k)\right] \qquad (3.42)$$
and the final Bayesian decision rule is
$$x \in c_j \quad \text{if} \quad g_j(x) = \max_k\, g_k(x)$$
where $g_k(x)$ is the decision boundary function for the $k$th class.
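A minimal sketch of the GML decision rule of Eq. (3.42) is given below. Equal prior probabilities are assumed (so the prior term is dropped, as in Eq. (3.42)); the synthetic data, class structure and function names are illustrative only.

```python
import numpy as np

def gml_train(train_pixels, train_labels):
    """Estimate per-class mean vectors and covariance matrices from TPs.
    train_pixels: (N x n) array, train_labels: length-N array of class ids."""
    stats = {}
    for c in np.unique(train_labels):
        Xc = train_pixels[train_labels == c]
        stats[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return stats

def gml_classify(x, stats):
    """Assign pixel x to the class maximizing the discriminant of Eq. (3.42):
    g_k(x) = -1/2 [ ln|S_k| + (x - mu_k)^T S_k^{-1} (x - mu_k) ]."""
    best_class, best_g = None, -np.inf
    for c, (mu, S) in stats.items():
        diff = x - mu
        g = -0.5 * (np.log(np.linalg.det(S)) + diff @ np.linalg.solve(S, diff))
        if g > best_g:
            best_class, best_g = c, g
    return best_class

# Illustrative use with two synthetic classes in 3 "bands".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (100, 3)), rng.normal(0.6, 0.05, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
stats = gml_train(X, y)
print(gml_classify(np.array([0.58, 0.61, 0.59]), stats))   # -> 1
```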
3.3.3 k-nearest neighbor classification
The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical applications; one of its shortcomings is lazy learning, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951).
The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant. An unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.
          
 
 
Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified
either to the first class of squares or to the second class of triangles. If k
= 3, it is classified to the second class because there are 2 triangles and
only 1 square inside the inner circle. If k = 5, it is classified to first class
(3 squares vs. 2 triangles inside the outer circle).If k = 11, it is classified
to first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).
Let $x$ be an $n$-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) be the $n$-dimensional TPs. The Euclidean distance between them is defined by
$$d_i(x, y_i) = \sqrt{(x_1 - y_{i1})^2 + (x_2 - y_{i2})^2 + \ldots + (x_n - y_{in})^2} \qquad (3.43)$$
where $x = (x_1, x_2, \ldots, x_n)$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{d_1, d_2, \ldots, d_p\}$, and $p$ is the number of TP.
The final KNN decision rule is:
$$x \in c_j \quad \text{if at least} \quad \begin{cases} \left\lceil \dfrac{k+1}{2} \right\rceil, & k \text{ even} \\[2mm] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \quad \text{of the } k \text{ smallest elements of } D \text{ correspond to class } c_j \qquad (3.44)$$
In case of a tie, the test pixel is assigned to the class $c_j$ whose mean vector is at minimum distance from it. Here $k$ ($1 \leq k \leq p$) is a user-defined parameter giving the number of nearest neighbors used for classification. The outline of the KNN classification algorithm is given in Figure 3.7.
Figure 3.7: Outline of KNN algorithm
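A minimal sketch of the KNN rule described above, including the tie-break by distance to the class mean vectors, is given below; the synthetic data and function names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_pixels, train_labels, class_means, k=5):
    """Classify test pixel x by majority vote among its k nearest TPs
    (Euclidean distance, Eq. (3.43)); ties are broken by the distance
    to the class mean vectors, as described above."""
    d = np.sqrt(np.sum((train_pixels - x) ** 2, axis=1))   # distances to all TPs
    nearest = train_labels[np.argsort(d)[:k]]              # labels of the k nearest TPs
    votes = Counter(nearest.tolist())
    top = votes.most_common()
    winners = [c for c, v in top if v == top[0][1]]
    if len(winners) == 1:
        return winners[0]
    # Tie-break: assign to the tied class whose mean vector is closest to x.
    return min(winners, key=lambda c: np.linalg.norm(x - class_means[c]))

# Illustrative use with two synthetic classes in 2 "bands".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.05, (50, 2)), rng.normal(0.7, 0.05, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
print(knn_classify(np.array([0.65, 0.7]), X, y, means, k=5))   # -> 1
```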
3.3.4 Support vector machine (SVM)
The foundations of support vector machines (SVM) were developed by Vapnik (1995). The formulation embodies the structural risk minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks: SRM minimizes an upper bound on the expected risk, as opposed to ERM, which minimizes the error on the training data. SVMs were developed to solve classification problems, but they have since been extended to the domain of regression problems (Vapnik et al., 1997).
SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the class of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions:
(i) All the data vectors that belong to the same class are placed on the same side of the separating hyperplane.
(ii) The distance between the two closest data vectors in the two classes is maximized (Vapnik, 1982).
The main aim of SVM is thus to define an optimum hyperplane between two classes which maximizes the margin between them. For each class, the data vectors forming the boundary of the class are called the support vectors (SV), and the hyperplane is called the decision surface (Pal, 2002).
3.3.4.1 Statistical learning theory
The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known classes and for predicting the outcome for data points of unknown identity. Two induction principles are commonly used. The first is ERM, whose aim is to reduce the training error; the second is SRM, whose goal is to minimize the upper bound on the expected error over the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999): first, it does not depend on the unknown cumulative distribution function, and second, it can be minimized with respect to the parameters used in the decision rule.
3.3.4.2 Vapnik–Chervonenkis dimension (VC-dimension)
The VC-dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by $h$, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways. For example, for an arbitrary two-class problem, the VC-dimension is the maximum number of points $k$ which can be separated into two classes without error in all possible $2^k$ ways (Varshney and Arora, 2004).
3.3.4.3 Support vector machine algorithm with quadratic optimization
method (SVM_QP): 
The procedure for obtaining a separating hyperplane by SVM is explained here for the simple linearly separable case of two classes which can be separated by a hyperplane; it can be extended to the multiclass classification problem. The procedure can also be extended, via the kernel method for SVM, to the case where a hyperplane cannot separate the two classes.
Let there be $n$ training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, $m$ is the dimension of the data vector, and each sample belongs to either of the two classes labeled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in $m$-dimensional space whose orientation is given by a vector $w$ and whose location is determined by a scalar $b$, the offset of the hyperplane from the origin (Figure 3.8). If such a hyperplane exists, the given set of training data points must satisfy the following inequalities:
$$w \cdot x_i + b \geq +1, \quad \forall\, i : y_i = +1 \qquad (3.45)$$
$$w \cdot x_i + b \leq -1, \quad \forall\, i : y_i = -1 \qquad (3.46)$$
Thus, the equation of the hyperplane is given by $w \cdot x + b = 0$.
Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after
Gunn, 1998).
The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality:
$$y_i (w \cdot x_i + b) \geq 1 \qquad (3.47)$$
Thus, the decision rule for the linearly separable case can be written as
$$x_i \in \operatorname{sign}(w \cdot x_i + b) \qquad (3.48)$$
where $\operatorname{sign}(\cdot)$ is the signum function, whose value is +1 for any argument greater than or equal to zero and −1 otherwise. The signum function can thus easily represent the two classes given by the labels +1 and −1.
The separating hyperplane (Figure 3.8) separates the two classes optimally when its margin from both classes is equal and maximum (Varshney, 2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation for a point $x$ from the hyperplane defined by $w$ and $b$. It is given by
$$D(x; w, b) = \frac{|w \cdot x + b|}{\|w\|_2} \qquad (3.49)$$
where $\|\cdot\|_2$ denotes the second norm, which is equivalent to the Euclidean length of the vector for which it is computed, and $|\cdot|$ is the absolute value function. Let $d$ be the value of the margin between the two separating planes. It can be expressed as the separation between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$:
$$d = \frac{(w \cdot x + b) + 1}{\|w\|_2} - \frac{(w \cdot x + b) - 1}{\|w\|_2} = \frac{2}{\|w\|_2} = \frac{2}{\sqrt{w^T w}} \qquad (3.49a)$$
To obtain an optimal hyperplane, the margin value $d$ should be maximized, i.e. $2/\|w\|_2$ should be maximized, which is equivalent to minimizing the 2-norm of the vector $w$. Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to
$$\Phi(w) = \frac{1}{2} w^T w \qquad (3.50)$$
A constrained optimization problem can be constructed for minimizing the objective function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of constrained optimization problem, with a convex objective function of $w$ and linear constraints, is called a primal problem and can be solved using standard quadratic programming (QP) optimization techniques. The QP optimization can be implemented by replacing the inequalities with a simpler form, transforming the problem into a dual space representation using Lagrange multipliers $\lambda_i$ (Luenberger, 1984). The vector $w$ can be defined in terms of the Lagrange multipliers $\lambda_i$ as shown below:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.51)$$
The dual optimization problem obtained through the Lagrange multipliers $\lambda_i$ thus becomes
$$\max_{\lambda}\; L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \qquad (3.52)$$
subject to the constraints:
$$\sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.53)$$
$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \qquad (3.54)$$
The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero; the training samples corresponding to nonzero multipliers are the SVs. The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^o = (\lambda_1^o, \lambda_2^o, \ldots, \lambda_{n_s}^o)$, where $n_s$ is the number of support vectors found. Substituting these in Eq. (3.51) gives the orientation of the optimal separating hyperplane $w^o$:
$$w^o = \sum_{i=1}^{n} \lambda_i^o y_i x_i \qquad (3.55)$$
The offset from the origin, $b^o$, is determined from the equation
$$b^o = -\frac{1}{2}\left[w^o \cdot x_{+1}^o + w^o \cdot x_{-1}^o\right] \qquad (3.56)$$
where $x_{+1}^o$ and $x_{-1}^o$ are support vectors of class labels +1 and −1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and −1:
$$f(x) = \operatorname{sign}\!\left(\sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b^o\right) \qquad (3.57)$$
Eq. (3.57) implies that
$$x \in \operatorname{sign}\!\left(\sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b^o\right) \qquad (3.58)$$
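A minimal sketch of the SVM_QP training procedure for the linearly separable two-class case is given below. It solves the dual problem of Eqs. (3.52)–(3.54) with scipy's SLSQP routine as a stand-in for a dedicated QP optimizer, recovers w and b from Eqs. (3.55)–(3.56), and classifies with the sign rule of Eqs. (3.57)–(3.58); the data, tolerances and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def svm_qp_train(X, y):
    """Solve the dual of Eqs. (3.52)-(3.54) for a linearly separable two-class
    set (y in {-1, +1}), then recover w and b from Eqs. (3.55)-(3.56)."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T            # Q_ij = y_i y_j (x_i . x_j)
    dual = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()   # negative of Eq. (3.52)
    res = minimize(dual, np.zeros(n), method="SLSQP",
                   bounds=[(0, None)] * n,                # Eq. (3.54)
                   constraints={"type": "eq", "fun": lambda lam: lam @ y})  # Eq. (3.53)
    lam = res.x
    w = ((lam * y)[:, None] * X).sum(axis=0)              # Eq. (3.55)
    # Support vectors are the TPs with nonzero multipliers; take the strongest
    # one from each class to evaluate Eq. (3.56).
    x_pos = X[np.argmax(np.where(y == +1, lam, -np.inf))]
    x_neg = X[np.argmax(np.where(y == -1, lam, -np.inf))]
    b = -0.5 * (w @ x_pos + w @ x_neg)                    # Eq. (3.56)
    return w, b, lam

def svm_classify(x, w, b):
    return np.sign(w @ x + b)                             # Eqs. (3.57)-(3.58), linear case

# Illustrative use on two well-separated synthetic classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b, lam = svm_qp_train(X, y)
print(svm_classify(np.array([1.5, 2.0]), w, b))           # -> 1.0
```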

  • 9. viii    5.2 Results for parametric and non-parametric classifiers...................................75 5.2.1 Results of classification using GML classifier (GMLC) ...........................75 5.2.2 Class-wise comparison of result for GMLC...............................................81 5.2.3 Classification results using KNN classifier (KNNC) ................................82 5.2.4 Class wise comparison of results for KNNC .............................................91 5.3 Experiment results for SVM based classifiers.................................................92 5.3.1 Experiment results for SVM_QP algorithm..............................................93 5.3.2 Experiment results for SVM_SMO algorithm...........................................97 5.3.3 Experiment results for KPCA_SVM algorithm.......................................100 5.3.4 Class wise comparison of the best result of SVM ...................................103 5.3.5 Comparison of results for different SVM algorithms .............................104 5.4 Comparison of best results of different classifiers.........................................105 5.5 Ramifications of results...................................................................................107 CHAPTER 6 - Summary of Results and Conclusions .......109 6.1 Summary of results..........................................................................................109 6.2 Conclusions.......................................................................................................112 6.3 Recommendations for future work .................................................................112 REFERENCES………………………………………………….……………….115 APPENDIX A……………………………………………………………………..120   
  • 10. ix    LIST OF TABLES Table Title Page 2.1 Summary of literature review 18 3.1 Examples of common kernel functions 23 4.1 List of parameters 68 5.1 The time taken for each FE techniques 71 5.2 The best kappa values and z-statistic (at 5% significance values) for GML 80 5.3 Ranking of FE techniques and time required to obtain the best k- value 80 5.4 Classification with KNNC on OD and feature extracted data set 84 5.5 The best k-values and z-statistic for KNNC 89 5.6 Rank of FE techniques and time required to obtain best k-value 90 5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set 95 5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set 100 5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets 104 5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms 106 5.11 Statistical comparison of different classifier’s results obtained for different data sets 107 5.12 Ranking of different classification algorithms depending on classification accuracy and time. (Rank: 1 indicate the best) 109
  • 11. x    LIST OF FIGURES Figure Title Page 1.1 Hyperspectral image cube 2 1.2 Fractional volume of a hypersphere inscribed in hypercube decrease as dimension increases 4 1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002 8 1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area 8 1.5 Google earth image of study area 9 3.1 Overview of FE methods 24 3.2 Formation of blocks for SPCA 26 3.2a Chart of multilayered segmented PCA 27 3.3 Layout of the regions for the chi-square projection index 30 3.4 (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only 37 3.5 Outline of KPCA algorithm 38 3.6 KNN classification scheme 45 3.7 Outline of KNN algorithm 46 3.8 Linear separating hyperplane for linearly separable data 49 3.9 Non-linear mapping scheme 52 3.10 Brief description of SVM_QP algorithm 54 3.11 Overview of KPCA_SVM algorithm 58 3.12 Definitions and values used in applying one-tail hypothesis testing 60 4.1 SPCA feature extraction method 62
  • 12. xi    4.2 Projection pursuit feature extraction method 63 4.3 KPCA feature extraction method 63 4.4 OSP feature extraction method 64 4.5 Overview of classification procedure 66 4.6 Experimental scheme for Set-I experiments 67 4.7 The experimental scheme for advanced classifier (Set-II) 68 5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively 70 5.2 Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction 71 5.3 First six Segmented Principal Components (SPCs) (b) shows water body and salt lake 72 5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP 72 5.5 First six features obtained by using eight end-members 73 5.6 Two components of most interesting projections 73 5.7 Correlation images after applying various feature extraction techniques 74 5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands 78 5.9 Comparison of kappa values and classification times for GML classification method 81 5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP 82 5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP 85 5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP 86 5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP 87 5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP 88 5.15 Time comparison for KNN classification. Time for different bands 91
  • 13. xii    at different neighbors for (a) 300 TP (b) 200 TP training data per class 5.16 Comparison of best k-value and classification time for original and feature extracted data set 91 5.17 Class wise accuracy comparison of OD and different feature extracted data for KNNC 92 5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer 94 5.19 Classification time comparison using 200 and 300 TP per class 97 5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer 100 5.21 Comparison of classification time different set of TPs with respect to number of bands for SVM_SMO classification algorithm 101 5.22 Overall kappa values observed for classification original and feature modified data sets using KPCA_SVM algorithm. 103 5.23 Comparison of classification accuracy of individual classes for different SVM algorithms 105
  • 14. xiii LIST OF ABBREVIATIONS
AC: Advance classifier
DAFE: Discriminant analysis feature extraction
DAIS: Digital airborne imaging spectrometer
DBFE: Decision boundary feature extraction
FE: Feature extraction
GML: Gaussian maximum likelihood
HD: Hyperspectral data
ICA: Independent component analysis
KNN: k-nearest neighbors
k-value: Kappa value
KPCA: Kernel principal component analysis
KPCA_SVM: Support vector machine with kernel principal component analysis
MS: Multispectral data
NWFE: Nonparametric weighted feature extraction
Ncri: Critical value
OD: Original data
OSP: Orthogonal subspace projection
PCA: Principal component analysis
PCT: Principal component transform
PP: Projection pursuit
rbf: Radial basis function
SPCA: Segmented principal component analysis
SV: Support vectors
SVM: Support vector machine
SVM_QP: Support vector machine with quadratic programming optimizer
  • 15. xiv SVM_SMO: Support vector machine with sequential minimal optimizer
TP: Training pixels
Dedicated to my family & guide
  • 16. 1 CHAPTER 1 INTRODUCTION Remote sensing technology has brought a new dimension to earth observation, mapping and many other fields. In the early days of this technology, multispectral sensors were used for capturing data. Multispectral sensors capture data in a small number of bands with broad wavelength intervals. Because of the few spectral bands, their spectral resolution is insufficient to discriminate amongst many earth objects. If, however, the spectral measurement is performed using hundreds of narrow wavelength bands, then many earth objects can be characterized precisely. This is the key concept of hyperspectral imagery. Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger information content, is more voluminous and also differs in its characteristics. Extracting this large amount of information from HD therefore remains a challenge, and cost-effective, computationally efficient procedures are required to classify HD. Data classification is the categorization of data for its most effective and efficient use. The desired result of classification is a high-accuracy thematic map, and HD has the potential to provide one. This chapter introduces the concept of high dimensional space, HD and the difficulties in classifying HD. The next part focuses on the objectives of the thesis, followed by an overview of the data set used in this thesis. Details of the software used are mentioned in the next part of this chapter, followed by the structure of the thesis. 1.1 High dimensional space In mathematics, an n-dimensional space is a topological space whose dimension is n (where n is a fixed natural number). A typical example is the n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions.
  • 17. 2 n-dimensional spaces with large values of n are sometimes called high-dimensional spaces (Werke, 1876). Many familiar geometric objects can be expressed in some number of dimensions. For example, the two-dimensional triangle and the three-dimensional tetrahedron can be seen as specific instances of objects in n-dimensional space. In addition, the circle and the sphere are particular forms of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010). 1.1.1 What is hyperspectral data? When the spectral measurement is made using hundreds of narrow contiguous wavelength intervals, the captured image is called a hyperspectral image. Most commonly, the hyperspectral image is represented by a hyperspectral image cube (Figure 1.1). In this cube, the x and y axes specify the size of the image and the λ axis specifies the dimension, or the number of bands. Hyperspectral sensors collect information as a set of images, one for each band, and each image represents a range of the electromagnetic spectrum. Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006) These images are then combined to form a three-dimensional hyperspectral cube. As the dimension of HD is very high, it is comparable with a high-dimensional space, and HD shares the characteristics of high-dimensional spaces, which are described in the following section.
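In practice, the cube of Figure 1.1 is simply a three-dimensional array indexed by row, column and band, which is unfolded into a two-dimensional pixels-by-bands matrix before classification. The short sketch below is only an illustration of this layout; it is written in Python/NumPy for convenience rather than the MATLAB used later in the thesis, and it fills the cube with synthetic values (the 512 x 512 x 65 shape mirrors the pre-processed data subset described in Section 1.5).

```python
import numpy as np

# A synthetic hyperspectral cube: rows (x) by columns (y) by bands (lambda).
rows, cols, bands = 512, 512, 65
cube = np.random.rand(rows, cols, bands).astype(np.float32)

# One image of the scene in a single band (a rows x cols slice) ...
band_10 = cube[:, :, 10]

# ... and the full spectrum of a single pixel (a vector with one value per band).
spectrum = cube[200, 300, :]

# For classification the cube is unfolded into a 2-D matrix with one row per pixel
# and one column per band (its transpose, bands x pixels, matches the n x m
# convention used for the matrix X in Chapter 3).
pixels = cube.reshape(rows * cols, bands)
print(band_10.shape, spectrum.shape, pixels.shape)   # (512, 512) (65,) (262144, 65)
```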
  • 18. 3 1.1.2 Characteristics of high dimensional space High dimensional spaces, spaces with a dimensionality greater than three, have properties that differ substantially from our usual sense of distance, volume, and shape. In particular, in a high-dimensional Euclidean space, volume expands far more rapidly with increasing diameter compared to lower-dimensional spaces, so that, for example: (i) almost all of the volume within a high-dimensional hypersphere lies in a thin shell near its outer "surface"; (ii) the volume within a high-dimensional hypersphere relative to a hypercube of the same width tends to zero as the dimensionality tends to infinity, and almost all of the volume of the hypercube is concentrated in its "corners". The above characteristics have two important consequences for high dimensional data. The first is that high dimensional space is mostly empty. As a consequence, high dimensional data can be projected to a lower dimensional subspace without losing significant information in terms of separability among the different statistical classes (Jimenez and Landgrebe, 1995). The second consequence is that normally distributed data will have a tendency to concentrate in the tails; similarly, uniformly distributed data will be more likely to be collected in the corners, making density estimation more difficult. Local neighborhoods are almost empty, requiring the bandwidth of estimation to be large and producing the effect of losing detailed density estimation (Abhinav, 2009).
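Observation (ii) above can be checked directly. For a hypercube of side $d$, the inscribed hypersphere has radius $d/2$, so the fraction of the cube's volume it occupies is $\pi^{n/2} / (2^n\,\Gamma(n/2 + 1))$, independent of $d$. The minimal sketch below (Python is used here purely for illustration) evaluates this fraction and reproduces the trend shown in Figure 1.2.

```python
import math

def inscribed_sphere_fraction(n):
    """Fraction of an n-dimensional hypercube's volume occupied by the inscribed hypersphere.

    For a cube of side d the inscribed sphere has radius d/2, so the ratio
    pi^(n/2) (d/2)^n / Gamma(n/2 + 1) / d^n does not depend on d.
    """
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))

for n in (1, 2, 3, 5, 10, 20):
    print(n, inscribed_sphere_fraction(n))
# 1 -> 1.0, 2 -> 0.785, 3 -> 0.524, 5 -> 0.164, 10 -> 0.0025, 20 -> 2.5e-08
```

The fraction is already below 0.25% at n = 10, which is the sense in which high dimensional space is mostly empty.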
  • 19. 4 Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as the dimension increases (modified after Jimenez and Landgrebe, 1995). 1.1.3 Hyperspectral imaging Hyperspectral imaging collects and processes information using the electromagnetic spectrum. Hyperspectral imagery can discriminate between many types of earth objects which may appear as the same color to the human eye. Hyperspectral sensors look at objects using a vast portion of the electromagnetic spectrum. The whole process of hyperspectral imaging can be divided into three steps: preprocessing, radiance-to-reflectance transformation and data analysis (Varshney and Arora, 2004). In particular, preprocessing is required to convert the raw recorded values to sensor radiance. The preprocessing steps contain operations such as spectral calibration, geometric correction, geo-coding, signal-to-noise adjustment, etc. The radiometric and geometric accuracy of hyperspectral data differs significantly from one band to another (Varshney and Arora, 2004).
  • 20. 5    1.2 What is classification? Classification means to put data into groups according to their characteristics. In the case of spectral classification, the areas of the image that have similar spectral reflectance are put into same group or class (Abhinav, 2009). Classification is also seen as a means of compressing image data by reducing the large range of digital number (DN) in several spectral bands to a few classes in a single image. Classification reduces this large spectral space into relatively few regions and obviously results in loss of numerical information from the original image. Depending on the availability of information of the region which is imaged, supervised or unsupervised classification methods are performed. 1.2.1 Difficulties in hyperspectral data classification Though it is possible that HD can provide a high accuracy thematic map than MS data, there are some difficulties in classification in case of high dimensional data as listed below: 1. Curse of dimensionality and Hughes phenomenon: It says that when the dimensionality of data set increases with the number of bands, the number of training pixels (TP) required for training a specific classifier should be increased as well to achieve the desired accuracy for classification. It becomes very difficult and expensive to obtain large number of TP for each sub class. This has been termed as “curse of dimensionality” by Bellman (1960), which leads to the concept of “Hughes phenomenon” (Hughes, 1968). 2. Characteristics of high dimensional space: The characteristics of high dimensional space have been discussed in above section (Sec. 1.1.2). For those reasons, the algorithms that are used to classify the multispectral data often fail for hyperspectral data. 3. Large number of highly correlated bands: Hyperspectral sensor uses the large number of contiguous spectral bands. Therefore, among these bands, some bands are highly correlated. These correlated bands do not provide good result in classification. Therefore, the important task is to
  • 21. 6    select the uncorrelated bands or make the bands uncorrelated, applying feature reduction algorithms (Varshney and Arora, 2004). 4. Optimum number of feature: It is very critical to select the optimum number of bands out of large number of bands (e.g. 224 bands for AVIRIS image) to use in classification. Till today there are no suitable algorithms or any rule for selection of optimal number of features. 5. Large data size and high processing time due to complexity of classifier: Hyperspectral imaging system provides large amount of data. So large memory and powerful system is necessary to store and handle the data, generally which is very expensive. 1.3 Background of work This thesis work is the extension of work done by Abhinav Garg (2009) in his M.Tech thesis. In his thesis, he showed that among the conventional classifiers (gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER), GML provides the best result. The performance of GML is improved significantly after applying feature extraction (FE) techniques. Principal component analysis (PCA) was found to be working best, among all FE techniques (discriminant analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and independent component analysis(ICA)), in improving classification accuracy of GML. For the advance classifier, SVM’s result does not depend on the choice of parameters but ANN’s does. He also showed SVM’s result was improved by using PCA and ICA techniques while the supervised FE techniques like NWFE and DBFE failed to improve it significantly. He showed some drawbacks for advanced classifier like SVM and suggested some FE techniques which may improve the result for conventional classifier (CC) as well as advanced classifier (AC). However, for large TP (e.g. 300 per class) SVM takes more processing time than small size of TP. The objectives of this thesis work are to sort out these problems and to find the best FE technique, which will improve the classification result for HD. In next article, the objective of this thesis work has been described. .
  • 22. 7    1.4 Objectives This thesis has investigated the following two objectives pertaining to classification with hyperspectral data: Objective-1: To evaluate various FE techniques for classification of hyperspectral data. Objective-2 To study the extent to which advance classifier can reduce problems related to classification of hyperspectral data. 1.5 Study area and data set used The study area for this research is located within an area known as 'La Mancha Alta' covering approximately 8000 sq. km to the south of Madrid, Spain (Fig. 1.4). The area is mainly used for cultivation of wheat, barley and other crops such as vines and olives. HD is acquired by DAIS 7915 airborne imaging spectrometer on 29th June, 2000, at 5 m resolution. Data was collected over 79 wavebands ranging from 0.4 μm to 12.5 μm with an exception of 1.1 μm to 1.4 μm. The first 72 bands in the wavelength range 0.4 μm to 2.5 μm were selected for further analysis (Pal, 2002). Striping problems were observed between bands 41 and 72. All the 72 bands were visually examined and 7 bands (41, 42 and 68 to 72) were found useless due to very severe stripping and were removed. Finally 65 bands were retained and an area of 512 pixels by 512 pixels covering the area of interest was extracted (Abhinav, 2009). The data set available for this research work includes the 65 (retained after pre-processing) bands data and the reference image, generated with the help of field data collected by local farmers as briefed in Pal (2002). The area included in imagery was found to be divided into eight different land cover types, namely wheat, water body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built up area.  
  • 23. 8    Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002) Figure 1.4: FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area (Pal, 2002).
  • 24. 9 Figure 1.5: Google earth image of study area (Google earth, 2007) 1.6 Software details For the processing of HD, a very powerful system is required due to the size of the data set and the complexity of the algorithms. The machine used for this thesis work has a 2.16 GHz Intel processor with 2 GB RAM and runs the Windows 7 operating system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All the results were obtained on this same machine so that the different algorithms can be compared fairly. 1.7 Structure of thesis The present thesis is organized into six chapters. Chapter 1 focuses on the characteristics of high dimensional space, the challenges of HD classification and an outline of the experiments of this thesis work. It also discusses the study region, data set and software used in this thesis work. Chapter 2 presents a detailed description of HD classification and the previous research related to this domain. Chapter 3 describes the detailed mathematical background of the different processes used in this work. Chapter 4 outlines the detailed methodology carried out for this thesis work. Chapter 5 presents the experiments conducted for this thesis, followed by their interpretation. Chapter 6 provides the conclusions of the present work and the scope for future work.
  • 25. 10 CHAPTER 2 LITERATURE REVIEW This chapter outlines the important research work and major achievements in the field of high dimensional data analysis and data classification. The chapter begins with some of the FE techniques and classification approaches suggested by various researchers for solving problems related to HD classification. The results of useful experiments with HD are also included, in tabulated form, to highlight the usefulness and reliability of these approaches. Some other issues related to the classification of HD are discussed at the end of this chapter. 2.1 Dimensionality reduction by feature extraction Swain and Davis (1978) mentioned details of various separability measures for multivariate normal class models. Various statistical classes are found to be overlapping, which causes misclassification errors, as most classifiers use a decision boundary approach for classification. The idea was to obtain a separability measure which could give an overall estimate of the range of classification accuracies that can be achieved by using a sub-set of selected features, so that the sub-set of features corresponding to the highest classification accuracy can be selected for classification (Abhinav, 2009). FE is the process of transforming the given data from a higher dimensional space to a lower dimensional space while conserving the underlying information (Fukunaga, 1990). The philosophy behind such a transformation is to re-distribute the underlying information spread over the high dimensional space by containing it in a comparatively smaller number of dimensions without losing a significant amount of useful information. FE techniques, in the case of classification, try to enhance class separability while reducing data dimensionality (Abhinav, 2009).
  • 26. 11    2.1.1 Segmented principal component analysis (SPCA) The principal component transform (PCT) has been successfully applied in multispectral data for feature reduction. Also it can be used as the tool of image enhancement and digital change detection (Lodwick, 1979). For the case of dimension reduction of HD, PCA outperforms those FE techniques which are based on class statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited and ratio to the number of dimension is low for HD, class covariance matrix cannot be estimated properly. To overcome these problems Jia (1996) proposed the scheme for segmented principal component analysis (SPCA) which applies PCT on each of the highly correlated blocks of bands. This approach also reduces the processing time by converting the complete set of bands into several highly correlated bands. Jensen and James (1999) proposed that the SPCA-based compression generally outperforms PCA-based compression in terms of high detection and classification accuracy on decompressed HD. PCA works efficiently for the highly correlated data set but SPCA works efficiently for both high correlated as well as low correlated data sets (Jia, 1996). Jia (1996) compared SPCA and PCA extracted features for target detection and concluded SPCA as a better FE technique than PCA. She also showed that both feature extracted data sets are identical and there is no loss of variance in the middle stages, as long as no components are removed. 2.1.2 Projection pursuit (PP)   Projection pursuit (PP) methods were originally posed and experimented by Kruskal (1969, 1972). PP approach was implemented successfully first by Friedman and Tukey (1974). They described PP as a way of searching for and exploring nonlinear structure in multi-dimensional data by examining many 2-D projections. Their goal was to find interesting views of high dimensional data set. The next stages in the development of the technique were presented by Jones (1983) who, amongst other things, developed a projection index based on polynomial moments of the data. Huber (1985) presented several aspects of PP, including the design of projection indices. Friedman (1987) derived a transformed projection index. Hall (1989) developed an index using methods similar to Friedman, and also developed
  • 27. 12 theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b) introduced a projection index called the chi-square projection pursuit index. Posse (1995a, 1995b) used a random search method to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to get a sequence of interesting 2-D projections. Each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. Most recently, the PP technique has also been used to obtain 1-D projections (Martinez, 2005). In this research work, Posse's method, which reduces an n-dimensional data set to 2-dimensional projections, is followed. 2.1.3 Orthogonal subspace projection (OSP) Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP) method, which simultaneously reduces the data dimensionality, suppresses undesired or interfering spectral signatures, and detects the presence of a spectral signature of interest. The concept is to project each pixel vector onto a subspace which is orthogonal to the undesired signatures. For OSP to be effective, the number of bands must not be less than the number of signatures, which is a major limitation for multispectral images. To overcome this, Ren and Chang (2000) presented the Generalized OSP (GOSP) method, which relaxes this constraint in such a manner that OSP can be extended to multispectral image processing in an unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci, 2001) and also for magnetic resonance image classification (Wang et al., 2001). 2.1.4 Kernel principal component analysis (KPCA) Linear PCA cannot always detect all of the structure in a given data set. By using a suitable nonlinear feature extractor, more information can be extracted from the data set. Kernel principal component analysis (KPCA) can be used as a strong nonlinear FE method (Scholkopf and Smola, 2002) which maps the input vectors to a feature space; PCA is then applied to the mapped vectors. KPCA is also a powerful preprocessing step for classification algorithms (Mika et al., 1998). Rosipal et al. (2001) proposed the application of the KPCA technique for feature selection in a high-dimensional feature space where input variables were mapped by
  • 28. 13    a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of the higher-order statistics. To obtain this higher-order statistics, a large number of TP is required. This causes problems for KPCA, since KPCA requires storing and manipulating the kernel matrix whose size is the square of the number of TP. To overcome this problem, a new iterative algorithm for KPCA, the Kernel Hebbian Algorithm (KHA) was introduced by (Scholkopf et. al., 2005). 2.2 Parametric classifiers Parametric classifiers (Fukunaga, 1990) require some parameters to develop the assumed density function model for the given data. These parameters are computed with the help of a set of already classified or labeled data points called training data. It is a subset of given data for which the class labels are known and is chosen by sampling techniques (Abhinav, 2009). It is used to compute some class statistics to obtain the assumed density function for each class. Such classes are referred to as statistical classes (Richards and Jia, 2006) as these are dependent upon the training data and may differ from the actual classes. 2.2.1 Gaussian maximum likelihood (GML) Maximum likelihood method is based on the assumption that the frequency distribution of the class membership can be approximated by the multivariate normal probability distribution (Mather, 1987). Gaussian Maximum Likelihood (GML) is one of the most popular parametric classifiers that has been used conventionally for purpose of classification of remotely sensed data (Landgrebe, 2003). The advantages of GML classification method are that, it can obtain minimum classification error under the assumption that the spectral data of each class is normally distributed and it not only considers the class centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrix of the clusters (Lillesand et al., 2002). Lee and Landgrebe (1993) compared the result of GML classifier on PCA and DBFE feature extracted data set and concluded that DBFE feature extracted data set provides better accuracy than PCA feature extracted data set. NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor
  • 29. 14    and GML classifiers by Kuo and Landgrebe (2004). They concluded that NWFE is better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA, DAFE, DBFE and NWFE feature extracted data set on GML classifier. He showed that PCA is the best FE technique for HD among the other mentioned feature extractor for GML classifier.  He also suggested that some FE techniques like KPCA, OSP, SPCA, PP may improve the classification result using GML classifier. 2.3 Non–parametric classifiers The non–parametric classifiers (Fukunaga, 1990) uses some control parameters, carefully chosen by the user, to estimate the best fitting function by using an iterative or learning algorithm. They may or may not require any training data for estimating the PDF. Parzen window (Parzen, 1962) and k–nearest neighbor (KNN) (Cover and Hart, 1967) are two popular working classifiers under this category. Edward (1972) gave brief descriptions of many non-parametric approaches for estimation of data density functions. 2.3.1 KNN KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern recognition. The technique can achieve high classification accuracy in problems which have unknown and non-normal distributions. However, it has a major drawback that a large amount of TP is required in the classifiers resulting in high computational complexity for classification (Hwang and Wen, 1998). Pechenizkiy (2005) compared the performance of KNN classifier on the PCA and random projection (RP) feature extracted data set. He concluded that KNN performs well on PCA feature extracted data set. Zhu et. al. (2007) showed that the KNN works better on the ICA feature extracted data set than the original data set (OD) (OD was captured by Hyperspectral imaging system developed by the ISL). ICA- KNN method with a few wavelengths had the same performance as the KNN classifier alone using information from all wavelengths. Some more non–parametric classifiers based on geometrical approaches of data classification were found during literature survey. These approaches consider the data points to be located in the Euclidean space and exploit the geometrical patterns of the data points for classification. Such approaches are grouped into a new class of
  • 30. 15    classifiers known as machine learning techniques. Support Vector Machines (SVM) (Boser et al., 1992), k-nearest neighborhood (KNN) (Fix and Hudges, 1956) are among the popular classifiers of this kind. These do not make any assumptions regarding data density function or the discriminating functions and hence are purely non– parametric classifiers. However, these classifiers also need to be trained using the training data. 2.3.2 SVM SVM has been considered as advance classifier. SVM is a new generation of classification techniques based on Statistical Learning Theory having its origins in Machine Learning and introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995, 1998) discussed SVM based classification in detail. SVM tends to improve learning by empirical risk minimization (ERM) to minimize learning error and to minimize the upper bound on the overall expected classification error by structural risk minimization (SRM). SVM makes use of principle of optimal separation of classes to find a separating hyperplane that separates classes of interest to maximum extent by maximizing the margin between the classes (Vapnik, 1992). This technique is different from that of estimation of effective decision boundaries used by Bayesian classifiers as only data vectors near to the decision boundary (also known as support vectors) are required to find the optimal hyperplane. A linear hyperplane may not be enough to classify the given data set without error. In such cases, data is transformed to a higher dimensional space using a non–linear transformation that spreads the data apart such that a linear separating hyperplane may be found. Kernel functions are used to reduce the computational complexity that arises due to increased dimensionality (Varshney and Arora, 2004). Advantages of SVM (Varshney and Arora, 2004) lie in their high generalization capability and ability to adapt their learning characteristics by using kernel functions due to which they can adequately classify data on a high–dimensional feature space with a limited number of training data sets and are not affected by the Hughes phenomenon and other affects of dimensionality. The ability to classify using even limited number of training samples make SVM as a very powerful classification tool for remotely sensed data. Thus, SVM has the potential to produce accurate classifications from HD with limited number of training samples. SVMs are believed
  • 31. 16    to be better learning machines than neural networks, which tends to overfit classes causing misclassification (Abhinav, 2009), as they rely on margin maximization rather than finding a decision boundary directly from the training samples. For conventional SVM an optimizer is used based on quadratic programming (QP) or linear programming (LP) methods to solve the optimization problem. The major disadvantage of QP algorithm is the storage requirement of kernel matrix in the memory. When the size of the kernel matrix is large enough, it requires huge memory that may not be always available. To overcome this Benett and Campbell (2000) suggested an optimization method which sequentially updates the Lagrange multipliers called the kernel adatron (KA) algorithm. Another approach was decomposition method which updates the Lagrange multipliers in parallel since they update many parameters in each iteration unlike other methods that update parameter at a time (Varshney and Arora, 2004). QP optimizer is used here which updates lagrange multipliers on the fixed size working data set. Decomposition method uses QP or LP optimizer to solve the problem of huge data set by considering many small data sets rather than a single huge data set (Varshney, 2001). The sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of decomposition method when the size of working data set is fixed such that an analytical solution can be derived in very few numerical operations. This does not use the QP or LP optimization methods. This method needs more number of iterations but requires a small number of operations thus results in an increase in optimization speed for very large data set. The speed of SVM classification decreases as the number of support vectors (SV) decreases. By using kernel mapping, different SVM algorithms have successfully incorporated effective and flexible nonlinear models. There are some major difficulties for large data set due to calculation of nonlinear kernel matrix. To overcome the computational difficulties, some authors have proposed low rank approximation to the full kernel matrix (Wiens, 92). As an alternative, Lee and Mangasarian (2002) have proposed the method of reduced support vector machine (RSVM) which reduces the size of the kernel matrix. But there was a problem of selecting the number of support vectors (SV). In 2009, Sundaram proposed a method which will reduce the number of SV through the application of KPCA. This method is different from other
  • 32. 17    proposed method as the exact choice of support vector is not important as long as the vector spanned a fixed subspace. Benediktsson et al (2000) applied KPCA on the ROSIS-03 data set. Then he used linear SVM on the feature extracted data set and showed that KPCA features are more linearly separable than the features extracted by conventional PCA. Shah et al (2003) compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE FE techniques on AVIRIS data set and concluded that SVM  gives higher accuracies than GML and ANN for full dimensionality but poor accuracies for features extracted by DAFE and DBFE. Abhinav (2009) compared SVM, GML and ANN with OD and PCA, ICA, NWFE, DBFE, DAFE feature extracted data set. He concluded that SVM provides better result for OD than GML. SVM works best with PCA and ICA feature extracted data set where ANN works better with DBFE and NWFE feature extracted data set. The works done by various researchers with different hyperspectral data sets using different classifiers and FE methods and the results obtained by them is summarized in Table 2.1.  
  • 33. 18    Table 2.1: Summary of literature review Author Dataset used Method used Results obtained Lee and Landgrebe (1993) Field Spectrometer System (airborne hyperspectral sensor) GML classifier is used to compare classification accuracies obtained by DBFE and PCA FE Features extracted by DBFE produces better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods. Jimenez and Landgrebe (1998) Stimulated and real AVIRIS data Hyperspectral data characteristics were studied with respect to effects of dimensionality, order of data statistics used on supervised classification techniques. Hughes phenomenon was observed as an effect of dimensionality and classification accuracy was observed to be increasing with use of higher statistics order. But lower order statistics were observed to be less affected by Hughes phenomenon. Benediktsson et al (2001) ROSIS-03 KPCA and PCA feature extracted data set was used for classification using linear SVM. KPCA features are more linearly separable than features extracted by conventional PCA. Shah et al. (2003) AVIRIS Compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques SVM was found to be giving higher accuracies than GML and ANN for full dimensionality but poor accuracies were obtained for features extracted by DAFE and DBFE. Kuo and Landgrebe (2004) Stimulated and real data (HYDICE image of DC mall, Washington, US) NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor and GML classifiers. NWFE was found to be producing better classification accuracies than DAFE. Pechenizkiy (2005) 20 data sets with different characteristics were taken from the UCI machine learning repository. KNN classifier was used to compare classification accuracies obtained by PCA and Random Projection FE PCA gave the better result than Random Projection Zhu et al (2007) Hyperspectral imaging system developed by ISL. ICA ranking methods were used to select the optimal wave length the KNN was used. Then KNN alone was used. ICA-KNN method with a few band had the same performance as the KNN classifier alone using all bands. Sundaram (2009) The adult dataset ,part of UCI Machine Learning Repository KPCA was applied in the support vector, then usual SVM algorithm is used Significantly reduce the processing time without effecting the classification accuracy
  • 34. 19    Abhinav (2009) DAIS 7915 GML, SAM, MDM classification techniques were used on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data set GML was the best among the other techniques and performs best on PCA extracted data set. Abhinav (2009) DAIS 7915 SVM and GML classification techniques were used on the OD and PCA, ICA, NWFE, DBFE and DAFE feature extracted data set to compare the accuracy GML performed very low in OD than SVM. SVM provide better accuracy than GML. SVM performs better on PCA and ICA extracted data set. 2.4 Conclusions from literature review   1. From Table 2.1, it can be easily concluded that the FE techniques like PCA, ICA, DAFE, DBFE and NWFE perform well in improving the classification accuracies when used with GML. But the features extracted by DBFE and DAFE failed to improve results obtained by SVM implying a limitation of these techniques for the advance classifiers. KNN works best with PCA and ICA feature extracted data set. However, in the surveyed literature the effects of PP, SPCA, KPCA and OSP extracted features on classification accuracy obtained from the advance classifiers like SVM, parametric classifier like GML and nonparametric classifier KNN have not been observed. 2. Another important aspect found missing in the literature is the comparison of classification time for SVM classifiers because SVM takes long time for training using large TP. It was seen that many approach of SVM were proposed to reduce the classification time but there is no conclusion for the best SVM algorithm depending on classification accuracy and processing time. 3. Although KNN is effective classification technique for HD, there is no guideline for classification time or suggestion of best FE techniques for KNN classifier. Also the effect of different parameters like number of nearest neighbor, number of TP, number of bands is not suggested for KNN.
  • 35. 20    4. During the literature survey, it is further found that there is no suggestion for the best FE techniques for different SVM algorithms, GML and KNN. Such missing aspects will be investigated in this thesis work and the guidelines to choose an efficient and less time consuming classification technique shall be presented as the result of this research.     This chapter presented the FE and classification techniques for mitigating the effects of dimensionality. These techniques were result of different approaches used to deal with the problem of high dimensionality and improving performance of advance, parametric and nonparametric classifier. The approaches were applied on real life HD and comparative results as reported in literature were compiled and presented here. In addition, the important aspects found missing in the literature survey were highlighted which this thesis work shall try to investigate. The mathematical rationale and algorithms used to apply these techniques will be discussed in detail in the next chapter.  
  • 36. 21 CHAPTER 3 MATHEMATICAL BACKGROUND This chapter provides the detailed mathematical background of each of the techniques used in this thesis. Starting with some basic concepts of kernels and kernel space, the chapter describes the unsupervised and supervised FE techniques, followed by the classification and optimization rules for the supervised classifiers. Finally, the scheme for statistical analysis that has been used for comparing the results of different classification techniques is discussed. The notation followed in this chapter for matrices and vectors is given below:
$X$ : a two-dimensional matrix whose columns represent the data points ($m$) and rows represent the number of bands ($n$), i.e. $X = [X]_{n \times m}$.
$x_i$ : an $n$-dimensional single-pixel column vector, where $X = [x_1, x_2, \ldots, x_m]$ and $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]^T$.
$c_j$ : the $j$th class.
$\Phi(z)$ : the mapping of the input vector $z$ into kernel space, using some kernel function.
$\langle a, b \rangle$ : the inner product of the vectors $a$ and $b$.
$\in$ : belongs to.
$\mathbb{R}^n$ : the set of $n$-dimensional real vectors.
$N$ : the set of natural numbers.
$[\cdot]^T$ : the transpose of a matrix.
$\forall$ : for all.
3.1 What is kernel? Before defining a kernel, let us look at the following two definitions: • Input space: the space where the data points originally lie.
  • 37. 22 • Feature space: the space spanned by the transformed data points (from the original space) which were mapped by some function.
A kernel is the dot product in a feature space $H$ reached via a map $\Phi$ from the input space, such that $\Phi : X \to H$. A kernel can be defined as $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where $x, x'$ are elements of the input space, $\Phi(x), \Phi(x')$ are elements of the feature space, $k$ is called the kernel and $\Phi$ is called the feature map associated with $k$. $\Phi$ can also be called the kernel function. The space containing these dot products is called the kernel space. This is a nonlinear mapping from input space to feature space which increases the internal distance between two points in a data set. This means that a data set which is nonlinearly separable in the input space can become linearly separable in the kernel space. A few definitions related to kernels are given below:
Gram matrix: Given a kernel $k$ and inputs $x_1, x_2, \ldots, x_n \in X$, the $n \times n$ matrix $K := (k(x_i, x_j))_{ij}$ is called the Gram matrix of $k$ with respect to $x_1, x_2, \ldots, x_n$.
Positive definite matrix: A real $n \times n$ symmetric matrix $K$ satisfying $x_1^T K x_1 \geq 0$ for all $x_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T \in \mathbb{R}^n$ is called positive definite ($x_1$ is a column vector). If equality in the previous expression occurs only for $x_{11} = x_{21} = \ldots = x_{n1} = 0$, then the matrix is called strictly positive definite.
Positive definite kernel: Let $X$ be a nonempty set. A function $k : X \times X \to \mathbb{R}$ which, $\forall n \in N$ and $x_i \in X$, $i \in N$, gives rise to a positive definite Gram matrix is called a positive definite kernel. A function $k : X \times X \to \mathbb{R}$ which, $\forall n \in N$ and distinct $x_i \in X$, gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
  • 38. 23 Table 3.1: Examples of common kernel functions (modified after Varshney and Arora, 2004)
Linear: $K(x, x_i) = x \cdot x_i$; performance depends on whether the decision boundary is linear or non-linear.
Polynomial of degree $n$: $K(x, x_i) = (x \cdot x_i + 1)^n$, where $n$ is a positive integer; performance depends on the user-defined parameter.
Radial basis function: $K(x, x_i) = \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right)$, where $\sigma$ is a user-defined value; performance depends on the user-defined parameter.
Sigmoid: $K(x, x_i) = \tanh(k(x \cdot x_i) + \Theta)$, where $k$ and $\Theta$ are user-defined parameters; performance depends on the user-defined parameters.
All the above definitions are explained with the following simple example. Let
$$X = [x_1\; x_2\; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$$
be a matrix in the input space whose columns ($x_i$, $i = 1, 2, 3$) are the data points and whose rows are the dimensions of the data points. Let this matrix be mapped into the feature space using the Gaussian kernel function (with $\sigma = 1$), and let $\langle x_i, x_j \rangle$ denote the inner product of the columns of $X$ in that feature space. Then the Gram matrix (kernel matrix) $K$ takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
The numerical value of the matrix $K$ is
$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$
$K$ is a symmetric matrix. If the Gram matrix turns out to be positive definite, the kernel is called a positive definite kernel, and if it is strictly positive definite, the kernel is called a strictly positive definite kernel.
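The Gram matrix above is easy to reproduce. The sketch below (Python/NumPy, shown only for illustration; the thesis computations were done in MATLAB) applies the radial basis function kernel of Table 3.1 to the three columns of $X$; a kernel width of $\sigma = 1$ is assumed, since that is the value which yields the numbers quoted above.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (radial basis function) kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

# Columns of X are the data points x1, x2, x3; rows are the dimensions (bands).
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 3.0]])

m = X.shape[1]
K = np.array([[rbf_kernel(X[:, i], X[:, j]) for j in range(m)] for i in range(m)])
print(np.round(K, 4))
# [[1.     0.0498 0.0821]
#  [0.0498 1.     0.6065]
#  [0.0821 0.6065 1.    ]]
```

As expected, $K$ is symmetric with a unit diagonal, and because it is generated by an RBF kernel on distinct points it is positive definite.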
  • 39. 24 3.2 Feature extraction techniques FE techniques are based on the simple assumption that a given data sample ($x \in X \subset \mathbb{R}^n$), belonging to an unknown probability distribution in $n$-dimensional space, can be represented by some coordinate system in an $m$-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate system such that, when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. There are two main conditions for an optimal dimension reduction (Carreira-Perpinan, 1997): (i) elimination of dimensions with very low information content, since features with low information content can be discarded as noise; (ii) removal of redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors. Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). For the unsupervised approach, segmented principal component analysis (SPCA) and projection pursuit (PP) are used; for the supervised FE techniques, kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) are used. The next sub-sections discuss the assumptions used by these FE techniques in detail. Figure 3.1: Overview of FE methods
  • 40. 25 3.2.1 Segmented principal component analysis (SPCA) The principal component transform (PCT) has been successfully applied in multispectral data analysis and is a powerful tool for FE. For hyperspectral image data, PCT outperforms those FE techniques which are based on class statistics. The main advantage of using a PCT is that global statistics are used to determine the transform functions. Implementing the PCT on a high dimensional data set, however, carries a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996). The complete set of bands is first partitioned into $K$ subgroups according to the correlation between bands. From the correlation image of the HD, it can be seen that blocks are formed from highly correlated bands (Figure 3.2); these blocks are selected as the subgroups. Let $n_1, n_2, \ldots, n_K$ be the number of bands in subgroups $1, 2, \ldots, K$ respectively (Figure 3.2a). The PCT is then applied to each subgroup of the data. After applying the PCT to each subgroup, significant features are selected using the variance information of each component. The PCs which together contain about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.
  • 41. 26 Figure 3.2: Formation of blocks for SPCA. Here, three blocks containing 32, 6 and 27 bands respectively, corresponding to highly correlated bands, have been formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all of the variance, as with the conventional PCT. No information is lost whether the transformation is applied to the complete vector at once or to a few sub-vectors separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the middle stages compared to the case without segmentation. However, the difference in compression rate is relatively small if the segmented transformation is applied to subgroups that have poor correlation with each other.
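The segmentation and the per-block PCT described above can be summarised in a short sketch. The code below (Python/NumPy, illustrative only; the thesis implementation was in MATLAB) assumes the band blocks have already been identified from the correlation image, for example the 32, 6 and 27 band grouping of Figure 3.2, and it runs on a synthetic pixels-by-bands matrix: within each block it applies PCA and keeps the leading components that explain about 99% of the block variance, after which the retained features are regrouped. The optional second-stage transform on the regrouped features is omitted here.

```python
import numpy as np

def pca_block(X_block, var_keep=0.99):
    """PCA on one block of bands; returns the scores of the leading components
    whose cumulative variance reaches `var_keep` of the block's total variance."""
    Xc = X_block - X_block.mean(axis=0)            # centre each band
    cov = np.cov(Xc, rowvar=False)                 # band-by-band covariance of the block
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]               # re-order: largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    cum = np.cumsum(eigval) / np.sum(eigval)
    n_keep = int(np.searchsorted(cum, var_keep)) + 1
    return Xc @ eigvec[:, :n_keep]                 # principal component scores

def segmented_pca(X, block_sizes, var_keep=0.99):
    """Segmented PCA: apply the PCT block by block, then regroup the retained features."""
    features, start = [], 0
    for size in block_sizes:
        features.append(pca_block(X[:, start:start + size], var_keep))
        start += size
    return np.hstack(features)

# Synthetic stand-in for a pixels-by-bands matrix, with the block sizes of Figure 3.2.
X = np.random.rand(10000, 65)
Z = segmented_pca(X, block_sizes=[32, 6, 27])
print(Z.shape)        # (10000, number of retained features)
```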
  • 42. 27    Figure 3.2a: Chart of multilayered segmented PCA 3.2.2 Projection pursuit (PP) Projection pursuit (PP) refers to a technique first described by Friedman and Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by means of selected low dimensional linear projections. To reach this goal, an objective function is assigned, called projection index, to every projection characterizing the structure present in the projection. Interesting projections are then automatically picked up by optimizing the projection index numerically. The notion of interesting projections has usually been defined as the ones exhibiting departure from normality (normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985). Posse (1990) proposed an algorithm based on a random search and a chi- squared projection index for finding the most interesting plane (two-dimensional view). The optimization method was able to locate in general the global maximum of the projection index over all two-dimensional projections (Posse, 1995). The chi- squared index was efficient, being fast to compute and sensitive to departure from normality in the core rather than in the tail of the distribution. In this investigation only chi-squared (Posse, 1995a, 1995b) projection index has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:
(i) a projection pursuit index that measures the degree of departure from normality, and
(ii) a method for finding the projection that yields the highest value of the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to obtain a sequence of interesting 2-D projections. The interesting projections are found in decreasing order of the value of the PP index, which implies that each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. In the following discussion the chi-squared PP index is described first, followed by the structure finding procedure. Finally, the structure removal procedure is illustrated.

3.2.2.1 Posse chi-square index

Posse proposed an index based on the chi-square statistic. The plane is first divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, distributed in the form of rings (Figure 3.3). The inner boxes have the same radial width $R/5$ and all boxes have the same angular width of $45^{\circ}$. $R$ is chosen so that the boxes have approximately the same weight under normally distributed data, $R = (2\log 6)^{1/2}$, and the outer boxes have weight $1/48$ under normally distributed data. This choice of radial width provides regions with approximately the same probability under the standard bivariate normal distribution (Martinez, 2001). The projection index is given as

$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\!\left( z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)} \right) - c_k \right]^2 \qquad (3.1)$$

where
$\phi$ — the standard bivariate normal density,
$c_k$ — the probability over the kth region under the normal density, $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$,
$B_k$ — the kth box in the projection plane,
$\lambda_j$ — the angle $\lambda_j = j\pi/36$, $j = 0, \ldots, 8$, by which the data are rotated in the plane before being assigned to regions,
$\alpha, \beta$ — orthonormal p-dimensional vectors spanning the projection plane (these can be the first two PCs or two randomly chosen pixels of the OD set),
$P(\alpha, \beta)$ — the plane spanned by the two orthonormal vectors $\alpha, \beta$,
$z_i^{\alpha}, z_i^{\beta}$ — the sphered observations projected onto the vectors $\alpha$ and $\beta$, i.e. $z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$,
$\alpha(\lambda_j)$ — $\alpha \cos\lambda_j - \beta \sin\lambda_j$,
$\beta(\lambda_j)$ — $\alpha \sin\lambda_j + \beta \cos\lambda_j$,
$I_{B_k}$ — the indicator function of region $B_k$,
$PI_{\chi^2}(\alpha, \beta)$ — the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.

The chi-square projection index is not affected by the presence of outliers. However, it is sensitive to distributions that have a hole in the core, and it will also yield projections that contain clusters. The chi-square projection pursuit index is fast and easy to compute, making it appropriate for large sample sizes. Posse (1995a) provides a formula to approximate the percentiles of the chi-square index.
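A schematic numpy sketch of this index is given below. The exact box layout (five inner rings of width R/5 times eight 45° sectors plus eight outer sectors) and the Monte-Carlo estimation of the box weights c_k are assumptions made for illustration; the value of R follows the text above.

import numpy as np

RNG = np.random.default_rng(0)
R = np.sqrt(2 * np.log(6))          # assumed value of R; inner radial width is R/5

def box_index(z1, z2):
    """Map 2-D points to the 48 regions: 5 inner rings x 8 sectors of 45 deg, plus 8 outer sectors."""
    r = np.hypot(z1, z2)
    ring = np.minimum((r / (R / 5)).astype(int), 5)                     # 0..4 inner rings, 5 = outer
    sector = ((np.arctan2(z2, z1) % (2 * np.pi)) / (np.pi / 4)).astype(int) % 8
    return ring * 8 + sector                                            # box label 0..47

# Monte-Carlo estimate of the normal weights c_k of the 48 boxes
_sample = RNG.standard_normal((200000, 2))
_c = np.bincount(box_index(_sample[:, 0], _sample[:, 1]), minlength=48) / 200000.0

def chi2_index(Z, alpha, beta):
    """Posse chi-square projection index (Eq. 3.1) for sphered data Z (n x p)
    projected on the plane spanned by the orthonormal vectors alpha, beta."""
    n = Z.shape[0]
    pi = 0.0
    for lam in np.arange(9) * np.pi / 36:                # lambda_j = j*pi/36, j = 0..8
        a = alpha * np.cos(lam) - beta * np.sin(lam)
        b = alpha * np.sin(lam) + beta * np.cos(lam)
        counts = np.bincount(box_index(Z @ a, Z @ b), minlength=48) / n
        pi += np.sum((counts - _c) ** 2 / _c) / 9.0
    return pi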
Figure 3.3: Layout of the regions for the chi-square projection index; the inner rings have radial width R/5, each box spans 45°, and the outer boxes have weight 1/48 (modified after Posse, 1995a).

3.2.2.2 Finding the structure (PPEDA algorithm)

For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by

$$a_1 = \frac{\alpha^* + c\,v}{\left\| \alpha^* + c\,v \right\|}, \quad b_1 = \frac{\beta^* - (a_1^T \beta^*)\,a_1}{\left\| \beta^* - (a_1^T \beta^*)\,a_1 \right\|}, \qquad a_2 = \frac{\alpha^* - c\,v}{\left\| \alpha^* - c\,v \right\|}, \quad b_2 = \frac{\beta^* - (a_2^T \beta^*)\,a_2}{\left\| \beta^* - (a_2^T \beta^*)\,a_2 \right\|} \qquad (3.2)$$

where c is a scalar that determines the size of the neighborhood visited, and v is a unit p-vector uniformly distributed on the unit p-dimensional sphere.
The idea is to start with a global search and then concentrate on the region of the global maximum by decreasing the value of c. After a specified number of steps, called half, without an increase of the projection index, the value of c is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of m such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane. The steps of PPEDA are given below (a schematic implementation of this random search is sketched after the list):
1. Sphere the OD set; let Z be the matrix of the sphered data.
2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and $\beta_0$ are orthonormal. Consider this the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane $(\alpha^*, \beta^*)$.
7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.
8. If the index does not improve for a certain number of iterations (half), decrease the value of c by half.
9. Repeat steps 4 to 8 until c becomes some small number (say 0.01).
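The following sketch outlines the random search, reusing the chi2_index() function of the previous sketch. The parameter names (n_starts, c0, half, c_min) are illustrative, and the stopping logic is a simplified reading of steps 4-9, not the exact thesis implementation.

import numpy as np

def ppeda_search(Z, n_starts=4, c0=1.0, half=30, c_min=0.01, rng=None):
    """Random search for the plane maximizing the chi-square index (Sec. 3.2.2.2).
    Z is the sphered data matrix (n x p)."""
    rng = rng or np.random.default_rng()
    p = Z.shape[1]
    best_idx, best_plane = -np.inf, None
    for _ in range(n_starts):
        a = rng.standard_normal(p); a /= np.linalg.norm(a)
        b = rng.standard_normal(p); b -= (b @ a) * a; b /= np.linalg.norm(b)
        cur = chi2_index(Z, a, b)
        c, no_improve = c0, 0
        while c > c_min:
            v = rng.standard_normal(p); v /= np.linalg.norm(v)
            improved = False
            for sign in (+1, -1):                        # the two candidate planes of Eq. (3.2)
                a_c = a + sign * c * v; a_c /= np.linalg.norm(a_c)
                b_c = b - (a_c @ b) * a_c; b_c /= np.linalg.norm(b_c)
                idx = chi2_index(Z, a_c, b_c)
                if idx > cur:
                    a, b, cur, improved = a_c, b_c, idx, True
            no_improve = 0 if improved else no_improve + 1
            if no_improve >= half:                       # halve the neighbourhood size
                c, no_improve = c / 2.0, 0
        if cur > best_idx:
            best_idx, best_plane = cur, (a, b)
    return best_plane, best_idx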
3.2.2.3 Structure removal

There may be more than one interesting projection, and there may be other views that reveal insights about the hyperspectral data. To locate other views, Friedman (1987) proposed a method called structure removal. In this approach, the PP algorithm is first performed on the data set to obtain the structure, i.e. the optimal projection plane. The approach then removes the structure found at that projection and repeats the projection pursuit process to find a projection that yields another maximum value of the projection pursuit index. Proceeding in this manner gives a sequence of projections providing informative views of the data. The procedure repeatedly transforms the projected data to standard normal until they stop becoming more normal, as measured by the projection pursuit index. One starts with a $p \times p$ matrix whose first two rows are the vectors of the projection obtained from PPEDA; the remaining rows have 1 on the diagonal and 0 elsewhere. For example, if p = 4, then

$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$

The Gram-Schmidt orthonormalization process (Strang, 1988) makes the rows of $U^*$ orthonormal; let U be the orthonormal matrix obtained from $U^*$. The next step in the structure removal process is to transform the Z matrix using

$$T = U Z^T \qquad (3.4)$$

where T is a $p \times n$ matrix. With this transformation, the first two rows of T contain, for every observation, the projection onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then performed by applying a transformation $\Theta$ that transforms the first two rows of T to a standard normal and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows:

$$\Theta(T)_1 = \phi^{-1}\!\left[ F(T_1) \right], \qquad \Theta(T)_2 = \phi^{-1}\!\left[ F(T_2) \right], \qquad \Theta(T)_i = T_i, \quad i = 3, 4, \ldots, p \qquad (3.5)$$

where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$ and $T_2$ are the first two rows of the matrix T, and F is the function defined through Eqs. (3.7) and (3.8). From Eq. (3.5) it is seen that only the first two rows of T are changed. $T_1$ and $T_2$ can be written as

$$T_1 = \left( z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*} \right), \qquad T_2 = \left( z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*} \right) \qquad (3.6)$$
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the jth observation projected onto the plane spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as follows:

$$\tilde{z}_1^{(t)}(j) = z_1^{(t)}(j)\cos\gamma + z_2^{(t)}(j)\sin\gamma, \qquad \tilde{z}_2^{(t)}(j) = z_1^{(t)}(j)\sin\gamma - z_2^{(t)}(j)\cos\gamma \qquad (3.7)$$

where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_1^{(t)}(j)$ represents the jth element of $T_1$ at the tth iteration of the process. The following transformation is now applied to the rotated points; it replaces each rotated observation by its normal score in the projection:

$$z_1^{(t+1)}(j) = \phi^{-1}\!\left[ \frac{r\!\left( \tilde{z}_1^{(t)}(j) \right) - 0.5}{n} \right], \qquad z_2^{(t+1)}(j) = \phi^{-1}\!\left[ \frac{r\!\left( \tilde{z}_2^{(t)}(j) \right) - 0.5}{n} \right] \qquad (3.8)$$

where $r\!\left( \tilde{z}_1^{(t)}(j) \right)$ represents the rank of $\tilde{z}_1^{(t)}(j)$. With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed in this way, the data are transformed back using

$$Z'^{\,T} = U^T\, \Theta\!\left( U Z^T \right) \qquad (3.9)$$

From matrix theory (Strang, 1988), it is known that all directions orthogonal to the structure (i.e., all rows of T other than the first two) have not been changed, whereas the structure has been Gaussianized and then transformed back. A schematic implementation of this structure-removal step is sketched below; the next section summarizes the steps of PP.
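The following is a minimal sketch of the structure-removal transformation of Eqs. (3.3)-(3.9). It assumes the canonical basis rows used to complete U* are not degenerate with the found plane, and uses scipy for the ranks and the inverse normal CDF; function and parameter names are illustrative.

import numpy as np
from scipy.stats import norm, rankdata

def remove_structure(Z, alpha, beta, n_iter=10):
    """Z: n x p sphered data; alpha, beta: orthonormal p-vectors of the found plane."""
    n, p = Z.shape
    U = np.eye(p)
    U[0], U[1] = alpha, beta                 # first two rows span the structure plane (Eq. 3.3)
    for i in range(p):                       # Gram-Schmidt: make the rows of U orthonormal
        for j in range(i):
            U[i] -= (U[i] @ U[j]) * U[j]
        U[i] /= np.linalg.norm(U[i])
    T = U @ Z.T                              # Eq. (3.4): rows 1-2 hold the projected structure
    for _ in range(n_iter):
        for gamma in (0.0, np.pi / 4, np.pi / 8, 3 * np.pi / 8):
            t1 = T[0] * np.cos(gamma) + T[1] * np.sin(gamma)     # rotation, Eq. (3.7)
            t2 = T[0] * np.sin(gamma) - T[1] * np.cos(gamma)
            T[0] = norm.ppf((rankdata(t1) - 0.5) / n)            # normal scores, Eq. (3.8)
            T[1] = norm.ppf((rankdata(t2) - 0.5) / n)
    return (U.T @ T).T                       # Eq. (3.9): data with the structure Gaussianized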
3.2.2.4 Steps of PP

1. Load the data and set the values of the parameters: the number of best projection planes (N), the number of random starts in the neighborhood (m), and the values of c and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) using the Posse chi-square index (Section 3.2.2.2).
4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the desired number of best projection planes (orthogonal to each other) is obtained.

3.2.3 Kernel principal component analysis (KPCA)

Kernel principal component analysis (KPCA) means conducting the PCT in feature space (kernel space). KPCA operates on variables that are nonlinearly related to the input variables. In this section the KPCA algorithm is described by way of the PCA algorithm. First, m TP ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the covariance matrix

$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T \qquad (3.10)$$

The covariance matrix C is positive semi-definite; hence, non-negative eigenvalues are obtained from

$$\lambda v = C v \qquad (3.11)$$

For PCA, the eigenvalues are first sorted in decreasing order and the corresponding eigenvectors are found. The test points are then projected onto the eigenvectors; the PCs are obtained in this manner. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) into Eq. (3.11) gives

$$\lambda v = C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T v$$

Thus,
$$\lambda v = \frac{1}{m} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \qquad (3.12)$$

since $x x^T v = (x \cdot v)\, x$. In Eq. (3.12) the term $(x_j \cdot v)$ is a scalar. This means that all solutions v with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.

$$v = \sum_{i=1}^{m} \alpha_i x_i \qquad (3.13)$$

Steps for KPCA

1. First transform the TP to the feature space H using the mapping $\Phi$ associated with a kernel function. The data set ($\Phi(x_i)$, $i = 1, \ldots, m$) in feature space is assumed to be centered in order to reduce the complexity of the calculation. The covariance matrix in H then takes the form

$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^T \qquad (3.14)$$

2. Find the eigenvalues $\lambda \geq 0$ and the corresponding non-zero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix C from the equation

$$\lambda v = C v \qquad (3.15)$$

3. As shown previously (for PCA), all solutions v with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.

$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.16)$$

Therefore,

$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.17)$$

Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),

$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_j)\, \Phi(x_j)^T \Phi(x_i) \qquad (3.18)$$

4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Substituting this in Eq. (3.18), the following equation is obtained:
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_j)\, K(x_j, x_i) \qquad (3.19)$$

5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^T$ for all $k = 1, \ldots, m$. Define the $m \times m$ matrix K, called the kernel matrix, whose ijth element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length m whose jth element is the coefficient $\alpha_j$.

6. Eq. (3.19) can then be written as

$$m \lambda \sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^T \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_k)^T \Phi(x_j)\, \Phi(x_j)^T \Phi(x_i), \quad \forall\, k = 1, 2, \ldots, m \qquad (3.20)$$

Using $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, Eq. (3.20) can be transformed into

$$m \lambda K \alpha = K^2 \alpha \qquad (3.21)$$

To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved:

$$m \lambda \alpha = K \alpha \qquad (3.22)$$

7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel matrix K. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigenvalues of K and $\beta_1, \beta_2, \ldots, \beta_m$ the corresponding set of eigenvectors, with $\lambda_p$ being the last non-zero eigenvalue.
Figure 3.4: (a) Input points before kernel PCA; (b) output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).

8. To extract the principal components, the projection onto the eigenvectors $\beta_n$ in H ($n = 1, \ldots, p$) needs to be computed. Let x be a test point with image $\Phi(x)$ in H. Then

$$\beta_n^T \Phi(x) = \sum_{i=1}^{m} \beta_i^{\,n}\, \Phi(x_i)^T \Phi(x) = \sum_{i=1}^{m} \beta_i^{\,n}\, K(x_i, x) \qquad (3.23)$$

9. In the above algorithm it has been assumed that the data set is centered, but it is difficult to obtain the mean of the mapped data in feature space H (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in feature space. However, this can be handled by slightly modifying the equation for kernel PCA: the kernel matrix to be diagonalized is replaced by

$$\tilde{K} = K - 1_m K - K 1_m + 1_m K 1_m, \qquad \text{where } (1_m)_{ij} = \frac{1}{m} \ \ \forall\, i, j \qquad (3.24)$$

A compact sketch of the resulting KPCA procedure is given below.
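The sketch below follows Eqs. (3.22)-(3.24). The RBF kernel, the gamma parameter and the eigenvector normalization are illustrative assumptions rather than the settings used in the thesis experiments.

import numpy as np

def kpca(X, X_test, n_components=10, gamma=1.0):
    """Kernel PCA sketch. X: (m, d) training pixels; X_test: (t, d) test pixels."""
    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    m = X.shape[0]
    K = rbf(X, X)
    one_m = np.full((m, m), 1.0 / m)
    K_c = K - one_m @ K - K @ one_m + one_m @ K @ one_m      # centering, Eq. (3.24)
    eigvals, eigvecs = np.linalg.eigh(K_c)                   # solves m*lambda*alpha = K*alpha, Eq. (3.22)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))   # unit-length eigenvectors in feature space
    K_t = rbf(X_test, X)                                     # project test points, Eq. (3.23),
    one_t = np.full((K_t.shape[0], m), 1.0 / m)              # with the same centering applied
    K_tc = K_t - one_t @ K - K_t @ one_m + one_t @ K @ one_m
    return K_tc @ alphas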
Figure 3.5 provides an outline of the KPCA algorithm.

Figure 3.5: Outline of KPCA algorithm

3.2.4 Orthogonal subspace projection (OSP)

The idea of orthogonal subspace projection is to eliminate all unwanted or undesired spectral signatures (background) within a pixel, and then to use a matched filter to extract the desired spectral signature (endmember) present in that pixel.
3.2.4.1 Automated target generation process algorithm (ATGP)

In hyperspectral image analysis, a pixel may encompass many different materials; such pixels are called mixed pixels and contain multiple spectral signatures. Let a column vector $r_i$ represent the ith mixed pixel by the linear model

$$r_i = M \alpha_i + n_i \qquad (3.25)$$

where $r_i$ is an $l \times 1$ column vector representing the ith mixed pixel and l is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember. Assume that there are p spectrally distinct endmembers in the ith mixed pixel. M is an $l \times p$ matrix made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$, where $m_j$ is the spectral signature of the jth distinct material or endmember; the system is considered over-determined ($l > p$). Let $\alpha_i$ be a p-dimensional column vector $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^T$ whose jth element represents the fraction of the jth signature present in the ith mixed pixel. $n_i$ is an $l \times 1$ column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where I is an $l \times l$ identity matrix. In Eq. (3.25) the $r_i$ are thus a linear combination of the p endmembers with weight coefficients given by the fraction vector $\alpha_i$. The term $M\alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures, that is, to separate targets from background. When searching for a single spectral signature this can be written as

$$M \alpha = d\, \alpha_p + U \gamma \qquad (3.26)$$

where d is the $l \times 1$ column vector $m_p$, the desired signature of interest, and $\alpha_p$ is a scalar, the fraction of the desired signature. The matrix U is composed of the remaining column vectors of M; these are the undesired spectral signatures, or background information, given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components (fractions) of $\alpha$.
Suppose P is an operator that eliminates the effects of U, the undesired signatures. To do this, an orthogonal subspace operator has been developed that projects r onto a subspace orthogonal to the columns of U. This results in a vector that only contains energy associated with the target d and the noise n. The operator used is the $l \times l$ matrix

$$P = I - U (U^T U)^{-1} U^T \qquad (3.27)$$

The operator P maps d into a space orthogonal to the space spanned by the uninteresting signatures in U. Applying the operator P to the mixed pixel r of Eq. (3.25) gives

$$P r = \alpha_p P d + P U \gamma + P n \qquad (3.28)$$

It should be noticed that P operating on $U\gamma$ reduces the contribution of U to zero (close to zero in real data applications). Therefore, the above rearrangement gives

$$P r = \alpha_p P d + P n \qquad (3.29)$$

3.2.4.2 Signal-to-noise ratio (SNR) maximization

The second step in deriving the pixel classification operator is to find the $1 \times l$ operator $X^T$ that maximizes the SNR. Operating on Eq. (3.28) gives

$$X^T P r = \alpha_p X^T P d + X^T P U \gamma + X^T P n \qquad (3.30)$$

The operator $X^T$ acting on Pr produces a scalar (Ientilucci, 2001). The SNR is given by

$$\lambda = \frac{\alpha_p^2\, X^T P d\, d^T P^T X}{X^T P\, E[n n^T]\, P^T X} \qquad (3.31)$$

$$\lambda = \frac{\alpha_p^2}{\sigma^2} \left( \frac{X^T P d\, d^T P^T X}{X^T P P^T X} \right) \qquad (3.32)$$

where $E[\,]$ denotes the expected value. Maximization of this quotient is the generalized eigenvector problem

$$P d\, d^T P^T X = \lambda'\, P P^T X \qquad (3.33)$$
where $\lambda' = \lambda \left( \dfrac{\sigma^2}{\alpha_p^2} \right)$. The value of $X^T$ that maximizes $\lambda$ can be determined using the techniques outlined by Miller, Farison and Shin (1992) together with the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^T$ that maximizes the SNR is

$$X^T = k\, d^T \qquad (3.34)$$

where k is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$ vector

$$q^T = d^T P \qquad (3.35)$$

This operator first nulls the interfering signatures and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar that is a measure of the presence of the signature of interest. The ultimate aim is to reduce the l images that make up the hyperspectral image cube to a single image in which pixels with high intensity indicate the presence of the desired signature. The operator can easily be extended to seek out k signatures of interest: the vector operator simply becomes a $k \times l$ matrix operator given by

$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \qquad (3.36)$$

When the operator of Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Ultimately, the l-dimensional hyperspectral image reduces to a k-dimensional feature extracted image where pixels with high intensity indicate the presence of the desired signatures, with each band corresponding to one desired signature. A minimal implementation sketch of this operator is given below, followed by a worked example.
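The sketch below builds the operators of Eqs. (3.27), (3.35) and (3.36). The function names and the assumption that U has full column rank are for illustration only.

import numpy as np

def osp_operator(d, U):
    """Interference-rejection operator P (Eq. 3.27) and classification operator
    q^T = d^T P (Eq. 3.35) for a desired signature d and undesired signatures U."""
    l = d.shape[0]
    P = np.eye(l) - U @ np.linalg.inv(U.T @ U) @ U.T
    return P, d @ P                           # q as a length-l vector

def osp_features(image, M):
    """Apply Eq. (3.36): one OSP band per endmember. image: (n_pixels, l); M: (l, p)."""
    scores = []
    for j in range(M.shape[1]):
        d = M[:, j]
        U = np.delete(M, j, axis=1)           # the remaining columns are the background
        _, q = osp_operator(d, U)
        scores.append(image @ q)              # scalar score per pixel for this signature
    return np.stack(scores, axis=1)           # (n_pixels, p) feature-extracted image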
The above algorithm is illustrated with the following example. Let us start with three vectors or classes, each six elements (bands) long. The vectors are in reflectance units:

$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \qquad \text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \qquad \text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$

Suppose the image consists of 100 pixels, ordered from left to right, and that the 40th pixel looks like

$$pixel_{40} = 0.08\,(\text{concrete}) + 0.75\,(\text{tree}) + 0.07\,(\text{water}) + \text{noise} \qquad (3.37)$$

Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image; the same procedure can be followed to extract the tree and water materials. Assume that $pixel_{40}$ is made up of some weighted linear combination of the endmembers,

$$pixel_{40} = M \alpha + \text{noise} \qquad (3.38)$$

Now $M\alpha$ can be broken up into the desired, $d\alpha_p$, and undesired, $U\gamma$, signatures, and the desired (d) and undesired (U) signatures are assigned to spectra. Let concrete be the vector d, and let tree and water be the column vectors of the matrix U. The fractions of mixing are unknown to us, but it is known that $pixel_{40}$ is made up of some combination of d and U:

$$d = \left[ \text{concrete} \right], \qquad U = \left[ \text{tree} \ \ \text{water} \right]$$

Now the effect of U needs to be reduced. To do this, a projection operator P is required that, when operated on U, will reduce its contribution to zero. To find concrete, d, $pixel_{40}$ is projected onto a subspace that is orthogonal to the columns of U using the operator P. In other words, P maps d into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of U. If P is operated on U, which contains tree and water, it is seen that the effect of U is minimized:
$$P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad (3.39)$$

Now let $r_1 = pixel_{40}$ and n = noise; then, from Eq. (3.29),

$$P r_1 = \alpha_p P d + P n \qquad (3.40)$$

An operator $X^T$ now needs to be found that maximizes the signal-to-noise ratio (SNR). The operator $X^T$ acting on $Pr_1$ produces a scalar. As stated before, the value of $X^T$ that maximizes the SNR is $X^T = k\,d^T$. This leads to the overall OSP operator of Eq. (3.35). In this way the matrix Q of Eq. (3.36) can be formed; the entire data vector can then be projected along the columns of Q and the OSP feature extracted image is formed.

3.3 Supervised classifiers

This section describes the mathematical background of the supervised classifiers. First, the Bayesian decision rule is described, followed by the decision rule for the Gaussian maximum likelihood classifier (GML). Afterwards, the k-nearest neighbor (KNN) and support vector machine (SVM) classification rules are described.

3.3.1 Bayesian decision rule

In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are K classes; let $f_k(\mathbf{x})$ be the distribution function of the kth class, $1 \le k \le K$, and $P(c_k)$ the prior probability of the kth class, such that $\sum_{k=1}^{K} P(c_k) = 1$. For any class k, the posterior probability of a pixel vector $\mathbf{x}$ is denoted by $p(c_k \,|\, \mathbf{x})$ and defined by (assuming all classes are mutually exclusive)

$$p(c_k \,|\, \mathbf{x}) = \frac{P(\mathbf{x} \,|\, c_k)\, P(c_k)}{\sum_{j=1}^{K} f_j(\mathbf{x})\, P(c_j)} \qquad (3.41)$$
Therefore, the Bayes decision rule is

$$\mathbf{x} \in c_i \ \text{ if } \ p(c_i \,|\, \mathbf{x}) = \max_k\, p(c_k \,|\, \mathbf{x}) \qquad (3.41a)$$

3.3.2 Gaussian maximum likelihood classification (GML)

The Gaussian maximum likelihood classifier assumes that the distribution of the data points is Gaussian (normally distributed) and classifies an unknown pixel based on the variance and covariance of the spectral response patterns. The classification is based on the probability density function associated with the training data. Pixels are assigned to the most likely class based on a comparison of the posterior probabilities that they belong to each of the signatures being considered. Under this assumption, the distribution of a category response pattern can be completely described by the mean vector and the covariance matrix. With these parameters, the statistical probability of a given pixel value being a member of a particular land cover class can be computed (Lillesand et al., 2002). GML classification achieves the minimum classification error under the assumption that the spectral data of each class are normally distributed. It considers not only the cluster centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrices of the clusters. The decision boundary function for GML classification is

$$g_k(\mathbf{x}) = -\frac{1}{2} \left[ \ln \left| \hat{\Sigma}_k \right| + (\mathbf{x} - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x} - \hat{\mu}_k) \right] \qquad (3.42)$$

and the final Bayesian decision rule is

$$\mathbf{x} \in c_j \ \text{ if } \ g_j(\mathbf{x}) = \max_k\, g_k(\mathbf{x})$$

where $g_k(\mathbf{x})$ is the decision boundary function for the kth class. A minimal sketch of this classifier is given below.
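The sketch below applies the GML rule of Eq. (3.42) with equal prior probabilities assumed; function and variable names are illustrative, not the thesis code.

import numpy as np

def gml_train(X, y):
    """Estimate per-class mean and covariance from training pixels X (n, d) with labels y."""
    return {c: (X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)}

def gml_classify(X, stats):
    """Assign each pixel to the class maximizing g_k(x) of Eq. (3.42)."""
    classes = list(stats)
    G = np.empty((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        mu, cov = stats[c]
        diff = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # quadratic form (x - mu)^T Sigma^-1 (x - mu) evaluated row-wise
        G[:, j] = -0.5 * (logdet + np.einsum('ij,jk,ik->i', diff, inv, diff))
    return np.array(classes)[np.argmax(G, axis=1)]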
3.3.3 k-nearest neighbor classification

The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical application; one shortcoming is lazy learning, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951). The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant, and an unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.

Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is classified to the first class (3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified to the first class (6 squares vs. 5 triangles) (modified after Wikipedia, 2009).

Let x be an n-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) the n-dimensional TP. The Euclidean distance between them is defined by

$$d_i(x, y_i) = \sqrt{ (x_{11} - y_{i1})^2 + (x_{12} - y_{i2})^2 + \cdots + (x_{1n} - y_{in})^2 } \qquad (3.43)$$

where $x = (x_{11}, x_{12}, \ldots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{d_1, d_2, \ldots, d_p\}$, and p is the number of TP.
The final KNN decision rule is:

$$\mathbf{x} \in c_j \ \text{ if at least } \begin{cases} \dfrac{k}{2} + 1, & k \text{ even} \\[4pt] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \ \text{ of the } k \text{ minimum elements of } D \text{ correspond to class } c_j \qquad (3.44)$$

In the case of a tie, the test pixel is assigned to the class $c_j$ whose mean vector is at the minimum distance from the test pixel. Here k ($1 \le k \le p$) is a user-defined parameter that specifies the number of nearest neighbors chosen for classification. The outline of the KNN classification algorithm is given in Figure 3.7; a minimal sketch follows the figure.

Figure 3.7: Outline of KNN algorithm
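The sketch below implements the majority vote of Eq. (3.44) with the tie broken by the distance to the class mean vector, as described above; k and the variable names are illustrative.

import numpy as np

def knn_classify(test, train, labels, k=5):
    """Classify each test pixel by majority vote among its k nearest TP."""
    means = {c: train[labels == c].mean(axis=0) for c in np.unique(labels)}
    out = []
    for x in test:
        d = np.linalg.norm(train - x, axis=1)            # Euclidean distances, Eq. (3.43)
        nearest = labels[np.argsort(d)[:k]]              # labels of the k nearest TP
        cls, votes = np.unique(nearest, return_counts=True)
        winners = cls[votes == votes.max()]
        if len(winners) == 1:
            out.append(winners[0])
        else:                                            # tie: nearest class mean wins
            out.append(min(winners, key=lambda c: np.linalg.norm(means[c] - x)))
    return np.array(out)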
3.3.4 Support vector machine (SVM)

The foundations of support vector machines (SVM) were developed by Vapnik (1995). The formulation embodies the structural risk minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks: SRM minimizes an upper bound on the expected risk, whereas ERM minimizes the error on the training data. SVMs were developed to solve the classification problem, but they have recently been extended to the domain of regression problems (Vapnik et al., 1997). SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the class of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions: (i) all the data vectors that belong to the same class are placed on the same side of the separating hyperplane, and (ii) the distance between the two closest data vectors in the two classes is maximized (Vapnik, 1982). The main aim of SVM is thus to define an optimum hyperplane between two classes which maximizes the margin between the two classes. For each class, the data vectors forming the boundary of the classes are called the support vectors (SV), and the hyperplane is called the decision surface (Pal, 2002).

3.3.4.1 Statistical learning theory

The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known classes and for predicting the outcome of data points with unknown identity. Two induction principles are used: the first is ERM, whose aim is to reduce the training error, and the second is SRM, whose goal is to minimize the upper bound on the expected error over the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999). First, it does not depend on the unknown cumulative distribution function. Secondly, it can be minimized with respect to the parameters used in the decision rule.

3.3.4.2 Vapnik-Chervonenkis dimension (VC-dimension)

The VC-dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by h, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways.
For example, for an arbitrary two-class classification problem, the VC-dimension is the maximum number of points k which can be separated into two classes without error in all possible $2^k$ ways (Varshney and Arora, 2004). For instance, oriented lines in the plane can shatter any three points in general position but not four, so their VC-dimension is 3.

3.3.4.3 Support vector machine algorithm with quadratic optimization method (SVM_QP)

The procedure for obtaining a separating hyperplane by SVM is explained for a simple linearly separable case of two classes which can be separated by a hyperplane; it can be extended to the multiclass classification problem. The procedure can also be extended, through the kernel method for SVM, to the case where a hyperplane cannot separate the two classes. Let there be n training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, m is the dimension of the data vector, and each sample belongs to either of the two classes labelled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in m-dimensional space whose orientation is given by a vector w and whose location is determined by a scalar b, the offset of the hyperplane from the origin (Figure 3.8). If such a hyperplane exists, then the given set of training data points must satisfy the following inequalities:

$$w \cdot x_i + b \geq +1 \quad \forall\, i: y_i = +1 \qquad (3.45)$$

$$w \cdot x_i + b \leq -1 \quad \forall\, i: y_i = -1 \qquad (3.46)$$

Thus, the equation of the hyperplane is given by $w \cdot x_i + b = 0$.
Figure 3.8: Linear separating hyperplane for linearly separable data (modified after Gunn, 1998).

The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality:

$$y_i (w \cdot x_i + b) \geq 1 \qquad (3.47)$$

Thus, the decision rule for the linearly separable case can be defined in the following form:

$$x_i \in \operatorname{sign}(w \cdot x_i + b) \qquad (3.48)$$

where $\operatorname{sign}(\cdot)$ is the signum function, whose value is +1 for any element greater than or equal to zero and -1 if it is less than zero. The signum function can thus easily represent the two classes given by the labels +1 and -1. The separating hyperplane (Figure 3.8) will separate the two classes optimally when its margin from both classes is equal and maximum (Varshney, 2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation for a point x from the hyperplane defined by w and b. It is given by

$$D(x; w, b) = \frac{\left| w \cdot x + b \right|}{\left\| w \right\|_2} \qquad (3.49)$$

where $\left\| \cdot \right\|_2$ denotes the second norm, which is equivalent to the Euclidean length of the vector for which it is computed, and $\left| \cdot \right|$ is the absolute value. Let d be the value of the margin between the two separating planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$. The margin can be expressed as

$$d = \frac{w \cdot x + b + 1}{\left\| w \right\|_2} - \frac{w \cdot x + b - 1}{\left\| w \right\|_2} = \frac{2}{\left\| w \right\|_2} = \frac{2}{\sqrt{w^T w}} \qquad (3.49a)$$

To obtain the optimal hyperplane, the margin d should be maximized, i.e. $2 / \left\| w \right\|_2$ should be maximized, which is equivalent to minimizing the 2-norm of the vector w. Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to

$$\Phi(w) = \frac{1}{2} w^T w \qquad (3.50)$$

A constrained optimization problem can be constructed for minimizing the objective function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of constrained optimization problem, with a convex objective function of w and linear constraints, is called a primal problem and can be solved using standard quadratic programming (QP) optimization techniques. The QP optimization technique can be implemented by replacing the inequalities with a simpler form by transforming the problem into a dual space representation using Lagrange multipliers $\lambda_i$ (Luenberger, 1984). The vector w can be defined in terms of the Lagrange multipliers $\lambda_i$ as shown below:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.51)$$

The dual optimization problem obtained through the Lagrange multipliers $\lambda_i$ thus becomes

$$\max_{\lambda}\ L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, (x_i \cdot x_j) \qquad (3.52)$$

subject to the constraints

$$\sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.53)$$

$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \qquad (3.54)$$

The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero; the multipliers with nonzero values correspond to the SVs. The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^o = (\lambda_1^o, \lambda_2^o, \ldots, \lambda_{n_s}^o)$, where $n_s$ is the number of support vectors found. Substituting these into Eq. (3.51) gives the orientation of the optimal separating hyperplane, $w^o$, as

$$w^o = \sum_{i=1}^{n_s} \lambda_i^o y_i x_i \qquad (3.55)$$

The offset from the origin, $b^o$, is determined from

$$b^o = -\frac{1}{2} \left[ w^o \cdot x_{+1}^o + w^o \cdot x_{-1}^o \right] \qquad (3.56)$$

where $x_{+1}^o$ and $x_{-1}^o$ are support vectors of the class labels +1 and -1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and -1:

$$f(x) = \operatorname{sign}\!\left( \sum_{\text{support vectors}} \lambda_i^o\, y_i\, (x_i \cdot x) + b^o \right) \qquad (3.57)$$

Eq. (3.57) implies that

$$x \in \operatorname{sign}\!\left( \sum_{\text{support vectors}} \lambda_i^o\, y_i\, (x_i \cdot x) + b^o \right) \qquad (3.58)$$
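A minimal sketch of the dual problem of Eqs. (3.52)-(3.54) is given below. It uses scipy's SLSQP solver as a generic stand-in for the QP optimizer used in the thesis, and it assumes linearly separable training data so that at least one support vector exists in each class; all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def svm_qp_train(X, y, tol=1e-6):
    """Hard-margin SVM: solve the dual of Eqs. (3.52)-(3.54); y in {-1, +1}."""
    n = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T            # H_ij = y_i y_j (x_i . x_j)
    fun = lambda lam: 0.5 * lam @ H @ lam - lam.sum()    # minimize the negative of Eq. (3.52)
    jac = lambda lam: H @ lam - np.ones(n)
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n,                              # Eq. (3.54)
                   constraints={'type': 'eq', 'fun': lambda lam: lam @ y})  # Eq. (3.53)
    lam = res.x
    sv = lam > tol                                       # nonzero multipliers are the SVs
    w = ((lam * y)[:, None] * X).sum(axis=0)             # Eq. (3.55)
    x_pos = X[sv & (y == 1)][0]                          # one SV from each class for Eq. (3.56)
    x_neg = X[sv & (y == -1)][0]
    b = -0.5 * (w @ x_pos + w @ x_neg)
    return w, b, lam

def svm_predict(X, w, b):
    return np.sign(X @ w + b)                            # decision rule, Eq. (3.57)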