SUPERVISED LEARNING WITH HYPERSPECTRAL DATA

By
SOUMYADIP CHANDRA
(Y8103044)

Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Technology

DEPARTMENT OF CIVIL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
July 2010

ABSTRACT
Hyperspectral data (HD) can provide a much larger amount of spectral
information than multispectral data. However, it suffers from problems such as the
curse of dimensionality and data redundancy, and the data sets are very large.
Consequently, it is difficult to process these datasets and obtain satisfactory
classification results.
The objectives of this thesis are to find the best feature extraction (FE)
techniques and to improve the accuracy and time of classification of HD by
using parametric (Gaussian maximum likelihood (GML)), non-parametric (k-nearest
neighbor (KNN)) and support vector machine (SVM) algorithms. In
order to achieve these objectives, experiments were performed with different FE
techniques like segmented principal component analysis (SPCA), kernel principal
component analysis (KPCA), orthogonal subspace projection (OSP) and projection
pursuit (PP). DAIS-7915 hyperspectral sensor data set was used for investigations
in this thesis work.
From the experiments performed with the parametric and non-parametric
classifiers, the GML classifier was found to give the best results, with an overall
kappa value (k-value) of 95.89%. This was achieved by using 300 training pixels (TP)
per class and 45 bands of the SPCA feature extracted data set.
The SVM algorithm with the quadratic programming (QP) optimizer gave the best results
amongst all optimizers and approaches. An overall k-value of 96.91% was
achieved by using 300 TP per class and 20 bands of the SPCA feature extracted data
set. However, supervised FE techniques such as KPCA and OSP failed to improve the
results obtained by SVM significantly.
The best results obtained for GML, KNN and SVM were compared by
one-tailed hypothesis testing. It was found that the SVM classifier performed
significantly better than the GML classifier for a statistically large set of TP (300).
For statistically exact (100) and sufficient (200) sets of TP, the performance of SVM
on the SPCA extracted data set is statistically not better than the performance of the
GML classifier.
 
ACKNOWLEDGEMENTS
I express my deep gratitude to my thesis supervisor, Dr. Onkar Dikshit for
his involvement, motivation and encouragement throughout and beyond the thesis
work. His expert direction has inculcated in me qualities which I will treasure
throughout my life. His patient hearing and critical comments on the research
problem made me do better every time. His valuable suggestions at all stages of the
thesis work helped me overcome various shortcomings in my work.
I also express my sincere thanks for his effort in going through the
manuscript carefully and making it more readable. It has been a great learning
and life changing experience working with him.
I would like to express my sincere tribute to Dr. Bharat Lohani for his
friendly nature, excellent guidance and teaching during my stay at IITK.
I would especially like to thank Sumanta Pasari for his valuable
comments and corrections to the manuscript of my thesis.
I would like to thank all of my friends, especially Shalabh, Pankaj, Amar,
Saurabh, Chotu, Manash, Kunal, Avinash, Anand, Sharat, Geeta and all other GI
people, especially Shitlaji, Mauryaji and Mishraji, who made my stay a very joyous,
pleasant and memorable one.
In closing, I express my heartfelt gratitude to my parents and my best friend
for their unwavering support and encouragement to complete my study at IITK.
SOUMYADIP CHANDRA
July 2010
 
 
 
 
 
 
 
 
CONTENTS
CERTIFICATE………………………………………………………………………….. ii 
ABSTRACT........................................................................................................... iii 
ACKNOWLEDGEMENTS……………………………………………………………. iv 
CONTENTS………………………………………………………………………………...v
LIST OF TABLES………………………………………………………………………..ix
LIST OF FIGURES..................................................................................................x
LIST OF ABBREVIATIONS…………………………………………………………xiii
CHAPTER 1 - Introduction.........................................................................1
1.1 High dimensional space.......................................................................................2
1.1.1 What is hyperspectral data?.........................................................................2
1.1.2 Characteristics of high dimensional space..................................................3
1.1.3 Hyperspectral imaging .................................................................................4
1.2 What is classification? .........................................................................................5
1.2.1 Difficulties in hyperspectral data classification..........................................5
1.3 Background of work.............................................................................................6
1.4 Objectives .............................................................................................................7
1.5 Study area and data set used..............................................................................7
1.6 Software details ...................................................................................................9
 
1.7 Structure of thesis ...............................................................................................9
CHAPTER 2 – Literature Review........................................................10
2.1 Dimensionality reduction by feature extraction..................................................10
2.1.1 Segmented principal component analysis (SPCA)........................................11
2.1.2 Projection pursuit (PP) ...............................................................................11
2.1.3 Orthogonal subspace projection (OSP) .....................................................12
2.1.4 Kernel principal component analysis (KPCA) .........................................12
2.2 Parametric classifiers........................................................................................13
2.2.1 Gaussian maximum likelihood (GML).......................................................13
2.3 Non–parametric classifiers ..............................................................................14
2.3.1 KNN .............................................................................................................14
2.3.2 SVM..............................................................................................................15
2.4 Conclusions from literature review ..................................................................19
CHAPTER 3 – Mathematical Background...................................21
3.1 What is kernel? ..................................................................................................21
3.2 Feature extraction techniques ..........................................................................24
3.2.1 Segmented principal component analysis (SPCA)....................................25
3.2.2 Projection pursuit (PP) ...............................................................................27
3.2.3 Kernel principal component analysis (KPCA) ..........................................34
3.2.4 Orthogonal subspace projection (OSP) ......................................................38
 
3.3 Supervised classifier..........................................................................................43
3.3.1 Bayesian decision rule ................................................................................43
3.3.2 Gaussian maximum likelihood classification (GML):...............................44
3.3.3 k – nearest neighbor classification.............................................................44
3.3.4 Support vector machine (SVM): .................................................................46
3.4 Analysis of classification results.......................................................................58
3.4.1 One tailed hypothesis testing.....................................................................59
CHAPTER 4 - Experimental Design..................................................61
4.1 Feature extraction technique............................................................................62
4.1.1 SPCA ............................................................................................................62
4.1.2 PP .................................................................................................................62
4.1.3 KPCA............................................................................................................63
4.1.4 OSP...............................................................................................................64
4.2 Experimental design..........................................................................................64
4.3 First set of experiment (SET-I) using parametric and non-parametric
classifier........................................................................................................................66
4.4 Second set of experiment (SET-II) using advance classifier...............................67
4.5 Parameters......................................................................................................68
CHAPTER 5 - Results ....................................................................................69
5.1 Visual inspection of feature extraction techniques .........................................69
 
5.2 Results for parametric and non-parametric classifiers...................................75
5.2.1 Results of classification using GML classifier (GMLC) ...........................75
5.2.2 Class-wise comparison of result for GMLC...............................................81
5.2.3 Classification results using KNN classifier (KNNC) ................................82
5.2.4 Class wise comparison of results for KNNC .............................................91
5.3 Experiment results for SVM based classifiers.................................................92
5.3.1 Experiment results for SVM_QP algorithm..............................................93
5.3.2 Experiment results for SVM_SMO algorithm...........................................97
5.3.3 Experiment results for KPCA_SVM algorithm.......................................100
5.3.4 Class wise comparison of the best result of SVM ...................................103
5.3.5 Comparison of results for different SVM algorithms .............................104
5.4 Comparison of best results of different classifiers.........................................105
5.5 Ramifications of results...................................................................................107
CHAPTER 6 - Summary of Results and Conclusions .......109
6.1 Summary of results..........................................................................................109
6.2 Conclusions.......................................................................................................112
6.3 Recommendations for future work .................................................................112
REFERENCES………………………………………………….……………….115
APPENDIX A……………………………………………………………………..120 
 
 
LIST OF TABLES
Table Title Page
2.1 Summary of literature review 18
3.1 Examples of common kernel functions 23
4.1 List of parameters 68
5.1 The time taken for each FE technique 71
5.2 The best kappa values and z-statistic (at 5% significance level) for GML 80
5.3 Ranking of FE techniques and time required to obtain the best k-value 80
5.4 Classification with KNNC on OD and feature extracted data set 84
5.5 The best k-values and z-statistic for KNNC 89
5.6 Rank of FE techniques and time required to obtain the best k-value 90
5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data sets 95
5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data sets 100
5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets 104
5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms 106
5.11 Statistical comparison of different classifiers' results obtained for different data sets 107
5.12 Ranking of different classification algorithms depending on classification accuracy and time (Rank 1 indicates the best) 109
 
LIST OF FIGURES
Figure Title Page
1.1 Hyperspectral image cube 2
1.2 Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases 4
1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002) 8
1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area 8
1.5 Google earth image of study area 9
3.1 Overview of FE methods 24
3.2 Formation of blocks for SPCA 26
3.2a Chart of multilayered segmented PCA 27
3.3 Layout of the regions for the chi-square projection index 30
3.4 (a) Input points before kernel PCA (b) Output after kernel PCA; the three groups are distinguishable using the first component only 37
3.5 Outline of KPCA algorithm 38
3.6 KNN classification scheme 45
3.7 Outline of KNN algorithm 46
3.8 Linear separating hyperplane for linearly separable data 49
3.9 Non-linear mapping scheme 52
3.10 Brief description of SVM_QP algorithm 54
3.11 Overview of KPCA_SVM algorithm 58
3.12 Definitions and values used in applying one-tailed hypothesis testing 60
4.1 SPCA feature extraction method 62
4.2 Projection pursuit feature extraction method 63
4.3 KPCA feature extraction method 63
4.4 OSP feature extraction method 64
4.5 Overview of classification procedure 66
4.6 Experimental scheme for Set-I experiments 67
4.7 The experimental scheme for advanced classifier (Set-II) 68
5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively 70
5.2 Projection of the data points: (a) most interesting projection direction (b) second most interesting projection direction 71
5.3 First six Segmented Principal Components (SPCs); (b) shows water body and salt lake 72
5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP 72
5.5 First six features obtained by using eight end-members 73
5.6 Two components of most interesting projections 73
5.7 Correlation images after applying various feature extraction techniques 74
5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands 78
5.9 Comparison of kappa values and classification times for GML classification method 81
5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data sets with respect to different sets of TP 82
5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP 85
5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP 86
5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP 87
5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP 88
5.15 Time comparison for KNN classification: time for different bands at different neighbors for (a) 300 TP (b) 200 TP per class 91
5.16 Comparison of best k-value and classification time for original and feature extracted data sets 91
5.17 Class-wise accuracy comparison of OD and different feature extracted data for KNNC 92
5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer 94
5.19 Classification time comparison using 200 and 300 TP per class 97
5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer 100
5.21 Comparison of classification time for different sets of TP with respect to number of bands for SVM_SMO classification algorithm 101
5.22 Overall kappa values observed for classification of original and feature modified data sets using KPCA_SVM algorithm 103
5.23 Comparison of classification accuracy of individual classes for different SVM algorithms 105
 
LIST OF ABBREVIATIONS
AC          Advanced classifier
DAFE        Discriminant analysis feature extraction
DAIS        Digital airborne imaging spectrometer
DBFE        Decision boundary feature extraction
FE          Feature extraction
GML         Gaussian maximum likelihood
HD          Hyperspectral data
ICA         Independent component analysis
KNN         k-nearest neighbors
k-value     Kappa value
KPCA        Kernel principal component analysis
KPCA_SVM    Support vector machine with kernel principal component analysis
MS          Multispectral data
NWFE        Nonparametric weighted feature extraction
Ncri        Critical value
OD          Original data
OSP         Orthogonal subspace projection
PCA         Principal component analysis
PCT         Principal component transform
PP          Projection pursuit
rbf         Radial basis function
SPCA        Segmented principal component analysis
SV          Support vectors
SVM         Support vector machine
SVM_QP      Support vector machine with quadratic programming optimizer
SVM_SMO     Support vector machine with sequential minimal optimization
TP          Training pixels
Dedicated to my family & guide
 
CHAPTER 1
INTRODUCTION
Remote sensing technology has brought a new dimension to earth observation,
mapping and many other fields. At the beginning of this technology, multispectral
sensors were used for capturing data. Multispectral sensors capture data in a small
number of bands with broad wavelength intervals. Due to the small number of spectral
bands, their spectral resolution is insufficient to discriminate amongst many earth
objects. If, however, the spectral measurement is performed using hundreds of narrow
wavelength bands, then many earth objects can be characterized precisely. This is the
key concept of hyperspectral imagery.
Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger
information content, is voluminous and also differs in its characteristics. The extraction
of this huge amount of information from HD therefore remains a challenge, and cost
effective, computationally efficient procedures are required to classify HD. Data
classification is the categorization of data for its most effective and efficient use; the
desired outcome of classification is a high accuracy thematic map, and HD has the
potential to provide one.
This chapter introduces the concept of high dimensional space, HD and the
difficulties in classifying HD. The next part focuses on the objectives of the thesis,
followed by an overview of the data set used. Details of the software used are
mentioned in the next part of this chapter, followed by the structure of the thesis.
1.1 High dimensional space
In mathematics, an n-dimensional space is a topological space whose
dimension is n (where n is a fixed natural number). A typical example is n-dimensional
Euclidean space, which describes Euclidean geometry in n dimensions.
 
n-dimensional spaces with large values of n are sometimes called high-dimensional
spaces (Werke, 1876). Many familiar geometric objects can be generalized to an
arbitrary number of dimensions. For example, the two-dimensional triangle and the
three-dimensional tetrahedron can be seen as specific instances of more general
n-dimensional objects. In addition, the circle and the sphere are particular forms of the
n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010).
1.1.1 What is hyperspectral data?
When the spectral measurement is made using hundreds of narrow contiguous
wavelength intervals, the captured image is called a hyperspectral image. A
hyperspectral image is usually represented by a hyperspectral image cube (Figure 1.1).
In this cube, the x and y axes specify the spatial size of the image and the λ axis
specifies the dimension, or number of bands. A hyperspectral sensor collects
information as a set of images, one for each band, with each image representing a
narrow range of the electromagnetic spectrum.
Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006)
These images are then combined to form a three dimensional hyperspectral
cube. As the dimensionality of HD is very high, it is comparable to a high
dimensional space, and HD exhibits the same characteristics as high dimensional
spaces, which are described in the following section.
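As a simple illustration of this structure, consider the following minimal Python sketch (illustrative only, not code from the thesis, which used Matlab; the 512 x 512 x 65 dimensions are taken from the data set described in Section 1.5):

import numpy as np

# A hyperspectral cube as a 3-D array: x, y (spatial) and lambda (spectral) axes.
rows, cols, bands = 512, 512, 65
cube = np.zeros((rows, cols, bands), dtype=np.float32)

# The spectral signature of one pixel is a vector with one value per band.
signature = cube[100, 200, :]          # shape: (65,)

# For classification the cube is usually flattened to a (pixels x bands) matrix.
X = cube.reshape(-1, bands)            # shape: (262144, 65)
print(signature.shape, X.shape)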
 
1.1.2 Characteristics of high dimensional space
High dimensional spaces, i.e. spaces with a dimensionality greater than three,
have properties that differ substantially from our normal sense of distance,
volume and shape. In particular, in a high-dimensional Euclidean space, volume
expands far more rapidly with increasing diameter than in lower-dimensional
spaces, so that, for example:
(i). Almost all of the volume within a high-dimensional hypersphere lies in a thin
shell near its outer "surface"
(ii). The volume within a high-dimensional hypersphere relative to a hypercube of
the same width tends to zero as dimensionality tends to infinity, and almost all
of the volume of the hypercube is concentrated in its "corners".
The above mentioned characteristics have two important consequences for high
dimensional data. The first is that high dimensional space is mostly empty. As a
consequence, high dimensional data can be projected onto a lower dimensional
subspace without losing significant information in terms of separability among the
different statistical classes (Jimenez and Landgrebe, 1995). The second is that
normally distributed data will have a tendency to concentrate in the tails; similarly,
uniformly distributed data will be more likely to collect in the corners, making density
estimation more difficult. Local neighborhoods are almost empty, requiring the
bandwidth of estimation to be large and producing the effect of losing detailed density
estimation (Abhinav, 2009).
 
Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as dimension increases (Modified after Jimenez and Landgrebe, 1995)
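The trend shown in Figure 1.2 can be checked with a short sketch (illustrative Python, not from the thesis), using the closed-form ratio of the volume of an inscribed hypersphere to that of the enclosing hypercube:

import math

# Fraction of a hypercube's volume occupied by the inscribed hypersphere:
# f(n) = pi^(n/2) / (2^n * Gamma(n/2 + 1)), the quantity plotted in Figure 1.2.
for n in (1, 2, 3, 5, 10, 20):
    frac = math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))
    print(f"dimension {n:2d}: volume fraction = {frac:.6f}")

# The fraction tends to zero as n grows, i.e. almost all of the hypercube's
# volume lies in its corners, outside the hypersphere.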
1.1.3 Hyperspectral imaging
  Hyperspectral imaging collects and processes information from across the
electromagnetic spectrum. Hyperspectral imagery can differentiate between many
types of earth objects which may appear to be the same color to the human eye.
Hyperspectral sensors look at objects using a vast portion of the electromagnetic
spectrum. The whole process of hyperspectral imaging can be divided into three steps:
preprocessing, radiance to reflectance transformation and data analysis (Varshney
and Arora, 2004).
In particular, preprocessing is required to convert the raw radiance to sensor
radiance. The preprocessing steps include operations such as spectral calibration,
geometric correction, geo-coding and signal-to-noise adjustment. The radiometric and
geometric accuracy of hyperspectral data differs significantly from one band to
another (Varshney and Arora, 2004).
 
1.2 What is classification?
Classification means putting data into groups according to their characteristics.
In the case of spectral classification, the areas of the image that have similar spectral
reflectance are put into the same group or class (Abhinav, 2009). Classification can also
be seen as a means of compressing image data by reducing the large range of digital
numbers (DN) in several spectral bands to a few classes in a single image.
Classification reduces this large spectral space to relatively few regions and
obviously results in a loss of numerical information from the original image. Depending
on the availability of information about the imaged region, supervised or
unsupervised classification methods are used.
1.2.1 Difficulties in hyperspectral data classification
Although HD can provide a higher accuracy thematic map than
MS data, there are some difficulties in classifying high dimensional data,
as listed below:
1. Curse of dimensionality and Hughes phenomenon: As the
dimensionality of the data set increases with the number of bands, the
number of training pixels (TP) required to train a specific classifier
must also increase to achieve the desired classification accuracy (a small
illustration follows this list). It becomes very difficult and expensive to obtain a large
number of TP for each sub-class. This was termed the "curse of
dimensionality" by Bellman (1960), and leads to the concept of the "Hughes
phenomenon" (Hughes, 1968).
2. Characteristics of high dimensional space: The characteristics of high
dimensional space have been discussed in the section above (Sec. 1.1.2). For
these reasons, the algorithms that are used to classify multispectral
data often fail for hyperspectral data.
3. Large number of highly correlated bands: Hyperspectral sensors use
a large number of contiguous spectral bands, so some of these bands are
highly correlated. Correlated bands do not improve the classification result,
so an important task is to select the uncorrelated bands or to make the bands
uncorrelated by applying feature reduction algorithms (Varshney and Arora, 2004).
4. Optimum number of features: It is critical to select the optimum
number of bands, out of the large number available (e.g. 224 bands for an AVIRIS
image), to use in classification. To date there is no suitable algorithm or
rule for selecting the optimal number of features.
5. Large data size and high processing time due to complexity of
classifier: A hyperspectral imaging system provides a large amount of data, so
a large memory and a powerful system, which are generally very expensive, are
necessary to store and handle the data.
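To illustrate the point made in item 1 above, the following minimal sketch (illustrative Python, assuming normally distributed classes; not code from the thesis) shows that when the number of TP per class is smaller than the number of bands, the sample covariance matrix needed by a parametric classifier such as GML becomes singular and cannot be inverted:

import numpy as np

rng = np.random.default_rng(0)
n_bands = 65                       # e.g. the 65 retained DAIS-7915 bands
for n_tp in (25, 100, 300):        # training pixels per class
    samples = rng.normal(size=(n_tp, n_bands))
    cov = np.cov(samples, rowvar=False)          # sample covariance matrix
    rank = np.linalg.matrix_rank(cov)
    status = "singular" if rank < n_bands else "invertible"
    print(f"{n_tp:3d} TP -> covariance rank {rank} of {n_bands} ({status})")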
1.3 Background of work
This thesis work is an extension of the work done by Abhinav Garg (2009) in his
M.Tech thesis. In his thesis, he showed that among the conventional classifiers
(Gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER),
GML provides the best results. The performance of GML improved significantly
after applying feature extraction (FE) techniques. Principal component analysis
(PCA) was found to work best among all the FE techniques considered (discriminant
analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE)
and independent component analysis (ICA)) in improving the classification accuracy of GML.
For the advanced classifiers, SVM's results do not depend on the choice of
parameters but ANN's do. He also showed that SVM's results were improved by using
PCA and ICA, while supervised FE techniques such as NWFE and DBFE
failed to improve them significantly.
He also pointed out some drawbacks of advanced classifiers (AC) like SVM and
suggested some FE techniques which may improve the results for conventional
classifiers (CC) as well as AC. In particular, for a large number of TP (e.g. 300 per class)
SVM takes much more processing time than for a small number of TP. The objectives of
this thesis work are to address these problems and to find the best FE technique for
improving the classification results for HD. The objectives are described in the next
section.
 
1.4 Objectives
This thesis has investigated the following two objectives pertaining to
classification with hyperspectral data:
Objective-1:
To evaluate various FE techniques for classification of hyperspectral data.
Objective-2:
To study the extent to which advanced classifiers can reduce the problems related to
the classification of hyperspectral data.
1.5 Study area and data set used
The study area for this research is located within an area known as 'La
Mancha Alta', covering approximately 8000 sq. km to the south of Madrid, Spain (Fig.
1.3). The area is mainly used for the cultivation of wheat, barley and other crops such as
vines and olives. The HD was acquired by the DAIS 7915 airborne imaging spectrometer
on 29 June 2000 at 5 m spatial resolution.
Data were collected over 79 wavebands ranging from 0.4 μm to 12.5 μm, with the
exception of 1.1 μm to 1.4 μm. The first 72 bands, in the wavelength range 0.4 μm to
2.5 μm, were selected for further analysis (Pal, 2002). Striping problems were
observed between bands 41 and 72. All 72 bands were visually examined and 7
bands (41, 42 and 68 to 72) were found to be useless due to very severe striping and were
removed. Finally, 65 bands were retained and an area of 512 pixels by 512 pixels
covering the area of interest was extracted (Abhinav, 2009).
The data set available for this research work includes the 65 bands retained after
pre-processing and a reference image generated with the help of field data collected
from local farmers, as described in Pal (2002). The imaged area is divided into eight
different land cover types, namely wheat, water body, salt lake, hydrophytic
vegetation, vineyards, bare soil, pasture land and built-up area.
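The band selection just described can be written as a one-line sketch (illustrative Python, not the original pre-processing code):

# 72 candidate bands; bands 41, 42 and 68-72 dropped due to severe striping.
bad_bands = {41, 42, 68, 69, 70, 71, 72}
retained = [b for b in range(1, 73) if b not in bad_bands]
print(len(retained))               # 65 bands retained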
 
 
Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002)
Figure 1.4: FCC obtained by first 3 principal components and superimposed
reference image showing training data available for classes
identified for study area (Pal, 2002).
 
Figure 1.5: Google earth image of study area (Google earth, 2007)
1.6 Software details
Processing HD requires a very powerful system due to the size of the
data set and the complexity of the algorithms. The machine used for this thesis work
contains a 2.16 GHz Intel processor with 2 GB RAM, running the Windows 7 operating
system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All results
were obtained on the same machine so that the different algorithms could be compared.
1.7 Structure of thesis
The present thesis is organized into six chapters. Chapter 1 focuses on the
characteristics of high dimensional space, the challenges of HD classification and an
outline of the experiments of this thesis work. It also discusses the study region, the
data set and the software used in this thesis work. Chapter 2 presents a detailed
description of HD classification and the previous research work related to this domain.
Chapter 3 describes the detailed mathematical background of the different processes
used in this work. Chapter 4 outlines the detailed methodology carried out for this
thesis work. Chapter 5 presents the experiments conducted for this thesis, followed by
their interpretation. Chapter 6 provides the conclusions of the present work and the
scope for future work.
 
CHAPTER 2
LITERATURE REVIEW 
This chapter outlines the important research work and major achievements in
the field of high dimensional data analysis and data classification. The chapter begins
with some of the FE techniques and classification approaches suggested by various
researchers for solving problems related to HD classification. The results of relevant
experiments with HD are also included to highlight the usefulness and reliability of
these approaches; these results are presented in tabulated form. Some other issues
related to the classification of HD are discussed at the end of this chapter.
2.1 Dimensionality reduction by feature extraction
Swain and Davis (1978) described various separability measures for
multivariate normal class models. Statistical classes are often found to overlap, which
causes misclassification errors, as most classifiers use a decision boundary approach
for classification. The idea was to obtain a separability measure which could give an
overall estimate of the range of classification accuracies that can be achieved by using
a subset of selected features, so that the subset of features corresponding to the
highest classification accuracy can be selected for classification (Abhinav, 2009).
FE is the process of transforming the given data from a higher dimensional
space to a lower dimensional space while conserving the underlying information
(Fukunaga, 1990). The philosophy behind such a transformation is to re-distribute the
underlying information spread in the high dimensional space by containing it in a
comparatively smaller number of dimensions without losing a significant amount of
useful information. In the case of classification, FE techniques try to enhance class
separability while reducing data dimensionality (Abhinav, 2009).
 
2.1.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been applied successfully to
multispectral data for feature reduction. It can also be used as a tool for image
enhancement and digital change detection (Lodwick, 1979). For the dimension
reduction of HD, PCA outperforms those FE techniques which are based on class
statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited
and its ratio to the number of dimensions is low for HD, the class covariance matrix
cannot be estimated properly. To overcome these problems, Jia (1996) proposed the
scheme of segmented principal component analysis (SPCA), which applies the PCT to
each of several highly correlated blocks of bands. This approach also reduces the
processing time by partitioning the complete set of bands into several highly correlated
blocks. Jensen and James (1999) showed that SPCA-based compression generally
outperforms PCA-based compression in terms of detection and classification accuracy
on decompressed HD. PCA works efficiently for highly correlated data sets, but SPCA
works efficiently for both highly and weakly correlated data sets (Jia, 1996).
Jia (1996) compared SPCA and PCA extracted features for target detection and
concluded that SPCA is a better FE technique than PCA. She also showed that both
feature extracted data sets are identical and that there is no loss of variance in the
intermediate stages, as long as no components are removed.
2.1.2 Projection pursuit (PP)
  Projection pursuit (PP) methods were originally posed and experimented with by
Kruskal (1969, 1972). The PP approach was first implemented successfully by Friedman
and Tukey (1974). They described PP as a way of searching for and exploring
nonlinear structure in multi-dimensional data by examining many 2-D projections.
Their goal was to find interesting views of a high dimensional data set. The next stages
in the development of the technique were presented by Jones (1983) who, amongst
other things, developed a projection index based on polynomial moments of the data.
Huber (1985) presented several aspects of PP, including the design of projection
indices. Friedman (1987) derived a transformed projection index. Hall (1989)
developed an index using methods similar to Friedman's, and also developed
theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b)
introduced a projection index called the chi-square projection pursuit index. Posse
(1995a, 1995b) used a random search method to locate a plane with an optimal value
of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. Each projection found in this
manner shows a structure that is less important (in terms of the projection index)
than the previous one. More recently, the PP technique has also been used to obtain 1-D
projections (Martinez, 2005). In this research work, Posse's method, which reduces an
n-dimensional data set to 2-dimensional data, is followed.
2.1.3 Orthogonal subspace projection (OSP)
Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP)
method, which simultaneously reduces the data dimensionality, suppresses undesired
or interfering spectral signatures, and detects the presence of a spectral signature of
interest. The concept is to project each pixel vector onto a subspace which is
orthogonal to the undesired signatures. For OSP to be effective, the number of
bands must not be less than the number of signatures, which is a significant limitation
for multispectral images. To overcome this, Ren and Chang (2000)
presented the generalized OSP (GOSP) method, which relaxes this constraint in such a
manner that OSP can be extended to multispectral image processing in an
unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci,
2001) and also for magnetic resonance image classification (Wang et al., 2001).
2.1.4 Kernel principal component analysis (KPCA)
Linear PCA cannot always detect all the structure in a given data set. By the use of
a suitable nonlinear feature extractor, more information can be extracted from the data
set. Kernel principal component analysis (KPCA) can be used as a strong
nonlinear FE method (Scholkopf and Smola, 2002), which maps the input vectors to a
feature space and then applies PCA to the mapped vectors. KPCA is also a
powerful preprocessing step for classification algorithms (Mika et al., 1998).
Rosipal et al. (2001) proposed the application of the KPCA technique for feature
selection in a high-dimensional feature space where the input variables were mapped by
a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of
the higher-order statistics. To capture these higher-order statistics, a large number of
TP is required. This causes problems for KPCA, since KPCA requires storing and
manipulating the kernel matrix, whose size is the square of the number of TP. To
overcome this problem, a new iterative algorithm for KPCA, the kernel Hebbian
algorithm (KHA), was introduced by Scholkopf et al. (2005).
2.2 Parametric classifiers
Parametric classifiers (Fukunaga, 1990) require some parameters to develop
the assumed density function model for the given data. These parameters are
computed with the help of a set of already classified or labeled data points called
training data, which is a subset of the given data for which the class labels are known,
chosen by sampling techniques (Abhinav, 2009). The training data are used to compute
class statistics to obtain the assumed density function for each class. Such classes are
referred to as statistical classes (Richards and Jia, 2006), as they depend on the
training data and may differ from the actual classes.
2.2.1 Gaussian maximum likelihood (GML)
The maximum likelihood method is based on the assumption that the frequency
distribution of class membership can be approximated by the multivariate normal
probability distribution (Mather, 1987). Gaussian maximum likelihood (GML) is one
of the most popular parametric classifiers and has been used conventionally for the
classification of remotely sensed data (Landgrebe, 2003). The advantages of the GML
classification method are that it can achieve the minimum classification error under
the assumption that the spectral data of each class are normally distributed, and that
it considers not only the class centre but also its shape, size and orientation, by
calculating a statistical distance based on the mean values and covariance matrices of
the clusters (Lillesand et al., 2002).
Lee and Landgrebe (1993) compared the results of the GML classifier on PCA and
DBFE feature extracted data sets and concluded that the DBFE feature extracted data
set provides better accuracy than the PCA feature extracted data set. Kuo and
Landgrebe (2004) compared the NWFE and DAFE FE techniques in terms of the
classification accuracy achieved by nearest neighbor and GML classifiers, and concluded
that NWFE is a better FE technique than DAFE. Abhinav (2009) investigated the effect
of PCA, ICA, DAFE, DBFE and NWFE feature extracted data sets on the GML
classifier. He showed that, among these feature extractors, PCA is the best FE
technique for HD with the GML classifier. He also suggested that FE techniques such
as KPCA, OSP, SPCA and PP may improve the classification results obtained with the
GML classifier.
2.3 Non–parametric classifiers
Non-parametric classifiers (Fukunaga, 1990) use some control
parameters, carefully chosen by the user, to estimate the best fitting function by
using an iterative or learning algorithm. They may or may not require training
data for estimating the PDF. The Parzen window (Parzen, 1962) and k-nearest neighbor
(KNN) (Cover and Hart, 1967) are two popular classifiers in this
category. Edward (1972) gave brief descriptions of many non-parametric approaches
for the estimation of data density functions.
2.3.1 KNN
The KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern
recognition. The technique can achieve high classification accuracy in problems with
unknown and non-normal distributions. However, it has the major drawback that
a large number of TP is required by the classifier, resulting in high computational
complexity for classification (Hwang and Wen, 1998).
Pechenizkiy (2005) compared the performance of the KNN classifier on PCA
and random projection (RP) feature extracted data sets and concluded that KNN
performs better on the PCA feature extracted data set. Zhu et al. (2007) showed that
KNN works better on an ICA feature extracted data set than on the original data set
(OD), captured by a hyperspectral imaging system developed by the ISL; the ICA-KNN
method with a few wavelengths had the same performance as the KNN classifier alone
using information from all wavelengths.
Some more non-parametric classifiers based on geometrical approaches to data
classification were found during the literature survey. These approaches consider the
data points to be located in Euclidean space and exploit the geometrical patterns
of the data points for classification. Such approaches are grouped into a class of
classifiers known as machine learning techniques. Support vector machines (SVM)
(Boser et al., 1992) and k-nearest neighbor (KNN) (Fix and Hodges, 1951) are among
the popular classifiers of this kind. These make no assumptions regarding the data
density functions or the discriminating functions and hence are purely non-parametric
classifiers. However, these classifiers also need to be trained using the training data.
2.3.2 SVM
SVM is considered an advanced classifier. SVM is a new generation
classification technique based on statistical learning theory, with its origins in
machine learning, introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995,
1998) discussed SVM based classification in detail. SVM improves learning by
minimizing the learning error through empirical risk minimization (ERM) and by
minimizing the upper bound on the overall expected classification error through
structural risk minimization (SRM). SVM makes use of the principle of optimal
separation of classes to find a separating hyperplane that separates the classes of
interest by maximizing the margin between the classes (Vapnik, 1992). This technique
differs from the estimation of effective decision boundaries used by Bayesian
classifiers, as only data vectors near the decision boundary (known as support
vectors) are required to find the optimal hyperplane. A linear hyperplane may not be
enough to classify the given data set without error. In such cases, the data are
transformed to a higher dimensional space using a non-linear transformation that
spreads the data apart such that a linear separating hyperplane may be found. Kernel
functions are used to reduce the computational complexity that arises due to the
increased dimensionality (Varshney and Arora, 2004).
The advantages of SVM (Varshney and Arora, 2004) lie in its high generalization
capability and its ability to adapt its learning characteristics by using kernel functions,
due to which it can adequately classify data in a high-dimensional feature space
with a limited number of training samples and is not affected by the Hughes
phenomenon and other effects of dimensionality. The ability to classify using even a
limited number of training samples makes SVM a very powerful classification tool
for remotely sensed data. Thus, SVM has the potential to produce accurate
classifications from HD with a limited number of training samples. SVMs are believed
to be better learning machines than neural networks, which tend to overfit the classes
and cause misclassification (Abhinav, 2009), because they rely on margin maximization
rather than finding a decision boundary directly from the training samples.
For conventional SVM, an optimizer based on quadratic programming (QP) or
linear programming (LP) methods is used to solve the optimization problem. The
major disadvantage of the QP algorithm is the storage requirement of the kernel matrix
in memory: when the kernel matrix is large enough, it requires a huge amount of
memory that may not always be available. To overcome this, Bennett and Campbell
(2000) suggested an optimization method which sequentially updates the Lagrange
multipliers, called the kernel adatron (KA) algorithm. Another approach is the
decomposition method, which updates the Lagrange multipliers in parallel, updating
many parameters in each iteration unlike other methods that update one parameter at
a time (Varshney and Arora, 2004). The QP optimizer used here updates the Lagrange
multipliers on a fixed-size working data set. The decomposition method uses a QP or
LP optimizer to solve the problem of a huge data set by considering many small data
sets rather than a single huge data set (Varshney, 2001). The sequential minimal
optimization (SMO) algorithm (Platt, 1999) is a special case of the decomposition
method in which the size of the working data set is fixed such that an analytical
solution can be derived in very few numerical operations, without using QP or LP
optimization. This method needs a larger number of iterations but requires a small
number of operations per iteration, thus resulting in an increase in optimization
speed for very large data sets.
The speed of SVM classification decreases as the number of support vectors
(SV) increases. By using kernel mapping, different SVM algorithms have successfully
incorporated effective and flexible nonlinear models. However, there are major
difficulties for large data sets due to the calculation of the nonlinear kernel matrix. To
overcome these computational difficulties, some authors have proposed low rank
approximations to the full kernel matrix (Wiens, 92). As an alternative, Lee and
Mangasarian (2002) proposed the reduced support vector machine (RSVM), which
reduces the size of the kernel matrix, but this leaves the problem of selecting the
number of support vectors (SV). In 2009, Sundaram proposed a method which reduces
the number of SV through the application of KPCA. This method differs from other
proposed methods in that the exact choice of support vectors is not important as long
as the vectors span a fixed subspace.
Benediktsson et al. (2000) applied KPCA to the ROSIS-03 data set, then used a
linear SVM on the feature extracted data set and showed that KPCA features
are more linearly separable than the features extracted by conventional PCA. Shah et
al. (2003) compared SVM, GML and ANN classifiers for accuracies at full
dimensionality and using DAFE and DBFE FE techniques on an AVIRIS data set, and
concluded that SVM gives higher accuracies than GML and ANN at full
dimensionality but poorer accuracies for features extracted by DAFE and DBFE.
Abhinav (2009) compared SVM, GML and ANN on the OD and on PCA, ICA, NWFE,
DBFE and DAFE feature extracted data sets. He concluded that SVM provides better
results on the OD than GML; SVM works best with PCA and ICA feature extracted data
sets, whereas ANN works better with DBFE and NWFE feature extracted data sets.
The work done by various researchers with different hyperspectral data sets,
using different classifiers and FE methods, and the results obtained by them are
summarized in Table 2.1.
 
 
Table 2.1: Summary of literature review
Author | Dataset used | Method used | Results obtained
Lee and Landgrebe (1993) | Field Spectrometer System (airborne hyperspectral sensor) | GML classifier used to compare classification accuracies obtained by DBFE and PCA FE | Features extracted by DBFE produce better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods
Jimenez and Landgrebe (1998) | Simulated and real AVIRIS data | Hyperspectral data characteristics studied with respect to the effects of dimensionality and the order of data statistics used in supervised classification techniques | The Hughes phenomenon was observed as an effect of dimensionality, and classification accuracy increased with the use of higher order statistics; lower order statistics were less affected by the Hughes phenomenon
Benediktsson et al. (2001) | ROSIS-03 | KPCA and PCA feature extracted data sets used for classification with a linear SVM | KPCA features are more linearly separable than features extracted by conventional PCA
Shah et al. (2003) | AVIRIS | SVM, GML and ANN classifiers compared for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques | SVM gave higher accuracies than GML and ANN at full dimensionality but poor accuracies for features extracted by DAFE and DBFE
Kuo and Landgrebe (2004) | Simulated and real data (HYDICE image of DC Mall, Washington, US) | NWFE and DAFE FE techniques compared for classification accuracy achieved by nearest neighbor and GML classifiers | NWFE produced better classification accuracies than DAFE
Pechenizkiy (2005) | 20 data sets with different characteristics taken from the UCI machine learning repository | KNN classifier used to compare classification accuracies obtained by PCA and random projection FE | PCA gave better results than random projection
Zhu et al. (2007) | Hyperspectral imaging system developed by the ISL | ICA ranking methods used to select the optimal wavelengths for KNN; KNN alone was then used for comparison | The ICA-KNN method with a few bands had the same performance as the KNN classifier alone using all bands
Sundaram (2009) | The adult dataset, part of the UCI Machine Learning Repository | KPCA applied to the support vectors, then the usual SVM algorithm used | Significantly reduced processing time without affecting classification accuracy
Abhinav (2009) | DAIS 7915 | GML, SAM and MDM classification techniques used on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets | GML was the best among these techniques and performs best on the PCA extracted data set
Abhinav (2009) | DAIS 7915 | SVM and GML classification techniques used on the OD and on PCA, ICA, NWFE, DBFE and DAFE feature extracted data sets to compare accuracy | GML performed much worse than SVM on the OD; SVM provides better accuracy than GML and performs better on PCA and ICA extracted data sets
2.4 Conclusions from literature review
 
1. From Table 2.1, it can easily be concluded that FE techniques like PCA,
ICA, DAFE, DBFE and NWFE perform well in improving classification
accuracy when used with GML. However, the features extracted by DBFE and
DAFE failed to improve the results obtained by SVM, implying a limitation of these
techniques for advanced classifiers. KNN works best with PCA and ICA
feature extracted data sets. However, in the surveyed literature the effects of
PP, SPCA, KPCA and OSP extracted features on the classification accuracy
obtained from advanced classifiers like SVM, parametric classifiers like GML
and the non-parametric classifier KNN have not been studied.
2. Another important aspect found missing in the literature is a comparison of
classification times for SVM classifiers, since SVM takes a long time to train
with a large number of TP. Many SVM approaches have been proposed to reduce
the classification time, but there is no conclusion on the best SVM algorithm in
terms of classification accuracy and processing time.
3. Although KNN is an effective classification technique for HD, there is no guideline
on classification time or suggestion of the best FE technique for the KNN classifier.
The effects of parameters such as the number of nearest neighbors, the number of
TP and the number of bands are also not reported for KNN.
 
4. During the literature survey, it was further found that there is no suggestion of
the best FE technique for the different SVM algorithms, GML and KNN.
These missing aspects will be investigated in this thesis work, and guidelines
for choosing an efficient and less time consuming classification technique will be
presented as the result of this research.
This chapter presented the FE and classification techniques for mitigating the
effects of dimensionality. These techniques resulted from different approaches used
to deal with the problem of high dimensionality and to improve the performance of
advanced, parametric and non-parametric classifiers. The approaches were applied to
real-life HD, and the comparative results reported in the literature were compiled and
presented here. In addition, the important aspects found missing in the literature
survey, which this thesis work will try to investigate, were highlighted. The
mathematical rationale and algorithms used to apply these techniques are discussed
in detail in the next chapter.
 
 
CHAPTER 3
MATHEMATICAL BACKGROUND
This chapter provides the detailed mathematical background of each of the
techniques used in this thesis. Starting with some basic concepts of kernels and
kernel space, it describes the unsupervised and supervised FE techniques, followed by
the classification and optimization rules for the supervised classifiers. Finally, the
scheme for statistical analysis used for comparing the results of the different
classification techniques is discussed.
The notation followed in this chapter for matrices and vectors is given below:
X          A two dimensional matrix whose columns represent the data points (m) and whose rows represent the number of bands (n), i.e. X is an n × m matrix.
xi         An n-dimensional single-pixel column vector, where X = [x1, x2, ..., xm] and xi = [x1i, x2i, ..., xni]^T.
cj         Represents the jth class.
Φ(z)       Mapping of the input vector z into kernel space, using some kernel function.
⟨a, b⟩     Inner product of the vectors a and b.
∈          Belongs to.
R^n        Set of n-dimensional real vectors.
N          Set of natural numbers.
[·]^T      Transpose of a matrix.
∀          For all.
3.1 What is kernel?
  Before defining a kernel, consider the following two definitions:
• Input space: the space where the original data points lie.
• Feature space: the space spanned by the transformed data points (from the
original space) which were mapped by some function.
A kernel is the dot product in a feature space H reached via a map Φ from the input
space, such that Φ: X → H. A kernel can be defined as k(x, x') = ⟨Φ(x), Φ(x')⟩, where
x, x' are elements of the input space and Φ(x), Φ(x') are elements of the feature space;
k is called the kernel and Φ is called the feature map associated with k (Φ can also be
called the kernel function). The space containing these dot products is called the
kernel space. This is a nonlinear mapping from the input space to the feature space
which increases the internal distance between two points in a data set. This means
that a data set which is nonlinearly separable in the input space becomes linearly
separable in the kernel space. A few definitions related to kernels are given below:
Gram matrix: Given a kernel k and inputs x1, x2, ..., xn ∈ X, the n × n matrix
K := (k(xi, xj))ij is called the Gram matrix of k with respect to x1, x2, ..., xn.
Positive definite matrix: A real n × n symmetric matrix K satisfying c^T K c ≥ 0 for
all column vectors c = (c1, c2, ..., cn)^T ∈ R^n is called positive definite. If equality
occurs only for c1 = c2 = ... = cn = 0, then the matrix is called strictly positive definite.
Positive definite kernel: Let X be a nonempty set. A function k: X × X → R which,
for all n ∈ N and all xi ∈ X, i = 1, ..., n, gives rise to a positive definite Gram matrix
is called a positive definite kernel. A function k: X × X → R which, for all n ∈ N and
all distinct xi ∈ X, gives rise to a strictly positive definite Gram matrix is called a
strictly positive definite kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
 
Table 3.1: Examples of common kernel functions (Modified after Varshney and Arora, 2004)
Kernel function type | Definition K(x, xi) | Parameters | Performance depends on
Linear | x · xi | none | Decision boundary, either linear or non-linear
Polynomial with degree n | (x · xi + 1)^n | n is a positive integer | User defined parameters
Radial basis function | exp(-||x - xi||^2 / (2σ^2)) | σ is a user defined value | User defined parameters
Sigmoid | tanh(k(x · xi) + Θ) | k and Θ are user defined parameters | User defined parameters
All of the above definitions are illustrated with the following simple example.
Let
$$X = [x_1 \; x_2 \; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$$
be a matrix in input space whose columns $x_i$ ($i = 1, 2, 3$) denote the data points and whose rows denote the dimensions of the data points. Let this matrix be mapped into the feature space using a Gaussian kernel function, and let $\langle x_i, x_j \rangle$ denote the inner product of the columns of $X$ in that feature space.
Then the Gram matrix (kernel matrix) $K$ takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
and the numerical value of the matrix $K$ is
$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$
$K$ is a symmetric matrix. If $K$ turns out to be positive definite, the kernel is called a positive definite kernel, and if $K$ is strictly positive definite, the kernel is called a strictly positive definite kernel.
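As an illustration of the above example, a minimal Python sketch of how such a Gram matrix can be computed and its positive definiteness checked is given below. The bandwidth $\sigma$ of the Gaussian kernel is not stated in the example above, so the value used here ($\sigma = 1$) is only an assumption, and the resulting numbers need not match those quoted.

```python
import numpy as np

# Columns of X are the data points x1, x2, x3 from the example above.
X = np.array([[1, 2, 1],
              [2, 1, 3],
              [1, 1, 3]], dtype=float)

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel k(xi, xj) = exp(-||xi - xj||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

m = X.shape[1]
K = np.array([[gaussian_kernel(X[:, i], X[:, j]) for j in range(m)]
              for i in range(m)])

print(np.round(K, 4))                  # symmetric Gram matrix
eigvals = np.linalg.eigvalsh(K)        # real eigenvalues of the symmetric K
# Non-negative eigenvalues indicate a positive definite kernel in the
# sense defined above; strictly positive eigenvalues indicate a strictly
# positive definite kernel.
print("positive definite:", np.all(eigvals >= 0))
```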
3.2 Feature extraction techniques
FE techniques are based on the simple assumption that a given data sample $x \in X$ ($X \subset R^n$), belonging to an unknown probability distribution in $n$-dimensional space, can be represented in some coordinate system in an $m$-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate system such that, when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. There are two main conditions for an optimal dimension reduction (Carreira-Perpinan, 1997):
(i) Eliminate dimensions with very low information content; features with low information content can be discarded as noise.
(ii) Remove redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors.
Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). For the unsupervised approach, segmented principal component analysis (SPCA) and projection pursuit (PP) are used, and for the supervised approach, kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) are used. The next sub-sections discuss the assumptions made by these FE techniques in detail.
Figure 3.1: Overview of FE methods
3.2.1 Segmented principal component analysis (SPCA)
The principal component transform (PCT) has been successfully applied to multispectral data analysis and is a powerful tool for FE. For hyperspectral image data, PCT outperforms FE techniques that are based on class statistics. The main advantage of using PCT is that global statistics are used to determine the transform functions. However, implementing PCT on a high dimensional data set requires a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996).
The complete data set is first partitioned into K subgroups with respect to the correlation of bands. From the correlation image of HD, it can be seen that blocks are formed by highly correlated bands (Figure 3.2); these blocks are selected as the subgroups. Let $n_1$, $n_2$ and $n_k$ be the number of bands in subgroups 1, 2 and $k$ respectively (Figure 3.2a). PCT is then applied to each subgroup of data, and significant features are selected using the variance information of each component. The PCs containing about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.
Figure 3.2: Formation of blocks for SPCA. Here, 3 blocks, containing 32, 6 and 27
bands respectively, corresponding to highly correlated bands have been
formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all the variance, as with the conventional PCT. No information is lost whether the transformation is conducted on the complete vector at once or on a few sub-vectors separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the intermediate stages compared to the no-segmentation case. However, the difference in compression rate is relatively small if the segmented transformation is developed on subgroups that have poor correlation with each other.
Figure 3.2a: Chart of multilayered segmented PCA
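A minimal sketch of the two-stage SPCA procedure described above is given below. The block boundaries (taken from the 32, 6 and 27-band blocks of Figure 3.2), the 99% variance threshold, and the random data standing in for a hyperspectral image are illustrative assumptions only; in practice the blocks are chosen from the correlation image.

```python
import numpy as np

def pca_block(Y, var_keep=0.99):
    """PCA on one band subgroup Y (bands x pixels); keep the PCs explaining
    roughly var_keep of the total variance of that subgroup."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    C = np.cov(Yc)                                   # subgroup covariance (bands x bands)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    n_keep = np.searchsorted(np.cumsum(vals) / vals.sum(), var_keep) + 1
    return vecs[:, :n_keep].T @ Yc                   # selected PCs (features x pixels)

def segmented_pca(X, blocks, var_keep=0.99):
    """Segmented PCA: X is (bands x pixels); blocks is a list of slices of
    highly correlated contiguous bands, e.g. [slice(0, 32), slice(32, 38), ...]."""
    # First stage: PCA within each correlated block.
    stage1 = np.vstack([pca_block(X[b, :], var_keep) for b in blocks])
    # Second stage: regroup the selected features and transform again.
    return pca_block(stage1, var_keep)

# Illustrative use with random data standing in for a hyperspectral image.
X = np.random.rand(65, 1000)                         # 65 bands, 1000 pixels
features = segmented_pca(X, [slice(0, 32), slice(32, 38), slice(38, 65)])
print(features.shape)
```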
3.2.2 Projection pursuit (PP)
Projection pursuit (PP) refers to a technique first described by Friedman and
Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by
means of selected low dimensional linear projections. To reach this goal, an objective function called the projection index is assigned to every projection, characterizing the structure present in that projection. Interesting projections are then picked up automatically by optimizing the projection index numerically. Interesting projections have usually been defined as those exhibiting departure from normality (the normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985).
Posse (1990) proposed an algorithm based on a random search and a chi-
squared projection index for finding the most interesting plane (two-dimensional
view). The optimization method was able to locate in general the global maximum of
the projection index over all two-dimensional projections (Posse, 1995). The chi-
squared index was efficient, being fast to compute and sensitive to departure from
normality in the core rather than in the tail of the distribution. In this investigation only the chi-squared projection index (Posse, 1995a, 1995b) has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:
(i) A projection pursuit index measures the degree of departure from normality.
(ii) A method for finding the projection that yields the highest value for the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal
value of the projection index and combined it with the structure removal of Friedman
(1987) to get a sequence of interesting 2-D projections. The interesting projections are
found in decreasing order of the value of the PP index. This implies that each
projection found in this manner shows a structure that is less important (in terms of
the projection index) than the previous one. In the following discussion, first the chi-
squared PP index has been described followed by the structure finding procedure.
Finally, the structure removal procedure is illustrated.
3.2.2.1 Posse chi-square index
Posse proposed a projection index based on the chi-square statistic. The plane is first divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, distributed in the form of rings (Figure 3.3). The inner boxes have the same radial width $R/5$ and all boxes have the same angular width of $45^\circ$. $R$ is chosen so that the boxes have approximately the same weight under normally distributed data, which gives a radial width of $(2\log 6)^{1/2}/5$. The outer boxes have weight $1/48$ under normally distributed data. This choice of radial width provides regions with approximately the same probability under the standard bivariate normal distribution (Martinez, 2001). The projection index is given as:
$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\!\left(z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)}\right) - c_k \right]^2 \qquad (3.1)$$
where
$\phi$ — the standard bivariate normal density.
$c_k$ — probability evaluated over the $k$th region using the normal density function, given by $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$.
$B_k$ — box in the projection plane.
$\lambda_j$ — $\lambda_j = j\pi/36$, $j = 0, \ldots, 8$, the angle by which the data are rotated in the plane before being assigned to regions.
$\alpha, \beta$ — orthonormal $p$-dimensional vectors which span the projection plane (they can be the first two PCs or two randomly chosen pixels of the OD set).
$P(\alpha, \beta)$ — the plane consisting of the two orthonormal vectors $\alpha, \beta$.
$z_i^{\alpha}, z_i^{\beta}$ — sphered observations projected onto the vectors $\alpha$ and $\beta$ ($z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$).
$\alpha(\lambda_j)$ — $\alpha \cos \lambda_j - \beta \sin \lambda_j$.
$\beta(\lambda_j)$ — $\alpha \sin \lambda_j + \beta \cos \lambda_j$.
$I_{B_k}$ — the indicator function for region $B_k$.
$PI_{\chi^2}(\alpha, \beta)$ — the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.
The chi-square projection index is not affected by the presence of outliers.
However, it is sensitive to distributions that have a hole in the core, and it will also
yield projections that contain clusters. The chi-square projection pursuit index is fast
and easy to compute, making it appropriate for large sample sizes. Posse (1995a)
provides a formula to approximate the percentiles of the chi-square index.
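A minimal sketch of how the chi-square projection index of Eq. (3.1) can be evaluated is given below. It assumes the sphered data Z are stored with observations as rows, uses $R = (2 \log 6)^{1/2}$ as discussed above, and approximates the region probabilities $c_k$ from the fact that the squared radius of a standard bivariate normal has a chi-square distribution with two degrees of freedom; the function names are illustrative.

```python
import numpy as np
from math import pi

def box_index(z1, z2, R):
    """Region index (0..47) for a point (z1, z2): 40 inner boxes
    (5 radial rings of width R/5 x 8 sectors of 45 deg) plus 8 outer regions."""
    r = np.hypot(z1, z2)
    theta = np.arctan2(z2, z1) % (2 * pi)
    sector = int(theta // (pi / 4))            # 0..7
    ring = int(r // (R / 5))
    if ring >= 5:                              # beyond radius R: outer region
        return 40 + sector
    return ring * 8 + sector

def region_probs(R):
    """Approximate probability of each region under the standard bivariate
    normal, using P(radius < a) = 1 - exp(-a^2 / 2)."""
    c = np.zeros(48)
    edges = np.linspace(0, R, 6)
    for ring in range(5):
        p_ring = np.exp(-edges[ring] ** 2 / 2) - np.exp(-edges[ring + 1] ** 2 / 2)
        c[ring * 8:(ring + 1) * 8] = p_ring / 8
    c[40:] = np.exp(-R ** 2 / 2) / 8           # outer regions (= 1/48 for this R)
    return c

def posse_chi2_index(Z, alpha, beta, R=np.sqrt(2 * np.log(6))):
    """Chi-square projection index of Eq. (3.1) for sphered data Z (n x p)
    and an orthonormal plane (alpha, beta)."""
    n = Z.shape[0]
    c = region_probs(R)
    index = 0.0
    for j in range(9):                         # rotations lambda_j = j*pi/36
        lam = j * pi / 36
        a = np.cos(lam) * alpha - np.sin(lam) * beta
        b = np.sin(lam) * alpha + np.cos(lam) * beta
        za, zb = Z @ a, Z @ b
        counts = np.zeros(48)
        for z1, z2 in zip(za, zb):
            counts[box_index(z1, z2, R)] += 1
        index += np.sum((counts / n - c) ** 2 / c)
    return index / 9
```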
Figure 3.3: Layout of the regions for the chi-square projection index (Modified after Posse, 1995a).
3.2.2.2 Finding the structure (PPEDA algorithm)
For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by
$$a_1 = \frac{\alpha^* + c v}{\left\|\alpha^* + c v\right\|}, \qquad b_1 = \frac{\beta^* - (a_1^T \beta^*)\, a_1}{\left\|\beta^* - (a_1^T \beta^*)\, a_1\right\|}$$
$$a_2 = \frac{\alpha^* - c v}{\left\|\alpha^* - c v\right\|}, \qquad b_2 = \frac{\beta^* - (a_2^T \beta^*)\, a_2}{\left\|\beta^* - (a_2^T \beta^*)\, a_2\right\|} \qquad (3.2)$$
where $c$ is a scalar that determines the size of the neighborhood visited and $v$ is a unit $p$-vector uniformly distributed on the unit $p$-dimensional sphere. The idea is to start with a global search and then to concentrate on the region of the global maximum by decreasing the value of $c$. After a specified number of steps, called half, without an increase in the projection index, the value of $c$ is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global, to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of $m$ such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane.
The steps for PPEDA are given below (a code sketch of this random search follows the list):
1. Sphere the OD set; let $Z$ be the matrix of sphered data.
2. Generate a random starting plane $(\alpha^0, \beta^0)$, where $\alpha^0$ and $\beta^0$ are orthonormal. Consider this as the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane $(\alpha^*, \beta^*)$.
7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.
8. If the index does not improve for a certain number of steps, decrease the value of $c$ by half.
9. Repeat steps 4 to 8 until $c$ becomes some small number (say 0.01).
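The following minimal sketch implements the random search outlined above (Eq. (3.2) and steps 1 to 9). The projection-index argument can be, for example, the posse_chi2_index sketch given earlier; the parameter values and helper names are illustrative, and the data are assumed to be of full rank so that sphering is well defined.

```python
import numpy as np

def sphere(X):
    """Sphere the data: zero mean, identity covariance (X is n_obs x p)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def random_plane(p, rng):
    """Random orthonormal pair (alpha, beta) in R^p."""
    q, _ = np.linalg.qr(rng.standard_normal((p, 2)))
    return q[:, 0], q[:, 1]

def candidate(alpha, beta, c, v):
    """Candidate plane of Eq. (3.2) built from the current best (alpha, beta)."""
    a = alpha + c * v
    a /= np.linalg.norm(a)
    b = beta - (a @ beta) * a
    b /= np.linalg.norm(b)
    return a, b

def ppeda_search(Z, index, c=1.0, half=30, c_min=0.01, rng=None):
    """Random search (Posse, 1990) for the plane maximizing a projection index."""
    rng = rng or np.random.default_rng(0)
    p = Z.shape[1]
    alpha, beta = random_plane(p, rng)
    best = index(Z, alpha, beta)
    no_improve = 0
    while c > c_min:
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)
        improved = False
        for sign in (+1, -1):                 # the two candidates of Eq. (3.2)
            a, b = candidate(alpha, beta, sign * c, v)
            val = index(Z, a, b)
            if val > best:
                best, alpha, beta, improved = val, a, b, True
        no_improve = 0 if improved else no_improve + 1
        if no_improve >= half:                # halve the neighborhood size
            c, no_improve = c / 2.0, 0
    return alpha, beta, best
```

For hyperspectral data, the returned plane (alpha, beta) would be the first interesting structure; structure removal (next sub-section) would then be applied before searching again.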
3.2.2.3 Structure removal
There may be more than one interesting projection, and there may be other
views that reveal insights about the hyperspectral data. To locate other views,
Friedman (1987) proposed a method called structure removal. In this approach, first
we perform the PP algorithm on the data set to obtain the structure which means the
optimal projection plane. The approach then removes the structure found at that
projection, and repeats the projection pursuit process to find a projection that yields
another maximum value of the projection pursuit index. By proceeding in this
manner, it will give a sequence of projections providing informative views of the data.
The procedure repeatedly transforms the projected data to standard normal until
they stop becoming more normal as measured by the projection pursuit index. One
starts with a $p \times p$ matrix whose first two rows are the vectors of the projection obtained from PPEDA. The remaining rows have '1' on the diagonal and '0' elsewhere. For example, if $p = 4$, then
$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$
The Gram-Schmidt orthonormalization process (Strang, 1988) is used to make the rows of $U^*$ orthonormal; let $U$ be the resulting orthonormal matrix. The next step in the structure removal process is to transform the $Z$ matrix using
$$T = U Z \qquad (3.4)$$
where $T$ is a $p \times n$ matrix. With this transformation, the first two rows of $T$, for every transformed observation, are the projections onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then performed by applying a transformation $\Theta$ which transforms the first two rows of $T$ to a standard normal and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows:
$$\Theta(T_1) = \phi^{-1}\!\left[F(T_1)\right]$$
$$\Theta(T_2) = \phi^{-1}\!\left[F(T_2)\right]$$
$$\Theta(T_i) = T_i, \quad i = 3, 4, \ldots, p \qquad (3.5)$$
where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$ and $T_2$ are the first two rows of the matrix $T$, and $F$ is a function defined in Eq. (3.7).
From Eq. (3.5), it is seen that only the first two rows of $T$ change. $T_1$ and $T_2$ can be written as
$$T_1 = \left(z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*}\right)$$
$$T_2 = \left(z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*}\right) \qquad (3.6)$$
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the $j$th observation projected onto the plane spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as follows:
$$z_j^{1(t)} = z_j^{1(t)} \cos \gamma + z_j^{2(t)} \sin \gamma$$
$$z_j^{2(t)} = z_j^{2(t)} \cos \gamma - z_j^{1(t)} \sin \gamma \qquad (3.7)$$
where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_j^{1(t)}$ represents the $j$th element of $T_1$ at the $t$th iteration of the process. The following transformation, applied to the rotated points of Eq. (3.7), replaces each rotated observation by its normal score in the projection:
$$z_j^{1(t+1)} = \phi^{-1}\!\left[\frac{r\!\left(z_j^{1(t)}\right) - 0.5}{n}\right]$$
$$z_j^{2(t+1)} = \phi^{-1}\!\left[\frac{r\!\left(z_j^{2(t)}\right) - 0.5}{n}\right] \qquad (3.8)$$
where $r\!\left(z_j^{1(t)}\right)$ represents the rank of $z_j^{1(t)}$.
With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed using this process, the data are transformed back using
$$Z' = U^T\, \Theta(U Z) \qquad (3.9)$$
From matrix theory (Strang, 1988), it is known that all directions orthogonal to the structure (i.e., all rows of $T$ other than the first two) have not been changed, whereas the structure has been Gaussianized and then transformed back. The next section summarizes the steps of PP.
3.2.2.4 Steps of PP
1. Load the data and set the values of the parameters: the number of best projection planes (N), the number of random starts in the neighborhood (m), the value of c, and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) using the Posse chi-square index (Section 3.2.2.2).
4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the best projection planes (orthogonal to each other) are obtained.
3.2.3 Kernel principal component analysis (KPCA)
Kernel principal component analysis (KPCA) means conducting PCT in feature space (kernel space). KPCA is applied to variables which are nonlinearly related to the input variables. In this section, the KPCA algorithm is developed from the PCA algorithm.
First, $m$ TPs ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the covariance matrix
$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T \qquad (3.10)$$
The covariance matrix $C$ is positive semi-definite; hence its eigenvalues are non-negative:
$$\lambda v = C v \qquad (3.11)$$
For PCA, the eigenvalues are first sorted in decreasing order and the corresponding eigenvectors are found; the PCs are obtained by projecting a test point onto these eigenvectors. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) in Eq. (3.11),
$$C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T v = \lambda v$$
Thus
$$\lambda v = \frac{1}{m} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \qquad (3.12)$$
since $(x x^T)\, v = (x \cdot v)\, x$.
In Eq. (3.12), the term $(x_j \cdot v)$ is a scalar. This means that all the solutions $v$ with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.
$$v = \sum_{i=1}^{m} \alpha_i x_i \qquad (3.13)$$
Steps for KPCA
1. First transform the TPs into feature space $H$ using a kernel function $\Phi$. The data set $(\Phi(x_i), i = 1, \ldots, m)$ in feature space is assumed to be centered to reduce the complexity of the calculation. The covariance matrix of the data set in $H$ takes the form
$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^T \qquad (3.14)$$
2. Find the eigenvalues $\lambda \geq 0$ and corresponding non-zero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix $C$ from the equation
$$\lambda v = C v \qquad (3.15)$$
3. As shown previously (for PCA), all solutions $v$ (with $\lambda \neq 0$) lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.
$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.16)$$
Therefore,
$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.17)$$
Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_j) \Phi(x_j)^T \Phi(x_i) \qquad (3.18)$$
4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Substituting this in Eq. (3.18), the following equation is obtained:
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_j)\, K(x_j, x_i) \qquad (3.19)$$
5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^T$ for all $k = 1, \ldots, m$. Define the $m \times m$ matrix $K$, called the kernel matrix, whose $ij$th element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length $m$ whose $j$th element is the coefficient $\alpha_j$.
6. Eq. (3.19) can then be written as
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_k)^T \Phi(x_j) = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \Phi(x_k)^T \Phi(x_j) \Phi(x_j)^T \Phi(x_i), \quad \forall\, k = 1, 2, \ldots, m \qquad (3.20)$$
Using $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, Eq. (3.20) can be transformed to
$$m \lambda K \alpha = K^2 \alpha \qquad (3.21)$$
To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved:
$$m \lambda \alpha = K \alpha \qquad (3.22)$$
7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel matrix $K$. Let $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$ be the eigenvalues of $K$ and $\beta_1, \beta_2, \ldots, \beta_m$ the corresponding eigenvectors, with $\lambda_p$ being the last non-zero eigenvalue.
 
8. To extract principal components, the projections onto the eigenvectors $\beta_n$ in $H$ ($n = 1, \ldots, p$) need to be computed. Let $x$ be a test point, with image $\Phi(x)$ in $H$. Then
$$\beta_n^T \Phi(x) = \sum_{i=1}^{m} \beta_{ni} \langle \Phi(x_i), \Phi(x) \rangle \qquad (3.23)$$
9. In the above algorithm, it has been assumed that the data set is centered, but it is certainly difficult to obtain the mean of the mapped data in feature space $H$ (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in feature space. However, there is a way to do it by slightly modifying the equation for kernel PCA: instead of $K$, the kernel matrix
$$\widetilde{K}_{i,j} = \left(K - 1_m K - K 1_m + 1_m K 1_m\right)_{i,j}, \quad \text{where } (1_m)_{ij} := \frac{1}{m} \;\; \forall\, i, j \qquad (3.24)$$
is diagonalized.
Figure 3.4: (a) Input points before kernel PCA. (b) Output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).
Figure 3.5 provides the outline of the KPCA algorithm.
 
Figure 3.5: Outline of KPCA algorithm
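A minimal sketch of the KPCA steps above (kernel matrix, centering by Eq. (3.24), eigen-decomposition by Eq. (3.22), and projection by Eq. (3.23)) is given below. The Gaussian kernel, its bandwidth, and the data sizes are illustrative assumptions; the eigenvectors are rescaled so that the corresponding feature-space eigenvectors have unit norm.

```python
import numpy as np

def kpca(X_train, X_test, kernel, n_components):
    """Kernel PCA sketch following Eqs. (3.14)-(3.24).
    X_train: (m x n) training pixels, X_test: (q x n) pixels to project."""
    m = X_train.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    one_m = np.full((m, m), 1.0 / m)
    # Centering in feature space, Eq. (3.24)
    K_tilde = K - one_m @ K - K @ one_m + one_m @ K @ one_m
    vals, vecs = np.linalg.eigh(K_tilde)               # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # Rescale so that the feature-space eigenvectors have unit norm
    # (only the leading, non-zero eigenvalues are used here).
    betas = np.array([vecs[:, i] / np.sqrt(vals[i]) for i in range(n_components)]).T
    # Projection of test points, Eq. (3.23), with the test kernel centered consistently.
    K_test = np.array([[kernel(x, xj) for xj in X_train] for x in X_test])
    one_q = np.full((K_test.shape[0], m), 1.0 / m)
    K_test_c = K_test - one_q @ K - K_test @ one_m + one_q @ K @ one_m
    return K_test_c @ betas                             # (q x n_components) features

# Illustrative use with a Gaussian kernel (sigma chosen arbitrarily here).
rbf = lambda a, b, s=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))
Xtr = np.random.rand(50, 5)                             # 50 training pixels, 5 bands
Xte = np.random.rand(10, 5)
print(kpca(Xtr, Xte, rbf, n_components=3).shape)
```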
3.2.4 Orthogonal subspace projection (OSP)
The idea of orthogonal subspace projection is to eliminate all unwanted or undesired spectral signatures (background) within a pixel and then use a matched filter to extract the desired spectral signature (endmember) present in that pixel.
 
3.2.4.1 Automated target generation process algorithm (ATGP)
In hyperspectral image analysis a pixel may encompass many different materials; such pixels are called mixed pixels and contain multiple spectral signatures. Let a column vector $r_i$ represent the $i$th mixed pixel by the linear model
$$r_i = M \alpha_i + n_i \qquad (3.25)$$
where $r_i$ is an $l \times 1$ column vector and $l$ is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember; assume that there are $p$ spectrally distinct endmembers in the $i$th mixed pixel. $M$ is a matrix of dimension $l \times p$ made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$. The system is considered overdetermined ($l > p$), and $m_j$ denotes the spectral signature of the $j$th distinct material or endmember. Let $\alpha_i$ be a $p \times 1$ column vector given by $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^T$, where the $j$th element represents the fraction of the $j$th signature present in the $i$th mixed pixel. $n_i$ is an $l \times 1$ column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where $I$ is an $l \times l$ identity matrix.
In Eq. (3.25), each $r_i$ is assumed to be a linear combination of the $p$ endmembers with weight coefficients given by the fraction vector $\alpha_i$. The term $M \alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures; in other words, targets are separated from the background. When searching for a single spectral signature this can be written as
$$M \alpha = d\, \alpha_p + U \gamma \qquad (3.26)$$
where $d$ is the $l \times 1$ desired signature of interest (the column vector $m_p$) and $\alpha_p$ is $1 \times 1$, the fraction of the desired signature. The matrix $U$ is composed of the remaining column vectors of $M$; these are the undesired spectral signatures or background information. It is given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components (fractions) of $\alpha$.
Suppose $P$ is an operator which eliminates the effects of $U$, the undesired signatures. To do this, an operator (the orthogonal subspace operator) is developed that projects $r$ onto a subspace orthogonal to the columns of $U$. This results in a vector that only contains energy associated with the target $d$ and the noise $n$. The operator used is the $l \times l$ matrix
$$P = I - U (U^T U)^{-1} U^T \qquad (3.27)$$
The operator $P$ maps $d$ into a space orthogonal to the space spanned by the uninteresting signatures in $U$. Applying the operator $P$ to the mixed pixel $r$ of Eq. (3.25),
$$P r = \alpha_p P d + P U \gamma + P n \qquad (3.28)$$
It should be noticed that $P$ operating on $U \gamma$ reduces the contribution of $U$ to zero (close to zero in real data applications). Therefore, from the above rearrangement,
$$P r = \alpha_p P d + P n \qquad (3.29)$$
3.2.4.2 Signal-to-noise ratio (SNR) maximization
The second step in deriving the pixel classification operator is to find the $1 \times l$ operator $X^T$ that maximizes the SNR. Operating on Eq. (3.28) gives
$$X^T P r = \alpha_p X^T P d + X^T P U \gamma + X^T P n \qquad (3.30)$$
The operator $X^T$ acting on $P r$ produces a scalar (Ientilucci, 2001). The SNR is given by
$$\lambda = \frac{\alpha_p^2\, X^T P d\, d^T P^T X}{X^T P\, E\!\left[n n^T\right] P^T X} \qquad (3.31)$$
$$\lambda = \left(\frac{\alpha_p^2}{\sigma^2}\right) \frac{X^T P d\, d^T P^T X}{X^T P P^T X} \qquad (3.32)$$
where $E[\;]$ denotes the expected value. Maximization of this quotient is the generalized eigenvector problem
$$P d\, d^T P^T X = \lambda' P P^T X \qquad (3.33)$$
where $\lambda' = \left(\dfrac{\sigma^2}{\alpha_p^2}\right) \lambda$. The value of $X^T$ which maximizes $\lambda$ can be determined using the techniques outlined by Miller, Farison and Shin (1992) and the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^T$ which maximizes the SNR is
$$X^T = k\, d^T \qquad (3.34)$$
where $k$ is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$ vector
$$q^T = d^T P \qquad (3.35)$$
This operator first nulls the interfering signatures and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar which is a measure of the presence of the signature of interest. The ultimate aim is to reduce the $l$ images that make up the hyperspectral image cube into a single image where pixels with high intensity indicate the presence of the desired signature.
This operator can easily be extended to seek out $k$ signatures of interest. The vector operator simply becomes a $k \times l$ matrix operator given by
$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \qquad (3.36)$$
When the operator in Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Ultimately, the $l$-dimensional hyperspectral image is reduced to a $k$-dimensional feature-extracted image, where each band corresponds to one desired signature and pixels with high intensity indicate its presence.
The above algorithm is discussed with the following example:
Let us start with three vectors or classes, each six elements or bands long. The
vectors are in reflectance units and can be seen below.
$$Concrete = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix} \qquad Tree = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix} \qquad Water = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$
Suppose the image consists of 100 pixels, ordered from left to right, and let the 40th pixel be
$$pixel_{40} = 0.08\,(concrete) + 0.75\,(tree) + 0.07\,(water) + noise \qquad (3.37)$$
Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image; the same procedure can be followed to extract the tree and water materials.
Assume that $pixel_{40}$ is made up of some weighted linear combination of the endmembers:
$$pixel_{40} = M \alpha + noise \qquad (3.38)$$
Now $M \alpha$ can be broken up into the desired, $d\alpha$, and undesired, $U\gamma$, signatures. Assign the spectra to the desired signature $d$ and the undesired signatures $U$: let concrete be the vector $d$, and let tree and water be the column vectors of the matrix $U$. The fractions of mixing are unknown to us, but it is known that $pixel_{40}$ is made up of some combination of $d$ and $U$:
$$d = [\,concrete\,], \qquad U = [\,tree \;\; water\,]$$
It is now required to reduce the effect of $U$. To do this, a projection operator $P$ is needed which, when operated on $U$, reduces its contribution to zero. To find concrete, $d$, $pixel_{40}$ is projected onto a subspace that is orthogonal to the columns of $U$ using the operator $P$. In other words, $P$ maps $d$ into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of $U$. If $P$ is operated on $U$, which contains tree and water, the effect of $U$ is minimized:
$$P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad (3.39)$$
Now let $r_1 = pixel_{40}$ and $n = noise$; then from Eq. (3.29),
$$P r_1 = \alpha_p P d + P n \qquad (3.40)$$
The operator $x^T$ which maximizes the signal-to-noise ratio (SNR) now needs to be found. The operator $x^T$ acting on $P r_1$ produces a scalar. As stated before, the value of $x^T$ which maximizes the SNR is $X^T = k\, d^T$. This leads to the overall OSP operator of Eq. (3.35). In this way the matrix $Q$ of Eq. (3.36) can be formed; the entire data vector can then be projected along the columns of $Q$, and the OSP feature-extracted image is formed.
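The OSP example above can be reproduced with the short sketch below, which builds the projection operator of Eq. (3.27), verifies Eq. (3.39), and applies the classification operator of Eq. (3.35) to the mixed pixel of Eq. (3.37) (noise assumed zero).

```python
import numpy as np

# Endmember spectra from the example above (6 bands each).
concrete = np.array([0.26, 0.30, 0.31, 0.31, 0.31, 0.31])
tree     = np.array([0.07, 0.07, 0.11, 0.54, 0.55, 0.54])
water    = np.array([0.07, 0.13, 0.19, 0.25, 0.30, 0.34])

d = concrete                        # desired signature
U = np.column_stack([tree, water])  # undesired signatures

# Orthogonal subspace projection operator, Eq. (3.27): P = I - U (U^T U)^{-1} U^T
l = d.shape[0]
P = np.eye(l) - U @ np.linalg.inv(U.T @ U) @ U.T

print(np.round(P @ U, 10))          # ~zero: the undesired signatures are nulled, Eq. (3.39)

# Overall OSP operator for the desired signature, Eq. (3.35): q^T = d^T P
q = d @ P

# The mixed pixel of Eq. (3.37), with the noise taken as zero.
pixel = 0.08 * concrete + 0.75 * tree + 0.07 * water
print(q @ pixel)                    # scalar proportional to the concrete abundance
```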
3.3 Supervised classifiers
This section describes the mathematical background of the supervised classifiers. It first describes the Bayesian decision rule, followed by the decision rule for the Gaussian maximum likelihood (GML) classifier. Afterwards it describes the k-nearest neighbor (KNN) and support vector machine (SVM) classification rules.
3.3.1 Bayesian decision rule
In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are $K$ classes, let $f_k(x)$ be the distribution function of the $k$th class, where $0 < k \leq K$, and let $P(c_k)$ be the prior probability of the $k$th class, such that $\sum_{k=1}^{K} P(c_k) = 1$.
For any class $k$, the posterior probability for a pixel vector $x$ is denoted by $p(c_k | x)$ and defined by (assuming all classes are mutually exclusive):
$$p(c_k | x) = \frac{P(x | c_k)\, P(c_k)}{\sum_{k=1}^{K} f_k(x)\, P(c_k)} \qquad (3.41)$$
Therefore, the Bayes decision rule is:
$$x \in c_i \quad \text{if} \quad p(c_i | x) = \max_k\, p(c_k | x) \qquad (3.41a)$$
 
3.3.2 Gaussian maximum likelihood classification (GML)
The Gaussian maximum likelihood classifier assumes that the distribution of the data points is Gaussian (normally distributed) and classifies an unknown pixel based on the variance and covariance of the spectral response patterns. The classification is based on the probability density function associated with the training data: pixels are assigned to the most likely class based on a comparison of the posterior probabilities that they belong to each of the signatures being considered. Under this assumption, the distribution of a category response pattern can be completely described by the mean vector and the covariance matrix. With these parameters, the statistical probability of a given pixel value being a member of a particular land cover class can be computed (Lillesand et al., 2002). GML classification can obtain the minimum classification error under the assumption that the spectral data of each class are normally distributed. It considers not only the cluster centre but also its shape, size and orientation, by calculating a statistical distance based on the mean values and covariance matrices of the clusters. The decision boundary function for GML classification is
$$g_k(x) = -\frac{1}{2}\left[\ln\left|\hat{\Sigma}_k\right| + (x - \hat{\mu}_k)^T\, \hat{\Sigma}_k^{-1}\, (x - \hat{\mu}_k)\right] \qquad (3.42)$$
and the final Bayesian decision rule is
$$x \in c_j \quad \text{if} \quad g_j(x) = \max_k\, g_k(x)$$
where $g_k(x)$ is the decision boundary function for the $k$th class.
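A minimal sketch of the GML decision rule of Eq. (3.42) is given below. Equal prior probabilities are assumed (so the prior term is dropped, as in Eq. (3.42)); the synthetic data, class structure and function names are illustrative only.

```python
import numpy as np

def gml_train(train_pixels, train_labels):
    """Estimate per-class mean vectors and covariance matrices from TPs.
    train_pixels: (N x n) array, train_labels: length-N array of class ids."""
    stats = {}
    for c in np.unique(train_labels):
        Xc = train_pixels[train_labels == c]
        stats[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return stats

def gml_classify(x, stats):
    """Assign pixel x to the class maximizing the discriminant of Eq. (3.42):
    g_k(x) = -1/2 [ ln|S_k| + (x - mu_k)^T S_k^{-1} (x - mu_k) ]."""
    best_class, best_g = None, -np.inf
    for c, (mu, S) in stats.items():
        diff = x - mu
        g = -0.5 * (np.log(np.linalg.det(S)) + diff @ np.linalg.solve(S, diff))
        if g > best_g:
            best_class, best_g = c, g
    return best_class

# Illustrative use with two synthetic classes in 3 "bands".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (100, 3)), rng.normal(0.6, 0.05, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
stats = gml_train(X, y)
print(gml_classify(np.array([0.58, 0.61, 0.59]), stats))   # -> 1
```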
3.3.3 k-nearest neighbor classification
The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical applications; one of its shortcomings is lazy learning, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951).
The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant. An unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.
          
 
 
Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified
either to the first class of squares or to the second class of triangles. If k
= 3, it is classified to the second class because there are 2 triangles and
only 1 square inside the inner circle. If k = 5, it is classified to first class
(3 squares vs. 2 triangles inside the outer circle).If k = 11, it is classified
to first class (6 squares vs. 5 triangles) (Modified after Wikipedia, 2009).
Let $x$ be an $n$-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) be the $n$-dimensional TPs. The Euclidean distance between them is defined by
$$d_i(x, y_i) = \sqrt{(x_1 - y_{i1})^2 + (x_2 - y_{i2})^2 + \ldots + (x_n - y_{in})^2} \qquad (3.43)$$
where $x = (x_1, x_2, \ldots, x_n)$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{d_1, d_2, \ldots, d_p\}$, and $p$ is the number of TP.
The final KNN decision rule is:
$$x \in c_j \quad \text{if at least} \quad \begin{cases} \left\lceil \dfrac{k+1}{2} \right\rceil, & k \text{ even} \\[2mm] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \quad \text{of the } k \text{ smallest elements of } D \text{ correspond to class } c_j \qquad (3.44)$$
In case of a tie, the test pixel is assigned to the class $c_j$ whose mean vector is at minimum distance from it. Here $k$ ($1 \leq k \leq p$) is a user-defined parameter giving the number of nearest neighbors used for classification. The outline of the KNN classification algorithm is given in Figure 3.7.
Figure 3.7: Outline of KNN algorithm
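A minimal sketch of the KNN rule described above, including the tie-break by distance to the class mean vectors, is given below; the synthetic data and function names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_pixels, train_labels, class_means, k=5):
    """Classify test pixel x by majority vote among its k nearest TPs
    (Euclidean distance, Eq. (3.43)); ties are broken by the distance
    to the class mean vectors, as described above."""
    d = np.sqrt(np.sum((train_pixels - x) ** 2, axis=1))   # distances to all TPs
    nearest = train_labels[np.argsort(d)[:k]]              # labels of the k nearest TPs
    votes = Counter(nearest.tolist())
    top = votes.most_common()
    winners = [c for c, v in top if v == top[0][1]]
    if len(winners) == 1:
        return winners[0]
    # Tie-break: assign to the tied class whose mean vector is closest to x.
    return min(winners, key=lambda c: np.linalg.norm(x - class_means[c]))

# Illustrative use with two synthetic classes in 2 "bands".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.05, (50, 2)), rng.normal(0.7, 0.05, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
print(knn_classify(np.array([0.65, 0.7]), X, y, means, k=5))   # -> 1
```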
3.3.4 Support vector machine (SVM)
The foundations of support vector machines (SVM) were developed by Vapnik (1995). The formulation embodies the structural risk minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks: SRM minimizes an upper bound on the expected risk, as opposed to ERM, which minimizes the error on the training data. SVMs were developed to solve classification problems, but they have since been extended to the domain of regression problems (Vapnik et al., 1997).
SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the class of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions:
(i) All the data vectors that belong to the same class are placed on the same side of the separating hyperplane.
(ii) The distance between the two closest data vectors in the two classes is maximized (Vapnik, 1982).
The main aim of SVM is thus to define an optimum hyperplane between two classes which maximizes the margin between them. For each class, the data vectors forming the boundary of the class are called the support vectors (SV), and the hyperplane is called the decision surface (Pal, 2002).
3.3.4.1 Statistical learning theory
The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known classes and for predicting the outcome for data points of unknown identity. Two induction principles are commonly used. The first is ERM, whose aim is to reduce the training error; the second is SRM, whose goal is to minimize the upper bound on the expected error over the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999): first, it does not depend on the unknown cumulative distribution function, and second, it can be minimized with respect to the parameters used in the decision rule.
3.3.4.2 Vapnik–Chervonenkis dimension (VC-dimension)
The VC-dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by $h$, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways. For example, for an arbitrary two-class problem, the VC-dimension is the maximum number of points $k$ which can be separated into two classes without error in all possible $2^k$ ways (Varshney and Arora, 2004).
3.3.4.3 Support vector machine algorithm with quadratic optimization
method (SVM_QP): 
The procedure for obtaining a separating hyperplane by SVM is explained here for the simple linearly separable case of two classes which can be separated by a hyperplane; it can be extended to the multiclass classification problem. The procedure can also be extended, via the kernel method for SVM, to the case where a hyperplane cannot separate the two classes.
Let there be $n$ training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, $m$ is the dimension of the data vector, and each sample belongs to either of the two classes labeled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in $m$-dimensional space whose orientation is given by a vector $w$ and whose location is determined by a scalar $b$, the offset of the hyperplane from the origin (Figure 3.8). If such a hyperplane exists, the given set of training data points must satisfy the following inequalities:
$$w \cdot x_i + b \geq +1, \quad \forall\, i : y_i = +1 \qquad (3.45)$$
$$w \cdot x_i + b \leq -1, \quad \forall\, i : y_i = -1 \qquad (3.46)$$
Thus, the equation of the hyperplane is given by $w \cdot x + b = 0$.
Figure 3.8: Linear separating hyperplane for linearly separable data (Modified after
Gunn, 1998).
The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality:
$$y_i (w \cdot x_i + b) \geq 1 \qquad (3.47)$$
Thus, the decision rule for the linearly separable case can be written as
$$x_i \in \operatorname{sign}(w \cdot x_i + b) \qquad (3.48)$$
where $\operatorname{sign}(\cdot)$ is the signum function, whose value is +1 for any argument greater than or equal to zero and −1 otherwise. The signum function can thus easily represent the two classes given by the labels +1 and −1.
The separating hyperplane (Figure 3.8) separates the two classes optimally when its margin from both classes is equal and maximum (Varshney, 2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation for a point $x$ from the hyperplane defined by $w$ and $b$. It is given by
$$D(x; w, b) = \frac{|w \cdot x + b|}{\|w\|_2} \qquad (3.49)$$
where $\|\cdot\|_2$ denotes the second norm, which is equivalent to the Euclidean length of the vector for which it is computed, and $|\cdot|$ is the absolute value function. Let $d$ be the value of the margin between the two separating planes. It can be expressed as the separation between the planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$:
$$d = \frac{(w \cdot x + b) + 1}{\|w\|_2} - \frac{(w \cdot x + b) - 1}{\|w\|_2} = \frac{2}{\|w\|_2} = \frac{2}{\sqrt{w^T w}} \qquad (3.49a)$$
To obtain an optimal hyperplane, the margin value $d$ should be maximized, i.e. $2/\|w\|_2$ should be maximized, which is equivalent to minimizing the 2-norm of the vector $w$. Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to
$$\Phi(w) = \frac{1}{2} w^T w \qquad (3.50)$$
A constrained optimization problem can be constructed for minimizing the objective function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of constrained optimization problem, with a convex objective function of $w$ and linear constraints, is called a primal problem and can be solved using standard quadratic programming (QP) optimization techniques. The QP optimization can be implemented by replacing the inequalities with a simpler form, transforming the problem into a dual space representation using Lagrange multipliers $\lambda_i$ (Luenberger, 1984). The vector $w$ can be defined in terms of the Lagrange multipliers $\lambda_i$ as shown below:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.51)$$
The dual optimization problem obtained through the Lagrange multipliers $\lambda_i$ thus becomes
$$\max_{\lambda}\; L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \qquad (3.52)$$
subject to the constraints:
$$\sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.53)$$
$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \qquad (3.54)$$
The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero; the training samples corresponding to nonzero multipliers are the SVs. The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^o = (\lambda_1^o, \lambda_2^o, \ldots, \lambda_{n_s}^o)$, where $n_s$ is the number of support vectors found. Substituting these in Eq. (3.51) gives the orientation of the optimal separating hyperplane $w^o$:
$$w^o = \sum_{i=1}^{n} \lambda_i^o y_i x_i \qquad (3.55)$$
The offset from the origin, $b^o$, is determined from the equation
$$b^o = -\frac{1}{2}\left[w^o \cdot x_{+1}^o + w^o \cdot x_{-1}^o\right] \qquad (3.56)$$
where $x_{+1}^o$ and $x_{-1}^o$ are support vectors of class labels +1 and −1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and −1:
$$f(x) = \operatorname{sign}\!\left(\sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b^o\right) \qquad (3.57)$$
Eq. (3.57) implies that
$$x \in \operatorname{sign}\!\left(\sum_{\text{support vectors}} \lambda_i^o y_i (x_i \cdot x) + b^o\right) \qquad (3.58)$$
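A minimal sketch of the SVM_QP training procedure for the linearly separable two-class case is given below. It solves the dual problem of Eqs. (3.52)–(3.54) with scipy's SLSQP routine as a stand-in for a dedicated QP optimizer, recovers w and b from Eqs. (3.55)–(3.56), and classifies with the sign rule of Eqs. (3.57)–(3.58); the data, tolerances and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def svm_qp_train(X, y):
    """Solve the dual of Eqs. (3.52)-(3.54) for a linearly separable two-class
    set (y in {-1, +1}), then recover w and b from Eqs. (3.55)-(3.56)."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T            # Q_ij = y_i y_j (x_i . x_j)
    dual = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()   # negative of Eq. (3.52)
    res = minimize(dual, np.zeros(n), method="SLSQP",
                   bounds=[(0, None)] * n,                # Eq. (3.54)
                   constraints={"type": "eq", "fun": lambda lam: lam @ y})  # Eq. (3.53)
    lam = res.x
    w = ((lam * y)[:, None] * X).sum(axis=0)              # Eq. (3.55)
    # Support vectors are the TPs with nonzero multipliers; take the strongest
    # one from each class to evaluate Eq. (3.56).
    x_pos = X[np.argmax(np.where(y == +1, lam, -np.inf))]
    x_neg = X[np.argmax(np.where(y == -1, lam, -np.inf))]
    b = -0.5 * (w @ x_pos + w @ x_neg)                    # Eq. (3.56)
    return w, b, lam

def svm_classify(x, w, b):
    return np.sign(w @ x + b)                             # Eqs. (3.57)-(3.58), linear case

# Illustrative use on two well-separated synthetic classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b, lam = svm_qp_train(X, y)
print(svm_classify(np.array([1.5, 2.0]), w, b))           # -> 1.0
```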

  • 9. viii    5.2 Results for parametric and non-parametric classifiers...................................75 5.2.1 Results of classification using GML classifier (GMLC) ...........................75 5.2.2 Class-wise comparison of result for GMLC...............................................81 5.2.3 Classification results using KNN classifier (KNNC) ................................82 5.2.4 Class wise comparison of results for KNNC .............................................91 5.3 Experiment results for SVM based classifiers.................................................92 5.3.1 Experiment results for SVM_QP algorithm..............................................93 5.3.2 Experiment results for SVM_SMO algorithm...........................................97 5.3.3 Experiment results for KPCA_SVM algorithm.......................................100 5.3.4 Class wise comparison of the best result of SVM ...................................103 5.3.5 Comparison of results for different SVM algorithms .............................104 5.4 Comparison of best results of different classifiers.........................................105 5.5 Ramifications of results...................................................................................107 CHAPTER 6 - Summary of Results and Conclusions .......109 6.1 Summary of results..........................................................................................109 6.2 Conclusions.......................................................................................................112 6.3 Recommendations for future work .................................................................112 REFERENCES………………………………………………….……………….115 APPENDIX A……………………………………………………………………..120   
  • 10. ix    LIST OF TABLES Table Title Page 2.1 Summary of literature review 18 3.1 Examples of common kernel functions 23 4.1 List of parameters 68 5.1 The time taken for each FE techniques 71 5.2 The best kappa values and z-statistic (at 5% significance values) for GML 80 5.3 Ranking of FE techniques and time required to obtain the best k- value 80 5.4 Classification with KNNC on OD and feature extracted data set 84 5.5 The best k-values and z-statistic for KNNC 89 5.6 Rank of FE techniques and time required to obtain best k-value 90 5.7 The best kappa accuracy and z-statistic for SVM_QP on different feature modified data set 95 5.8 The best k-value and z-statistic for SVM_SMO on OD and different feature modified data set 100 5.9 The best k-value and z-statistic for KPCA_SVM on original and different feature modified data sets 104 5.10 Comparison of the best k-values with different FE techniques, classification time, and z-statistic for different SVM algorithms 106 5.11 Statistical comparison of different classifier’s results obtained for different data sets 107 5.12 Ranking of different classification algorithms depending on classification accuracy and time. (Rank: 1 indicate the best) 109
  • 11. x    LIST OF FIGURES Figure Title Page 1.1 Hyperspectral image cube 2 1.2 Fractional volume of a hypersphere inscribed in hypercube decrease as dimension increases 4 1.3 Study area in La Mancha region, Madrid, Spain (Pal, 2002 8 1.4 FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area 8 1.5 Google earth image of study area 9 3.1 Overview of FE methods 24 3.2 Formation of blocks for SPCA 26 3.2a Chart of multilayered segmented PCA 27 3.3 Layout of the regions for the chi-square projection index 30 3.4 (a) Input points before kernel PCA (b) Output after kernel PCA. The three groups are distinguishable using the first component only 37 3.5 Outline of KPCA algorithm 38 3.6 KNN classification scheme 45 3.7 Outline of KNN algorithm 46 3.8 Linear separating hyperplane for linearly separable data 49 3.9 Non-linear mapping scheme 52 3.10 Brief description of SVM_QP algorithm 54 3.11 Overview of KPCA_SVM algorithm 58 3.12 Definitions and values used in applying one-tail hypothesis testing 60 4.1 SPCA feature extraction method 62
  • 12. xi    4.2 Projection pursuit feature extraction method 63 4.3 KPCA feature extraction method 63 4.4 OSP feature extraction method 64 4.5 Overview of classification procedure 66 4.6 Experimental scheme for Set-I experiments 67 4.7 The experimental scheme for advanced classifier (Set-II) 68 5.1 Correlation image of the original data set consisting of three blocks having bands 32, 6 and 27 respectively 70 5.2 Projection of the data points. (a) Most interesting projection direction (b) Second most interesting projection direction 71 5.3 First six Segmented Principal Components (SPCs) (b) shows water body and salt lake 72 5.4 First six Kernel Principal Components (KPCs) obtained by using 400 TP 72 5.5 First six features obtained by using eight end-members 73 5.6 Two components of most interesting projections 73 5.7 Correlation images after applying various feature extraction techniques 74 5.8 Overall kappa value observed for GML classification on different feature extracted data sets using selected different bands 78 5.9 Comparison of kappa values and classification times for GML classification method 81 5.10 Best producer accuracy of individual classes observed for GMLC on different feature extracted data set with respect to different set of TP 82 5.11 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 25 TP 85 5.12 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 100 TP 86 5.13 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 200 TP 87 5.14 Overall accuracy observed for KNN classification of OD and feature extracted data sets for 300 TP 88 5.15 Time comparison for KNN classification. Time for different bands 91
  • 13. xii    at different neighbors for (a) 300 TP (b) 200 TP training data per class 5.16 Comparison of best k-value and classification time for original and feature extracted data set 91 5.17 Class wise accuracy comparison of OD and different feature extracted data for KNNC 92 5.18 Overall kappa values observed for classification of FE modified data sets using SVM and QP optimizer 94 5.19 Classification time comparison using 200 and 300 TP per class 97 5.20 Overall kappa values observed for classification of original and FE modified data sets using SVM with SMO optimizer 100 5.21 Comparison of classification time different set of TPs with respect to number of bands for SVM_SMO classification algorithm 101 5.22 Overall kappa values observed for classification original and feature modified data sets using KPCA_SVM algorithm. 103 5.23 Comparison of classification accuracy of individual classes for different SVM algorithms 105
  • 14. xiii LIST OF ABBREVIATIONS
AC: Advance classifier
DAFE: Discriminant analysis feature extraction
DAIS: Digital airborne imaging spectrometer
DBFE: Decision boundary feature extraction
FE: Feature extraction
GML: Gaussian maximum likelihood
HD: Hyperspectral data
ICA: Independent component analysis
KNN: k-nearest neighbors
k-value: Kappa value
KPCA: Kernel principal component analysis
KPCA_SVM: Support vector machine with kernel principal component analysis
MS: Multispectral data
NWFE: Nonparametric weighted feature extraction
Ncri: Critical value
OD: Original data
OSP: Orthogonal subspace projection
PCA: Principal component analysis
PCT: Principal component transform
PP: Projection pursuit
rbf: Radial basis function
SPCA: Segmented principal component analysis
SV: Support vectors
SVM: Support vector machine
SVM_QP: Support vector machine with quadratic programming optimizer
  • 15. xiv SVM_SMO: Support vector machine with sequential minimal optimizer
TP: Training pixels
Dedicated to my family & guide
  • 16. 1 CHAPTER 1 INTRODUCTION Remote sensing technology has brought a new dimension to earth observation, mapping and many other fields. In the early days of this technology, multispectral sensors were used for capturing data. Multispectral sensors capture data in a small number of bands with broad wavelength intervals. Because of the few spectral bands, their spectral resolution is insufficient to discriminate amongst many earth objects. If, however, the spectral measurement is performed using hundreds of narrow wavelength bands, then many earth objects can be characterized precisely. This is the key concept of hyperspectral imagery. Compared to a multispectral (MS) data set, hyperspectral data (HD) has a larger information content, is more voluminous and also differs in its characteristics. Extracting this large amount of information from HD therefore remains a challenge, and cost-effective, computationally efficient procedures are required to classify HD. Data classification is the categorization of data for its most effective and efficient use. The desired result of classification is a high-accuracy thematic map, and HD has the potential to provide one. This chapter introduces the concept of high dimensional space, HD and the difficulties in classifying HD. The next part focuses on the objectives of the thesis, followed by an overview of the data set used in this thesis. Details of the software used are mentioned in the next part of this chapter, followed by the structure of the thesis. 1.1 High dimensional space In mathematics, an n-dimensional space is a topological space whose dimension is n (where n is a fixed natural number). A typical example is the n-dimensional Euclidean space, which describes Euclidean geometry in n dimensions.
  • 17. 2 n-dimensional spaces with large values of n are sometimes called high-dimensional spaces (Werke, 1876). Many familiar geometric objects can be expressed in some number of dimensions. For example, the two-dimensional triangle and the three-dimensional tetrahedron can be seen as specific instances of objects in n-dimensional space. In addition, the circle and the sphere are particular forms of the n-dimensional hypersphere for n = 2 and n = 3 respectively (Wikipedia, 2010). 1.1.1 What is hyperspectral data? When the spectral measurement is made using hundreds of narrow contiguous wavelength intervals, the captured image is called a hyperspectral image. Most commonly, the hyperspectral image is represented by a hyperspectral image cube (Figure 1.1). In this cube, the x and y axes specify the size of the image and the λ axis specifies the dimension, or the number of bands. Hyperspectral sensors collect information as a set of images, one for each band, and each image represents a range of the electromagnetic spectrum. Figure 1.1: Hyperspectral image cube (Richards and Jia, 2006) These images are then combined to form a three-dimensional hyperspectral cube. As the dimension of HD is very high, it is comparable with a high-dimensional space, and HD shares the characteristics of high-dimensional spaces, which are described in the following section.
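In practice, the cube of Figure 1.1 is simply a three-dimensional array indexed by row, column and band, which is unfolded into a two-dimensional pixels-by-bands matrix before classification. The short sketch below is only an illustration of this layout; it is written in Python/NumPy for convenience rather than the MATLAB used later in the thesis, and it fills the cube with synthetic values (the 512 x 512 x 65 shape mirrors the pre-processed data subset described in Section 1.5).

```python
import numpy as np

# A synthetic hyperspectral cube: rows (x) by columns (y) by bands (lambda).
rows, cols, bands = 512, 512, 65
cube = np.random.rand(rows, cols, bands).astype(np.float32)

# One image of the scene in a single band (a rows x cols slice) ...
band_10 = cube[:, :, 10]

# ... and the full spectrum of a single pixel (a vector with one value per band).
spectrum = cube[200, 300, :]

# For classification the cube is unfolded into a 2-D matrix with one row per pixel
# and one column per band (its transpose, bands x pixels, matches the n x m
# convention used for the matrix X in Chapter 3).
pixels = cube.reshape(rows * cols, bands)
print(band_10.shape, spectrum.shape, pixels.shape)   # (512, 512) (65,) (262144, 65)
```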
  • 18. 3 1.1.2 Characteristics of high dimensional space High dimensional spaces, spaces with a dimensionality greater than three, have properties that differ substantially from our usual sense of distance, volume, and shape. In particular, in a high-dimensional Euclidean space, volume expands far more rapidly with increasing diameter compared to lower-dimensional spaces, so that, for example: (i) almost all of the volume within a high-dimensional hypersphere lies in a thin shell near its outer "surface"; (ii) the volume within a high-dimensional hypersphere relative to a hypercube of the same width tends to zero as the dimensionality tends to infinity, and almost all of the volume of the hypercube is concentrated in its "corners". The above characteristics have two important consequences for high dimensional data. The first is that high dimensional space is mostly empty. As a consequence, high dimensional data can be projected to a lower dimensional subspace without losing significant information in terms of separability among the different statistical classes (Jimenez and Landgrebe, 1995). The second consequence is that normally distributed data will have a tendency to concentrate in the tails; similarly, uniformly distributed data will be more likely to be collected in the corners, making density estimation more difficult. Local neighborhoods are almost empty, requiring the bandwidth of estimation to be large and producing the effect of losing detailed density estimation (Abhinav, 2009).
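Observation (ii) above can be checked directly. For a hypercube of side $d$, the inscribed hypersphere has radius $d/2$, so the fraction of the cube's volume it occupies is $\pi^{n/2} / (2^n\,\Gamma(n/2 + 1))$, independent of $d$. The minimal sketch below (Python is used here purely for illustration) evaluates this fraction and reproduces the trend shown in Figure 1.2.

```python
import math

def inscribed_sphere_fraction(n):
    """Fraction of an n-dimensional hypercube's volume occupied by the inscribed hypersphere.

    For a cube of side d the inscribed sphere has radius d/2, so the ratio
    pi^(n/2) (d/2)^n / Gamma(n/2 + 1) / d^n does not depend on d.
    """
    return math.pi ** (n / 2) / (2 ** n * math.gamma(n / 2 + 1))

for n in (1, 2, 3, 5, 10, 20):
    print(n, inscribed_sphere_fraction(n))
# 1 -> 1.0, 2 -> 0.785, 3 -> 0.524, 5 -> 0.164, 10 -> 0.0025, 20 -> 2.5e-08
```

The fraction is already below 0.25% at n = 10, which is the sense in which high dimensional space is mostly empty.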
  • 19. 4 Figure 1.2: Fractional volume of a hypersphere inscribed in a hypercube decreases as the dimension increases (modified after Jimenez and Landgrebe, 1995). 1.1.3 Hyperspectral imaging Hyperspectral imaging collects and processes information using the electromagnetic spectrum. Hyperspectral imagery can discriminate between many types of earth objects which may appear as the same color to the human eye. Hyperspectral sensors look at objects using a vast portion of the electromagnetic spectrum. The whole process of hyperspectral imaging can be divided into three steps: preprocessing, radiance-to-reflectance transformation and data analysis (Varshney and Arora, 2004). In particular, preprocessing is required to convert the raw recorded values to sensor radiance. The preprocessing steps contain operations such as spectral calibration, geometric correction, geo-coding, signal-to-noise adjustment, etc. The radiometric and geometric accuracy of hyperspectral data differs significantly from one band to another (Varshney and Arora, 2004).
  • 20. 5    1.2 What is classification? Classification means to put data into groups according to their characteristics. In the case of spectral classification, the areas of the image that have similar spectral reflectance are put into same group or class (Abhinav, 2009). Classification is also seen as a means of compressing image data by reducing the large range of digital number (DN) in several spectral bands to a few classes in a single image. Classification reduces this large spectral space into relatively few regions and obviously results in loss of numerical information from the original image. Depending on the availability of information of the region which is imaged, supervised or unsupervised classification methods are performed. 1.2.1 Difficulties in hyperspectral data classification Though it is possible that HD can provide a high accuracy thematic map than MS data, there are some difficulties in classification in case of high dimensional data as listed below: 1. Curse of dimensionality and Hughes phenomenon: It says that when the dimensionality of data set increases with the number of bands, the number of training pixels (TP) required for training a specific classifier should be increased as well to achieve the desired accuracy for classification. It becomes very difficult and expensive to obtain large number of TP for each sub class. This has been termed as “curse of dimensionality” by Bellman (1960), which leads to the concept of “Hughes phenomenon” (Hughes, 1968). 2. Characteristics of high dimensional space: The characteristics of high dimensional space have been discussed in above section (Sec. 1.1.2). For those reasons, the algorithms that are used to classify the multispectral data often fail for hyperspectral data. 3. Large number of highly correlated bands: Hyperspectral sensor uses the large number of contiguous spectral bands. Therefore, among these bands, some bands are highly correlated. These correlated bands do not provide good result in classification. Therefore, the important task is to
  • 21. 6    select the uncorrelated bands or make the bands uncorrelated, applying feature reduction algorithms (Varshney and Arora, 2004). 4. Optimum number of feature: It is very critical to select the optimum number of bands out of large number of bands (e.g. 224 bands for AVIRIS image) to use in classification. Till today there are no suitable algorithms or any rule for selection of optimal number of features. 5. Large data size and high processing time due to complexity of classifier: Hyperspectral imaging system provides large amount of data. So large memory and powerful system is necessary to store and handle the data, generally which is very expensive. 1.3 Background of work This thesis work is the extension of work done by Abhinav Garg (2009) in his M.Tech thesis. In his thesis, he showed that among the conventional classifiers (gaussian maximum likelihood (GML), spectral angle mapper (SAM) and FISHER), GML provides the best result. The performance of GML is improved significantly after applying feature extraction (FE) techniques. Principal component analysis (PCA) was found to be working best, among all FE techniques (discriminant analysis FE (DAFE), decision boundary FE (DBFE), non-parametric weighted FE (NWFE) and independent component analysis(ICA)), in improving classification accuracy of GML. For the advance classifier, SVM’s result does not depend on the choice of parameters but ANN’s does. He also showed SVM’s result was improved by using PCA and ICA techniques while the supervised FE techniques like NWFE and DBFE failed to improve it significantly. He showed some drawbacks for advanced classifier like SVM and suggested some FE techniques which may improve the result for conventional classifier (CC) as well as advanced classifier (AC). However, for large TP (e.g. 300 per class) SVM takes more processing time than small size of TP. The objectives of this thesis work are to sort out these problems and to find the best FE technique, which will improve the classification result for HD. In next article, the objective of this thesis work has been described. .
  • 22. 7    1.4 Objectives This thesis has investigated the following two objectives pertaining to classification with hyperspectral data: Objective-1: To evaluate various FE techniques for classification of hyperspectral data. Objective-2 To study the extent to which advance classifier can reduce problems related to classification of hyperspectral data. 1.5 Study area and data set used The study area for this research is located within an area known as 'La Mancha Alta' covering approximately 8000 sq. km to the south of Madrid, Spain (Fig. 1.4). The area is mainly used for cultivation of wheat, barley and other crops such as vines and olives. HD is acquired by DAIS 7915 airborne imaging spectrometer on 29th June, 2000, at 5 m resolution. Data was collected over 79 wavebands ranging from 0.4 μm to 12.5 μm with an exception of 1.1 μm to 1.4 μm. The first 72 bands in the wavelength range 0.4 μm to 2.5 μm were selected for further analysis (Pal, 2002). Striping problems were observed between bands 41 and 72. All the 72 bands were visually examined and 7 bands (41, 42 and 68 to 72) were found useless due to very severe stripping and were removed. Finally 65 bands were retained and an area of 512 pixels by 512 pixels covering the area of interest was extracted (Abhinav, 2009). The data set available for this research work includes the 65 (retained after pre-processing) bands data and the reference image, generated with the help of field data collected by local farmers as briefed in Pal (2002). The area included in imagery was found to be divided into eight different land cover types, namely wheat, water body, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture lands and built up area.  
  • 23. 8    Figure 1.3: Study area in La Mancha region, Madrid, Spain (Pal, 2002) Figure 1.4: FCC obtained by first 3 principal components and superimposed reference image showing training data available for classes identified for study area (Pal, 2002).
  • 24. 9 Figure 1.5: Google earth image of study area (Google earth, 2007) 1.6 Software details For the processing of HD, a very powerful system is required due to the size of the data set and the complexity of the algorithms. The machine used for this thesis work has a 2.16 GHz Intel processor with 2 GB RAM and runs the Windows 7 operating system. Matlab 7.8.0 (R2009a) was used for coding the different algorithms. All the results were obtained on this same machine so that the different algorithms can be compared fairly. 1.7 Structure of thesis The present thesis is organized into six chapters. Chapter 1 focuses on the characteristics of high dimensional space, the challenges of HD classification and an outline of the experiments of this thesis work. It also discusses the study region, data set and software used in this thesis work. Chapter 2 presents a detailed description of HD classification and the previous research related to this domain. Chapter 3 describes the detailed mathematical background of the different processes used in this work. Chapter 4 outlines the detailed methodology carried out for this thesis work. Chapter 5 presents the experiments conducted for this thesis, followed by their interpretation. Chapter 6 provides the conclusions of the present work and the scope for future work.
  • 25. 10 CHAPTER 2 LITERATURE REVIEW This chapter outlines the important research work and major achievements in the field of high dimensional data analysis and data classification. The chapter begins with some of the FE techniques and classification approaches suggested by various researchers for solving problems related to HD classification. The results of useful experiments with HD are also included, in tabulated form, to highlight the usefulness and reliability of these approaches. Some other issues related to the classification of HD are discussed at the end of this chapter. 2.1 Dimensionality reduction by feature extraction Swain and Davis (1978) mentioned details of various separability measures for multivariate normal class models. Various statistical classes are found to be overlapping, which causes misclassification errors, as most classifiers use a decision boundary approach for classification. The idea was to obtain a separability measure which could give an overall estimate of the range of classification accuracies that can be achieved by using a sub-set of selected features, so that the sub-set of features corresponding to the highest classification accuracy can be selected for classification (Abhinav, 2009). FE is the process of transforming the given data from a higher dimensional space to a lower dimensional space while conserving the underlying information (Fukunaga, 1990). The philosophy behind such a transformation is to re-distribute the underlying information spread over the high dimensional space by containing it in a comparatively smaller number of dimensions without losing a significant amount of useful information. FE techniques, in the case of classification, try to enhance class separability while reducing data dimensionality (Abhinav, 2009).
  • 26. 11    2.1.1 Segmented principal component analysis (SPCA) The principal component transform (PCT) has been successfully applied in multispectral data for feature reduction. Also it can be used as the tool of image enhancement and digital change detection (Lodwick, 1979). For the case of dimension reduction of HD, PCA outperforms those FE techniques which are based on class statistics (Muasher and Landgrebe, 1983). Further, as the number of TP is limited and ratio to the number of dimension is low for HD, class covariance matrix cannot be estimated properly. To overcome these problems Jia (1996) proposed the scheme for segmented principal component analysis (SPCA) which applies PCT on each of the highly correlated blocks of bands. This approach also reduces the processing time by converting the complete set of bands into several highly correlated bands. Jensen and James (1999) proposed that the SPCA-based compression generally outperforms PCA-based compression in terms of high detection and classification accuracy on decompressed HD. PCA works efficiently for the highly correlated data set but SPCA works efficiently for both high correlated as well as low correlated data sets (Jia, 1996). Jia (1996) compared SPCA and PCA extracted features for target detection and concluded SPCA as a better FE technique than PCA. She also showed that both feature extracted data sets are identical and there is no loss of variance in the middle stages, as long as no components are removed. 2.1.2 Projection pursuit (PP)   Projection pursuit (PP) methods were originally posed and experimented by Kruskal (1969, 1972). PP approach was implemented successfully first by Friedman and Tukey (1974). They described PP as a way of searching for and exploring nonlinear structure in multi-dimensional data by examining many 2-D projections. Their goal was to find interesting views of high dimensional data set. The next stages in the development of the technique were presented by Jones (1983) who, amongst other things, developed a projection index based on polynomial moments of the data. Huber (1985) presented several aspects of PP, including the design of projection indices. Friedman (1987) derived a transformed projection index. Hall (1989) developed an index using methods similar to Friedman, and also developed
  • 27. 12 theoretical notions of the convergence of PP solutions. Posse (1995a, 1995b) introduced a projection index called the chi-square projection pursuit index. Posse (1995a, 1995b) used a random search method to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to get a sequence of interesting 2-D projections. Each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. Most recently, the PP technique has also been used to obtain 1-D projections (Martinez, 2005). In this research work, Posse's method, which reduces an n-dimensional data set to 2-dimensional projections, is followed. 2.1.3 Orthogonal subspace projection (OSP) Harsanyi and Chang (1994) proposed the orthogonal subspace projection (OSP) method, which simultaneously reduces the data dimensionality, suppresses undesired or interfering spectral signatures, and detects the presence of a spectral signature of interest. The concept is to project each pixel vector onto a subspace which is orthogonal to the undesired signatures. For OSP to be effective, the number of bands must not be less than the number of signatures, which is a major limitation for multispectral images. To overcome this, Ren and Chang (2000) presented the Generalized OSP (GOSP) method, which relaxes this constraint in such a manner that OSP can be extended to multispectral image processing in an unsupervised fashion. OSP can be used to classify hyperspectral images (Lentilucci, 2001) and also for magnetic resonance image classification (Wang et al., 2001). 2.1.4 Kernel principal component analysis (KPCA) Linear PCA cannot always detect all of the structure in a given data set. By using a suitable nonlinear feature extractor, more information can be extracted from the data set. Kernel principal component analysis (KPCA) can be used as a strong nonlinear FE method (Scholkopf and Smola, 2002) which maps the input vectors to a feature space; PCA is then applied to the mapped vectors. KPCA is also a powerful preprocessing step for classification algorithms (Mika et al., 1998). Rosipal et al. (2001) proposed the application of the KPCA technique for feature selection in a high-dimensional feature space where input variables were mapped by
  • 28. 13    a Gaussian kernel. In contrast to linear PCA, KPCA is capable of capturing part of the higher-order statistics. To obtain this higher-order statistics, a large number of TP is required. This causes problems for KPCA, since KPCA requires storing and manipulating the kernel matrix whose size is the square of the number of TP. To overcome this problem, a new iterative algorithm for KPCA, the Kernel Hebbian Algorithm (KHA) was introduced by (Scholkopf et. al., 2005). 2.2 Parametric classifiers Parametric classifiers (Fukunaga, 1990) require some parameters to develop the assumed density function model for the given data. These parameters are computed with the help of a set of already classified or labeled data points called training data. It is a subset of given data for which the class labels are known and is chosen by sampling techniques (Abhinav, 2009). It is used to compute some class statistics to obtain the assumed density function for each class. Such classes are referred to as statistical classes (Richards and Jia, 2006) as these are dependent upon the training data and may differ from the actual classes. 2.2.1 Gaussian maximum likelihood (GML) Maximum likelihood method is based on the assumption that the frequency distribution of the class membership can be approximated by the multivariate normal probability distribution (Mather, 1987). Gaussian Maximum Likelihood (GML) is one of the most popular parametric classifiers that has been used conventionally for purpose of classification of remotely sensed data (Landgrebe, 2003). The advantages of GML classification method are that, it can obtain minimum classification error under the assumption that the spectral data of each class is normally distributed and it not only considers the class centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrix of the clusters (Lillesand et al., 2002). Lee and Landgrebe (1993) compared the result of GML classifier on PCA and DBFE feature extracted data set and concluded that DBFE feature extracted data set provides better accuracy than PCA feature extracted data set. NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor
  • 29. 14    and GML classifiers by Kuo and Landgrebe (2004). They concluded that NWFE is better FE technique than DAFE. Abhinav (2009) investigated the effect of PCA, ICA, DAFE, DBFE and NWFE feature extracted data set on GML classifier. He showed that PCA is the best FE technique for HD among the other mentioned feature extractor for GML classifier.  He also suggested that some FE techniques like KPCA, OSP, SPCA, PP may improve the classification result using GML classifier. 2.3 Non–parametric classifiers The non–parametric classifiers (Fukunaga, 1990) uses some control parameters, carefully chosen by the user, to estimate the best fitting function by using an iterative or learning algorithm. They may or may not require any training data for estimating the PDF. Parzen window (Parzen, 1962) and k–nearest neighbor (KNN) (Cover and Hart, 1967) are two popular working classifiers under this category. Edward (1972) gave brief descriptions of many non-parametric approaches for estimation of data density functions. 2.3.1 KNN KNN algorithm (Fix and Hodges, 1951) has proven to be effective in pattern recognition. The technique can achieve high classification accuracy in problems which have unknown and non-normal distributions. However, it has a major drawback that a large amount of TP is required in the classifiers resulting in high computational complexity for classification (Hwang and Wen, 1998). Pechenizkiy (2005) compared the performance of KNN classifier on the PCA and random projection (RP) feature extracted data set. He concluded that KNN performs well on PCA feature extracted data set. Zhu et. al. (2007) showed that the KNN works better on the ICA feature extracted data set than the original data set (OD) (OD was captured by Hyperspectral imaging system developed by the ISL). ICA- KNN method with a few wavelengths had the same performance as the KNN classifier alone using information from all wavelengths. Some more non–parametric classifiers based on geometrical approaches of data classification were found during literature survey. These approaches consider the data points to be located in the Euclidean space and exploit the geometrical patterns of the data points for classification. Such approaches are grouped into a new class of
  • 30. 15    classifiers known as machine learning techniques. Support Vector Machines (SVM) (Boser et al., 1992), k-nearest neighborhood (KNN) (Fix and Hudges, 1956) are among the popular classifiers of this kind. These do not make any assumptions regarding data density function or the discriminating functions and hence are purely non– parametric classifiers. However, these classifiers also need to be trained using the training data. 2.3.2 SVM SVM has been considered as advance classifier. SVM is a new generation of classification techniques based on Statistical Learning Theory having its origins in Machine Learning and introduced by Boser, Vapnik and Guyon (1992). Vapnik (1995, 1998) discussed SVM based classification in detail. SVM tends to improve learning by empirical risk minimization (ERM) to minimize learning error and to minimize the upper bound on the overall expected classification error by structural risk minimization (SRM). SVM makes use of principle of optimal separation of classes to find a separating hyperplane that separates classes of interest to maximum extent by maximizing the margin between the classes (Vapnik, 1992). This technique is different from that of estimation of effective decision boundaries used by Bayesian classifiers as only data vectors near to the decision boundary (also known as support vectors) are required to find the optimal hyperplane. A linear hyperplane may not be enough to classify the given data set without error. In such cases, data is transformed to a higher dimensional space using a non–linear transformation that spreads the data apart such that a linear separating hyperplane may be found. Kernel functions are used to reduce the computational complexity that arises due to increased dimensionality (Varshney and Arora, 2004). Advantages of SVM (Varshney and Arora, 2004) lie in their high generalization capability and ability to adapt their learning characteristics by using kernel functions due to which they can adequately classify data on a high–dimensional feature space with a limited number of training data sets and are not affected by the Hughes phenomenon and other affects of dimensionality. The ability to classify using even limited number of training samples make SVM as a very powerful classification tool for remotely sensed data. Thus, SVM has the potential to produce accurate classifications from HD with limited number of training samples. SVMs are believed
  • 31. 16    to be better learning machines than neural networks, which tends to overfit classes causing misclassification (Abhinav, 2009), as they rely on margin maximization rather than finding a decision boundary directly from the training samples. For conventional SVM an optimizer is used based on quadratic programming (QP) or linear programming (LP) methods to solve the optimization problem. The major disadvantage of QP algorithm is the storage requirement of kernel matrix in the memory. When the size of the kernel matrix is large enough, it requires huge memory that may not be always available. To overcome this Benett and Campbell (2000) suggested an optimization method which sequentially updates the Lagrange multipliers called the kernel adatron (KA) algorithm. Another approach was decomposition method which updates the Lagrange multipliers in parallel since they update many parameters in each iteration unlike other methods that update parameter at a time (Varshney and Arora, 2004). QP optimizer is used here which updates lagrange multipliers on the fixed size working data set. Decomposition method uses QP or LP optimizer to solve the problem of huge data set by considering many small data sets rather than a single huge data set (Varshney, 2001). The sequential minimal optimization (SMO) algorithm (Platt, 1999) is a special case of decomposition method when the size of working data set is fixed such that an analytical solution can be derived in very few numerical operations. This does not use the QP or LP optimization methods. This method needs more number of iterations but requires a small number of operations thus results in an increase in optimization speed for very large data set. The speed of SVM classification decreases as the number of support vectors (SV) decreases. By using kernel mapping, different SVM algorithms have successfully incorporated effective and flexible nonlinear models. There are some major difficulties for large data set due to calculation of nonlinear kernel matrix. To overcome the computational difficulties, some authors have proposed low rank approximation to the full kernel matrix (Wiens, 92). As an alternative, Lee and Mangasarian (2002) have proposed the method of reduced support vector machine (RSVM) which reduces the size of the kernel matrix. But there was a problem of selecting the number of support vectors (SV). In 2009, Sundaram proposed a method which will reduce the number of SV through the application of KPCA. This method is different from other
  • 32. 17    proposed method as the exact choice of support vector is not important as long as the vector spanned a fixed subspace. Benediktsson et al (2000) applied KPCA on the ROSIS-03 data set. Then he used linear SVM on the feature extracted data set and showed that KPCA features are more linearly separable than the features extracted by conventional PCA. Shah et al (2003) compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE FE techniques on AVIRIS data set and concluded that SVM  gives higher accuracies than GML and ANN for full dimensionality but poor accuracies for features extracted by DAFE and DBFE. Abhinav (2009) compared SVM, GML and ANN with OD and PCA, ICA, NWFE, DBFE, DAFE feature extracted data set. He concluded that SVM provides better result for OD than GML. SVM works best with PCA and ICA feature extracted data set where ANN works better with DBFE and NWFE feature extracted data set. The works done by various researchers with different hyperspectral data sets using different classifiers and FE methods and the results obtained by them is summarized in Table 2.1.  
  • 33. 18    Table 2.1: Summary of literature review Author Dataset used Method used Results obtained Lee and Landgrebe (1993) Field Spectrometer System (airborne hyperspectral sensor) GML classifier is used to compare classification accuracies obtained by DBFE and PCA FE Features extracted by DBFE produces better classification accuracies than those obtained from PCA and Bhattacharya feature selection methods. Jimenez and Landgrebe (1998) Stimulated and real AVIRIS data Hyperspectral data characteristics were studied with respect to effects of dimensionality, order of data statistics used on supervised classification techniques. Hughes phenomenon was observed as an effect of dimensionality and classification accuracy was observed to be increasing with use of higher statistics order. But lower order statistics were observed to be less affected by Hughes phenomenon. Benediktsson et al (2001) ROSIS-03 KPCA and PCA feature extracted data set was used for classification using linear SVM. KPCA features are more linearly separable than features extracted by conventional PCA. Shah et al. (2003) AVIRIS Compared SVM, GML and ANN classifiers for accuracies at full dimensionality and using DAFE and DBFE feature extraction techniques SVM was found to be giving higher accuracies than GML and ANN for full dimensionality but poor accuracies were obtained for features extracted by DAFE and DBFE. Kuo and Landgrebe (2004) Stimulated and real data (HYDICE image of DC mall, Washington, US) NWFE and DAFE FE techniques were compared for classification accuracy achieved by nearest neighbor and GML classifiers. NWFE was found to be producing better classification accuracies than DAFE. Pechenizkiy (2005) 20 data sets with different characteristics were taken from the UCI machine learning repository. KNN classifier was used to compare classification accuracies obtained by PCA and Random Projection FE PCA gave the better result than Random Projection Zhu et al (2007) Hyperspectral imaging system developed by ISL. ICA ranking methods were used to select the optimal wave length the KNN was used. Then KNN alone was used. ICA-KNN method with a few band had the same performance as the KNN classifier alone using all bands. Sundaram (2009) The adult dataset ,part of UCI Machine Learning Repository KPCA was applied in the support vector, then usual SVM algorithm is used Significantly reduce the processing time without effecting the classification accuracy
  • 34. 19    Abhinav (2009) DAIS 7915 GML, SAM, MDM classification techniques were used on the PCA, ICA, NWFE, DBFE and DAFE feature extracted data set GML was the best among the other techniques and performs best on PCA extracted data set. Abhinav (2009) DAIS 7915 SVM and GML classification techniques were used on the OD and PCA, ICA, NWFE, DBFE and DAFE feature extracted data set to compare the accuracy GML performed very low in OD than SVM. SVM provide better accuracy than GML. SVM performs better on PCA and ICA extracted data set. 2.4 Conclusions from literature review   1. From Table 2.1, it can be easily concluded that the FE techniques like PCA, ICA, DAFE, DBFE and NWFE perform well in improving the classification accuracies when used with GML. But the features extracted by DBFE and DAFE failed to improve results obtained by SVM implying a limitation of these techniques for the advance classifiers. KNN works best with PCA and ICA feature extracted data set. However, in the surveyed literature the effects of PP, SPCA, KPCA and OSP extracted features on classification accuracy obtained from the advance classifiers like SVM, parametric classifier like GML and nonparametric classifier KNN have not been observed. 2. Another important aspect found missing in the literature is the comparison of classification time for SVM classifiers because SVM takes long time for training using large TP. It was seen that many approach of SVM were proposed to reduce the classification time but there is no conclusion for the best SVM algorithm depending on classification accuracy and processing time. 3. Although KNN is effective classification technique for HD, there is no guideline for classification time or suggestion of best FE techniques for KNN classifier. Also the effect of different parameters like number of nearest neighbor, number of TP, number of bands is not suggested for KNN.
  • 35. 20    4. During the literature survey, it is further found that there is no suggestion for the best FE techniques for different SVM algorithms, GML and KNN. Such missing aspects will be investigated in this thesis work and the guidelines to choose an efficient and less time consuming classification technique shall be presented as the result of this research.     This chapter presented the FE and classification techniques for mitigating the effects of dimensionality. These techniques were result of different approaches used to deal with the problem of high dimensionality and improving performance of advance, parametric and nonparametric classifier. The approaches were applied on real life HD and comparative results as reported in literature were compiled and presented here. In addition, the important aspects found missing in the literature survey were highlighted which this thesis work shall try to investigate. The mathematical rationale and algorithms used to apply these techniques will be discussed in detail in the next chapter.  
  • 36. 21 CHAPTER 3 MATHEMATICAL BACKGROUND This chapter provides the detailed mathematical background of each of the techniques used in this thesis. Starting with some basic concepts of kernels and kernel space, the chapter describes the unsupervised and supervised FE techniques, followed by the classification and optimization rules for the supervised classifiers. Finally, the scheme for statistical analysis that has been used for comparing the results of different classification techniques is discussed. The notation followed in this chapter for matrices and vectors is given below:
$X$ : a two-dimensional matrix whose columns represent the data points ($m$) and rows represent the number of bands ($n$), i.e. $X = [X]_{n \times m}$.
$x_i$ : an $n$-dimensional single-pixel column vector, where $X = [x_1, x_2, \ldots, x_m]$ and $x_i = [x_{1i}, x_{2i}, \ldots, x_{ni}]^T$.
$c_j$ : the $j$th class.
$\Phi(z)$ : the mapping of the input vector $z$ into kernel space, using some kernel function.
$\langle a, b \rangle$ : the inner product of the vectors $a$ and $b$.
$\in$ : belongs to.
$\mathbb{R}^n$ : the set of $n$-dimensional real vectors.
$N$ : the set of natural numbers.
$[\cdot]^T$ : the transpose of a matrix.
$\forall$ : for all.
3.1 What is kernel? Before defining a kernel, let us look at the following two definitions: • Input space: the space where the data points originally lie.
  • 37. 22 • Feature space: the space spanned by the transformed data points (from the original space) which were mapped by some function.
A kernel is the dot product in a feature space $H$ reached via a map $\Phi$ from the input space, such that $\Phi : X \to H$. A kernel can be defined as $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, where $x, x'$ are elements of the input space, $\Phi(x), \Phi(x')$ are elements of the feature space, $k$ is called the kernel and $\Phi$ is called the feature map associated with $k$. $\Phi$ can also be called the kernel function. The space containing these dot products is called the kernel space. This is a nonlinear mapping from input space to feature space which increases the internal distance between two points in a data set. This means that a data set which is nonlinearly separable in the input space can become linearly separable in the kernel space. A few definitions related to kernels are given below:
Gram matrix: Given a kernel $k$ and inputs $x_1, x_2, \ldots, x_n \in X$, the $n \times n$ matrix $K := (k(x_i, x_j))_{ij}$ is called the Gram matrix of $k$ with respect to $x_1, x_2, \ldots, x_n$.
Positive definite matrix: A real $n \times n$ symmetric matrix $K$ satisfying $x_1^T K x_1 \geq 0$ for all $x_1 = (x_{11}, x_{21}, \ldots, x_{n1})^T \in \mathbb{R}^n$ is called positive definite ($x_1$ is a column vector). If equality in the previous expression occurs only for $x_{11} = x_{21} = \ldots = x_{n1} = 0$, then the matrix is called strictly positive definite.
Positive definite kernel: Let $X$ be a nonempty set. A function $k : X \times X \to \mathbb{R}$ which, $\forall n \in N$ and $x_i \in X$, $i \in N$, gives rise to a positive definite Gram matrix is called a positive definite kernel. A function $k : X \times X \to \mathbb{R}$ which, $\forall n \in N$ and distinct $x_i \in X$, gives rise to a strictly positive definite Gram matrix is called a strictly positive definite kernel.
Definitions of some commonly used kernel functions are shown in Table 3.1.
  • 38. 23 Table 3.1: Examples of common kernel functions (modified after Varshney and Arora, 2004)
Linear: $K(x, x_i) = x \cdot x_i$; performance depends on whether the decision boundary is linear or non-linear.
Polynomial of degree $n$: $K(x, x_i) = (x \cdot x_i + 1)^n$, where $n$ is a positive integer; performance depends on the user-defined parameter.
Radial basis function: $K(x, x_i) = \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right)$, where $\sigma$ is a user-defined value; performance depends on the user-defined parameter.
Sigmoid: $K(x, x_i) = \tanh(k(x \cdot x_i) + \Theta)$, where $k$ and $\Theta$ are user-defined parameters; performance depends on the user-defined parameters.
All the above definitions are explained with the following simple example. Let
$$X = [x_1\; x_2\; x_3] = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 3 \\ 1 & 1 & 3 \end{bmatrix}$$
be a matrix in the input space whose columns ($x_i$, $i = 1, 2, 3$) are the data points and whose rows are the dimensions of the data points. Let this matrix be mapped into the feature space using the Gaussian kernel function (with $\sigma = 1$), and let $\langle x_i, x_j \rangle$ denote the inner product of the columns of $X$ in that feature space. Then the Gram matrix (kernel matrix) $K$ takes precisely the form
$$K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \langle x_1, x_3 \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \langle x_2, x_3 \rangle \\ \langle x_3, x_1 \rangle & \langle x_3, x_2 \rangle & \langle x_3, x_3 \rangle \end{bmatrix}$$
The numerical value of the matrix $K$ is
$$K = \begin{bmatrix} 1.0000 & 0.0498 & 0.0821 \\ 0.0498 & 1.0000 & 0.6065 \\ 0.0821 & 0.6065 & 1.0000 \end{bmatrix}$$
$K$ is a symmetric matrix. If the Gram matrix turns out to be positive definite, the kernel is called a positive definite kernel, and if it is strictly positive definite, the kernel is called a strictly positive definite kernel.
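The Gram matrix above is easy to reproduce. The sketch below (Python/NumPy, shown only for illustration; the thesis computations were done in MATLAB) applies the radial basis function kernel of Table 3.1 to the three columns of $X$; a kernel width of $\sigma = 1$ is assumed, since that is the value which yields the numbers quoted above.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (radial basis function) kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

# Columns of X are the data points x1, x2, x3; rows are the dimensions (bands).
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 3.0]])

m = X.shape[1]
K = np.array([[rbf_kernel(X[:, i], X[:, j]) for j in range(m)] for i in range(m)])
print(np.round(K, 4))
# [[1.     0.0498 0.0821]
#  [0.0498 1.     0.6065]
#  [0.0821 0.6065 1.    ]]
```

As expected, $K$ is symmetric with a unit diagonal, and because it is generated by an RBF kernel on distinct points it is positive definite.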
  • 39. 24 3.2 Feature extraction techniques FE techniques are based on the simple assumption that a given data sample ($x \in X \subset \mathbb{R}^n$), belonging to an unknown probability distribution in $n$-dimensional space, can be represented by some coordinate system in an $m$-dimensional space (Carreira-Perpinan, 1997). Thus, FE techniques aim at finding an optimal coordinate system such that, when the data points from the higher dimensional space are projected onto it, a dimensionally compact representation of these data points is obtained. There are two main conditions for an optimal dimension reduction (Carreira-Perpinan, 1997): (i) elimination of dimensions with very low information content, since features with low information content can be discarded as noise; (ii) removal of redundancy among the dimensions of the data space, i.e. the reduced feature set should be spanned by orthogonal vectors. Both unsupervised and supervised FE techniques have been investigated in this research work (Figure 3.1). For the unsupervised approach, segmented principal component analysis (SPCA) and projection pursuit (PP) are used; for the supervised FE techniques, kernel principal component analysis (KPCA) and orthogonal subspace projection (OSP) are used. The next sub-sections discuss the assumptions used by these FE techniques in detail. Figure 3.1: Overview of FE methods
  • 40. 25 3.2.1 Segmented principal component analysis (SPCA) The principal component transform (PCT) has been successfully applied in multispectral data analysis and is a powerful tool for FE. For hyperspectral image data, PCT outperforms those FE techniques which are based on class statistics. The main advantage of using a PCT is that global statistics are used to determine the transform functions. Implementing the PCT on a high dimensional data set, however, carries a high computational load. SPCA can overcome the problem of long processing time by partitioning the complete data set into several highly correlated subgroups (Jia, 1996). The complete set of bands is first partitioned into $K$ subgroups according to the correlation between bands. From the correlation image of the HD, it can be seen that blocks are formed from highly correlated bands (Figure 3.2); these blocks are selected as the subgroups. Let $n_1, n_2, \ldots, n_K$ be the number of bands in subgroups $1, 2, \ldots, K$ respectively (Figure 3.2a). The PCT is then applied to each subgroup of the data. After applying the PCT to each subgroup, significant features are selected using the variance information of each component. The PCs which together contain about 99% of the variance are chosen for each block; the selected features can then be regrouped and transformed again to compress the data further.
  • 41. 26 Figure 3.2: Formation of blocks for SPCA. Here, three blocks containing 32, 6 and 27 bands respectively, corresponding to highly correlated bands, have been formed from the correlation image of HYDICE hyperspectral sensor data.
Segmented PCT retains all of the variance, as with the conventional PCT. No information is lost whether the transformation is applied to the complete vector at once or to a few sub-vectors separately (Jia, 1996). When the new components obtained from each segmented PCT are gathered and transformed again, the resulting data variance and covariance are identical to those of the conventional PCT. The main effect is that the data compression rate is lower in the middle stages compared to the case without segmentation. However, the difference in compression rate is relatively small if the segmented transformation is applied to subgroups that have poor correlation with each other.
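The segmentation and the per-block PCT described above can be summarised in a short sketch. The code below (Python/NumPy, illustrative only; the thesis implementation was in MATLAB) assumes the band blocks have already been identified from the correlation image, for example the 32, 6 and 27 band grouping of Figure 3.2, and it runs on a synthetic pixels-by-bands matrix: within each block it applies PCA and keeps the leading components that explain about 99% of the block variance, after which the retained features are regrouped. The optional second-stage transform on the regrouped features is omitted here.

```python
import numpy as np

def pca_block(X_block, var_keep=0.99):
    """PCA on one block of bands; returns the scores of the leading components
    whose cumulative variance reaches `var_keep` of the block's total variance."""
    Xc = X_block - X_block.mean(axis=0)            # centre each band
    cov = np.cov(Xc, rowvar=False)                 # band-by-band covariance of the block
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]               # re-order: largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    cum = np.cumsum(eigval) / np.sum(eigval)
    n_keep = int(np.searchsorted(cum, var_keep)) + 1
    return Xc @ eigvec[:, :n_keep]                 # principal component scores

def segmented_pca(X, block_sizes, var_keep=0.99):
    """Segmented PCA: apply the PCT block by block, then regroup the retained features."""
    features, start = [], 0
    for size in block_sizes:
        features.append(pca_block(X[:, start:start + size], var_keep))
        start += size
    return np.hstack(features)

# Synthetic stand-in for a pixels-by-bands matrix, with the block sizes of Figure 3.2.
X = np.random.rand(10000, 65)
Z = segmented_pca(X, block_sizes=[32, 6, 27])
print(Z.shape)        # (10000, number of retained features)
```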
  • 42. 27    Figure 3.2a: Chart of multilayered segmented PCA 3.2.2 Projection pursuit (PP) Projection pursuit (PP) refers to a technique first described by Friedman and Tukey (1974) for exploring the nonlinear structure of high dimensional data sets by means of selected low dimensional linear projections. To reach this goal, an objective function is assigned, called projection index, to every projection characterizing the structure present in the projection. Interesting projections are then automatically picked up by optimizing the projection index numerically. The notion of interesting projections has usually been defined as the ones exhibiting departure from normality (normal distribution function) (Diaconis and Freedman, 1984; Huber, 1985). Posse (1990) proposed an algorithm based on a random search and a chi- squared projection index for finding the most interesting plane (two-dimensional view). The optimization method was able to locate in general the global maximum of the projection index over all two-dimensional projections (Posse, 1995). The chi- squared index was efficient, being fast to compute and sensitive to departure from normality in the core rather than in the tail of the distribution. In this investigation only chi-squared (Posse, 1995a, 1995b) projection index has been used.
Projection pursuit exploratory data analysis (PPEDA) consists of the following two parts:
(i) a projection pursuit index that measures the degree of departure from normality, and
(ii) a method for finding the projection that yields the highest value of the index.
Posse (1995a, 1995b) used a random search to locate a plane with an optimal value of the projection index and combined it with the structure removal of Friedman (1987) to obtain a sequence of interesting 2-D projections. The interesting projections are found in decreasing order of the value of the PP index, which implies that each projection found in this manner shows a structure that is less important (in terms of the projection index) than the previous one. In the following discussion the chi-squared PP index is described first, followed by the structure finding procedure. Finally, the structure removal procedure is illustrated.

3.2.2.1 Posse chi-square index

Posse proposed an index based on the chi-square statistic. The plane is first divided into 48 regions or boxes $B_k$, $k = 1, 2, \ldots, 48$, distributed in the form of rings (Figure 3.3). The inner boxes have the same radial width $R/5$ and all boxes have the same angular width of $45^{\circ}$. $R$ is chosen so that the boxes have approximately the same weight under normally distributed data, $R = (2\log 6)^{1/2}$, and the outer boxes have weight $1/48$ under normally distributed data. This choice of radial width provides regions with approximately the same probability under the standard bivariate normal distribution (Martinez, 2001). The projection index is given as

$$PI_{\chi^2}(\alpha, \beta) = \frac{1}{9} \sum_{j=0}^{8} \sum_{k=1}^{48} \frac{1}{c_k} \left[ \frac{1}{n} \sum_{i=1}^{n} I_{B_k}\!\left( z_i^{\alpha(\lambda_j)}, z_i^{\beta(\lambda_j)} \right) - c_k \right]^2 \qquad (3.1)$$

where
$\phi$ — the standard bivariate normal density,
$c_k$ — the probability over the kth region under the normal density, $c_k = \iint_{B_k} \phi \, dz_1 \, dz_2$,
$B_k$ — the kth box in the projection plane,
$\lambda_j$ — the angle $\lambda_j = j\pi/36$, $j = 0, \ldots, 8$, by which the data are rotated in the plane before being assigned to regions,
$\alpha, \beta$ — orthonormal p-dimensional vectors spanning the projection plane (these can be the first two PCs or two randomly chosen pixels of the OD set),
$P(\alpha, \beta)$ — the plane spanned by the two orthonormal vectors $\alpha, \beta$,
$z_i^{\alpha}, z_i^{\beta}$ — the sphered observations projected onto the vectors $\alpha$ and $\beta$, i.e. $z_i^{\alpha} = z_i^T \alpha$ and $z_i^{\beta} = z_i^T \beta$,
$\alpha(\lambda_j)$ — $\alpha \cos\lambda_j - \beta \sin\lambda_j$,
$\beta(\lambda_j)$ — $\alpha \sin\lambda_j + \beta \cos\lambda_j$,
$I_{B_k}$ — the indicator function of region $B_k$,
$PI_{\chi^2}(\alpha, \beta)$ — the chi-square projection index evaluated using the data projected onto the plane spanned by $\alpha$ and $\beta$.

The chi-square projection index is not affected by the presence of outliers. However, it is sensitive to distributions that have a hole in the core, and it will also yield projections that contain clusters. The chi-square projection pursuit index is fast and easy to compute, making it appropriate for large sample sizes. Posse (1995a) provides a formula to approximate the percentiles of the chi-square index.
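A schematic numpy sketch of this index is given below. The exact box layout (five inner rings of width R/5 times eight 45° sectors plus eight outer sectors) and the Monte-Carlo estimation of the box weights c_k are assumptions made for illustration; the value of R follows the text above.

import numpy as np

RNG = np.random.default_rng(0)
R = np.sqrt(2 * np.log(6))          # assumed value of R; inner radial width is R/5

def box_index(z1, z2):
    """Map 2-D points to the 48 regions: 5 inner rings x 8 sectors of 45 deg, plus 8 outer sectors."""
    r = np.hypot(z1, z2)
    ring = np.minimum((r / (R / 5)).astype(int), 5)                     # 0..4 inner rings, 5 = outer
    sector = ((np.arctan2(z2, z1) % (2 * np.pi)) / (np.pi / 4)).astype(int) % 8
    return ring * 8 + sector                                            # box label 0..47

# Monte-Carlo estimate of the normal weights c_k of the 48 boxes
_sample = RNG.standard_normal((200000, 2))
_c = np.bincount(box_index(_sample[:, 0], _sample[:, 1]), minlength=48) / 200000.0

def chi2_index(Z, alpha, beta):
    """Posse chi-square projection index (Eq. 3.1) for sphered data Z (n x p)
    projected on the plane spanned by the orthonormal vectors alpha, beta."""
    n = Z.shape[0]
    pi = 0.0
    for lam in np.arange(9) * np.pi / 36:                # lambda_j = j*pi/36, j = 0..8
        a = alpha * np.cos(lam) - beta * np.sin(lam)
        b = alpha * np.sin(lam) + beta * np.cos(lam)
        counts = np.bincount(box_index(Z @ a, Z @ b), minlength=48) / n
        pi += np.sum((counts - _c) ** 2 / _c) / 9.0
    return pi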
Figure 3.3: Layout of the regions for the chi-square projection index; the inner rings have radial width R/5, each box spans 45°, and the outer boxes have weight 1/48 (modified after Posse, 1995a).

3.2.2.2 Finding the structure (PPEDA algorithm)

For PPEDA, the projection pursuit index $PI_{\chi^2}(\alpha, \beta)$ must be optimized over all possible projections onto 2-D planes. Posse (1990) proposed a random search for locating the global maximum of the projection index. Combined with the structure-removal procedure, this gives a sequence of interesting two-dimensional views of decreasing importance. Starting with random planes, the algorithm tries to improve the current best solution $(\alpha^*, \beta^*)$ by considering two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ within a neighborhood of $(\alpha^*, \beta^*)$. These candidate planes are given by

$$a_1 = \frac{\alpha^* + c\,v}{\left\| \alpha^* + c\,v \right\|}, \quad b_1 = \frac{\beta^* - (a_1^T \beta^*)\,a_1}{\left\| \beta^* - (a_1^T \beta^*)\,a_1 \right\|}, \qquad a_2 = \frac{\alpha^* - c\,v}{\left\| \alpha^* - c\,v \right\|}, \quad b_2 = \frac{\beta^* - (a_2^T \beta^*)\,a_2}{\left\| \beta^* - (a_2^T \beta^*)\,a_2 \right\|} \qquad (3.2)$$

where c is a scalar that determines the size of the neighborhood visited, and v is a unit p-vector uniformly distributed on the unit p-dimensional sphere.
The idea is to start with a global search and then concentrate on the region of the global maximum by decreasing the value of c. After a specified number of steps, called half, without an increase of the projection index, the value of c is halved. When this value is small enough, the optimization is stopped. Part of the search still remains global to avoid being trapped in a spurious local optimum. The complete search for the best plane consists of m such random searches with different random starting planes. The goal of the PP algorithm is to find the best projection plane. The steps of PPEDA are given below (a schematic implementation of this random search is sketched after the list):
1. Sphere the OD set; let Z be the matrix of the sphered data.
2. Generate a random starting plane $(\alpha_0, \beta_0)$, where $\alpha_0$ and $\beta_0$ are orthonormal. Consider this the current best plane $(\alpha^*, \beta^*)$.
3. Evaluate the projection index $PI_{\chi^2}(\alpha^*, \beta^*)$ for the starting plane.
4. Generate two candidate planes $(a_1, b_1)$ and $(a_2, b_2)$ according to Eq. (3.2).
5. Calculate the projection index for these candidate planes.
6. Choose the candidate plane with the higher value of the projection pursuit index as the current best plane $(\alpha^*, \beta^*)$.
7. Repeat steps 4 through 6 while there are improvements in the projection pursuit index.
8. If the index does not improve for a certain number of iterations (half), decrease the value of c by half.
9. Repeat steps 4 to 8 until c becomes some small number (say 0.01).
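The following sketch outlines the random search, reusing the chi2_index() function of the previous sketch. The parameter names (n_starts, c0, half, c_min) are illustrative, and the stopping logic is a simplified reading of steps 4-9, not the exact thesis implementation.

import numpy as np

def ppeda_search(Z, n_starts=4, c0=1.0, half=30, c_min=0.01, rng=None):
    """Random search for the plane maximizing the chi-square index (Sec. 3.2.2.2).
    Z is the sphered data matrix (n x p)."""
    rng = rng or np.random.default_rng()
    p = Z.shape[1]
    best_idx, best_plane = -np.inf, None
    for _ in range(n_starts):
        a = rng.standard_normal(p); a /= np.linalg.norm(a)
        b = rng.standard_normal(p); b -= (b @ a) * a; b /= np.linalg.norm(b)
        cur = chi2_index(Z, a, b)
        c, no_improve = c0, 0
        while c > c_min:
            v = rng.standard_normal(p); v /= np.linalg.norm(v)
            improved = False
            for sign in (+1, -1):                        # the two candidate planes of Eq. (3.2)
                a_c = a + sign * c * v; a_c /= np.linalg.norm(a_c)
                b_c = b - (a_c @ b) * a_c; b_c /= np.linalg.norm(b_c)
                idx = chi2_index(Z, a_c, b_c)
                if idx > cur:
                    a, b, cur, improved = a_c, b_c, idx, True
            no_improve = 0 if improved else no_improve + 1
            if no_improve >= half:                       # halve the neighbourhood size
                c, no_improve = c / 2.0, 0
        if cur > best_idx:
            best_idx, best_plane = cur, (a, b)
    return best_plane, best_idx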
3.2.2.3 Structure removal

There may be more than one interesting projection, and there may be other views that reveal insights about the hyperspectral data. To locate other views, Friedman (1987) proposed a method called structure removal. In this approach, the PP algorithm is first performed on the data set to obtain the structure, i.e. the optimal projection plane. The approach then removes the structure found at that projection and repeats the projection pursuit process to find a projection that yields another maximum value of the projection pursuit index. Proceeding in this manner gives a sequence of projections providing informative views of the data. The procedure repeatedly transforms the projected data to standard normal until they stop becoming more normal, as measured by the projection pursuit index. One starts with a $p \times p$ matrix whose first two rows are the vectors of the projection obtained from PPEDA; the remaining rows have 1 on the diagonal and 0 elsewhere. For example, if p = 4, then

$$U^* = \begin{bmatrix} \alpha_1^* & \alpha_2^* & \alpha_3^* & \alpha_4^* \\ \beta_1^* & \beta_2^* & \beta_3^* & \beta_4^* \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.3)$$

The Gram-Schmidt orthonormalization process (Strang, 1988) makes the rows of $U^*$ orthonormal; let U be the orthonormal matrix obtained from $U^*$. The next step in the structure removal process is to transform the Z matrix using

$$T = U Z^T \qquad (3.4)$$

where T is a $p \times n$ matrix. With this transformation, the first two rows of T contain, for every observation, the projection onto the plane given by $(\alpha^*, \beta^*)$. Structure removal is then performed by applying a transformation $\Theta$ that transforms the first two rows of T to a standard normal and leaves the rest unchanged (Martinez, 2004). This is where the structure is removed, making the data normal in that projection (the first two rows). The transformation is defined as follows:

$$\Theta(T)_1 = \phi^{-1}\!\left[ F(T_1) \right], \qquad \Theta(T)_2 = \phi^{-1}\!\left[ F(T_2) \right], \qquad \Theta(T)_i = T_i, \quad i = 3, 4, \ldots, p \qquad (3.5)$$

where $\phi^{-1}$ is the inverse of the standard normal cumulative distribution function, $T_1$ and $T_2$ are the first two rows of the matrix T, and F is the function defined through Eqs. (3.7) and (3.8). From Eq. (3.5) it is seen that only the first two rows of T are changed. $T_1$ and $T_2$ can be written as

$$T_1 = \left( z_1^{\alpha^*}, z_2^{\alpha^*}, \ldots, z_j^{\alpha^*}, \ldots, z_n^{\alpha^*} \right), \qquad T_2 = \left( z_1^{\beta^*}, z_2^{\beta^*}, \ldots, z_j^{\beta^*}, \ldots, z_n^{\beta^*} \right) \qquad (3.6)$$
where $z_j^{\alpha^*}$ and $z_j^{\beta^*}$ are the coordinates of the jth observation projected onto the plane spanned by $(\alpha^*, \beta^*)$. Next, a rotation about the origin through the angle $\gamma$ is defined as follows:

$$\tilde{z}_1^{(t)}(j) = z_1^{(t)}(j)\cos\gamma + z_2^{(t)}(j)\sin\gamma, \qquad \tilde{z}_2^{(t)}(j) = z_1^{(t)}(j)\sin\gamma - z_2^{(t)}(j)\cos\gamma \qquad (3.7)$$

where $\gamma = 0, \pi/4, \pi/8, 3\pi/8$ and $z_1^{(t)}(j)$ represents the jth element of $T_1$ at the tth iteration of the process. The following transformation is now applied to the rotated points; it replaces each rotated observation by its normal score in the projection:

$$z_1^{(t+1)}(j) = \phi^{-1}\!\left[ \frac{r\!\left( \tilde{z}_1^{(t)}(j) \right) - 0.5}{n} \right], \qquad z_2^{(t+1)}(j) = \phi^{-1}\!\left[ \frac{r\!\left( \tilde{z}_2^{(t)}(j) \right) - 0.5}{n} \right] \qquad (3.8)$$

where $r\!\left( \tilde{z}_1^{(t)}(j) \right)$ represents the rank of $\tilde{z}_1^{(t)}(j)$. With this procedure, the projection index is reduced by making the data more normal. During the first few iterations, the projection index should decrease rapidly (Friedman, 1987). After approximate normality is obtained, the index might oscillate with small changes. Usually, the process takes between 5 and 15 complete iterations to remove the structure. Once the structure is removed in this way, the data are transformed back using

$$Z'^{\,T} = U^T\, \Theta\!\left( U Z^T \right) \qquad (3.9)$$

From matrix theory (Strang, 1988), it is known that all directions orthogonal to the structure (i.e., all rows of T other than the first two) have not been changed, whereas the structure has been Gaussianized and then transformed back. A schematic implementation of this structure-removal step is sketched below; the next section summarizes the steps of PP.
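The following is a minimal sketch of the structure-removal transformation of Eqs. (3.3)-(3.9). It assumes the canonical basis rows used to complete U* are not degenerate with the found plane, and uses scipy for the ranks and the inverse normal CDF; function and parameter names are illustrative.

import numpy as np
from scipy.stats import norm, rankdata

def remove_structure(Z, alpha, beta, n_iter=10):
    """Z: n x p sphered data; alpha, beta: orthonormal p-vectors of the found plane."""
    n, p = Z.shape
    U = np.eye(p)
    U[0], U[1] = alpha, beta                 # first two rows span the structure plane (Eq. 3.3)
    for i in range(p):                       # Gram-Schmidt: make the rows of U orthonormal
        for j in range(i):
            U[i] -= (U[i] @ U[j]) * U[j]
        U[i] /= np.linalg.norm(U[i])
    T = U @ Z.T                              # Eq. (3.4): rows 1-2 hold the projected structure
    for _ in range(n_iter):
        for gamma in (0.0, np.pi / 4, np.pi / 8, 3 * np.pi / 8):
            t1 = T[0] * np.cos(gamma) + T[1] * np.sin(gamma)     # rotation, Eq. (3.7)
            t2 = T[0] * np.sin(gamma) - T[1] * np.cos(gamma)
            T[0] = norm.ppf((rankdata(t1) - 0.5) / n)            # normal scores, Eq. (3.8)
            T[1] = norm.ppf((rankdata(t2) - 0.5) / n)
    return (U.T @ T).T                       # Eq. (3.9): data with the structure Gaussianized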
3.2.2.4 Steps of PP

1. Load the data and set the values of the parameters: the number of best projection planes (N), the number of random starts in the neighborhood (m), and the values of c and half.
2. Sphere the data and obtain the Z matrix.
3. Find each of the desired number of projection planes (structures) using the Posse chi-square index (Section 3.2.2.2).
4. Remove the structure (to reduce the effect of local optima) and find another structure (Section 3.2.2.3) until the projection pursuit index stops changing.
5. Continue the process until the desired number of best projection planes (orthogonal to each other) is obtained.

3.2.3 Kernel principal component analysis (KPCA)

Kernel principal component analysis (KPCA) means conducting the PCT in feature space (kernel space). KPCA operates on variables that are nonlinearly related to the input variables. In this section the KPCA algorithm is described by way of the PCA algorithm. First, m TP ($x_i \in R^n$, $i = 1, \ldots, m$) are chosen. PCA finds the principal axes by diagonalizing the covariance matrix

$$C = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T \qquad (3.10)$$

The covariance matrix C is positive semi-definite; hence, non-negative eigenvalues are obtained from

$$\lambda v = C v \qquad (3.11)$$

For PCA, the eigenvalues are first sorted in decreasing order and the corresponding eigenvectors are found. The test points are then projected onto the eigenvectors; the PCs are obtained in this manner. The next step is to rewrite PCA in terms of dot products. Substituting Eq. (3.10) into Eq. (3.11) gives

$$\lambda v = C v = \frac{1}{m} \sum_{j=1}^{m} x_j x_j^T v$$

Thus,
$$\lambda v = \frac{1}{m} \sum_{j=1}^{m} (x_j \cdot v)\, x_j \qquad (3.12)$$

since $x x^T v = (x \cdot v)\, x$. In Eq. (3.12) the term $(x_j \cdot v)$ is a scalar. This means that all solutions v with $\lambda \neq 0$ lie in the span of $x_1, \ldots, x_m$, i.e.

$$v = \sum_{i=1}^{m} \alpha_i x_i \qquad (3.13)$$

Steps for KPCA

1. First transform the TP to the feature space H using the mapping $\Phi$ associated with a kernel function. The data set ($\Phi(x_i)$, $i = 1, \ldots, m$) in feature space is assumed to be centered in order to reduce the complexity of the calculation. The covariance matrix in H then takes the form

$$C = \frac{1}{m} \sum_{j=1}^{m} \Phi(x_j) \Phi(x_j)^T \qquad (3.14)$$

2. Find the eigenvalues $\lambda \geq 0$ and the corresponding non-zero eigenvectors $v \in H \setminus \{0\}$ of the covariance matrix C from the equation

$$\lambda v = C v \qquad (3.15)$$

3. As shown previously (for PCA), all solutions v with $\lambda \neq 0$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_m)$, i.e.

$$v = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.16)$$

Therefore,

$$C v = \lambda v = \lambda \sum_{i=1}^{m} \alpha_i \Phi(x_i) \qquad (3.17)$$

Substituting Eq. (3.14) and Eq. (3.16) in Eq. (3.17),

$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_j)\, \Phi(x_j)^T \Phi(x_i) \qquad (3.18)$$

4. Define the kernel inner product by $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. Substituting this in Eq. (3.18), the following equation is obtained:
$$m \lambda \sum_{j=1}^{m} \alpha_j \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_j)\, K(x_j, x_i) \qquad (3.19)$$

5. To express the relationship in Eq. (3.19) entirely in terms of the inner-product kernel, premultiply both sides by $\Phi(x_k)^T$ for all $k = 1, \ldots, m$. Define the $m \times m$ matrix K, called the kernel matrix, whose ijth element is the inner-product kernel $K(x_i, x_j)$, and the vector $\alpha$ of length m whose jth element is the coefficient $\alpha_j$.

6. Eq. (3.19) can then be written as

$$m \lambda \sum_{j=1}^{m} \alpha_j\, \Phi(x_k)^T \Phi(x_j) = \sum_{j=1}^{m} \sum_{i=1}^{m} \alpha_i\, \Phi(x_k)^T \Phi(x_j)\, \Phi(x_j)^T \Phi(x_i), \quad \forall\, k = 1, 2, \ldots, m \qquad (3.20)$$

Using $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, Eq. (3.20) can be transformed into

$$m \lambda K \alpha = K^2 \alpha \qquad (3.21)$$

To find the solution of Eq. (3.21), the eigenvalue problem of Eq. (3.22) needs to be solved:

$$m \lambda \alpha = K \alpha \qquad (3.22)$$

7. The solution of Eq. (3.22) provides the eigenvalues and eigenvectors of the kernel matrix K. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigenvalues of K and $\beta_1, \beta_2, \ldots, \beta_m$ the corresponding set of eigenvectors, with $\lambda_p$ being the last non-zero eigenvalue.
Figure 3.4: (a) Input points before kernel PCA; (b) output after kernel PCA. The three groups are distinguishable using the first component only (Wikipedia, 2010).

8. To extract the principal components, the projection onto the eigenvectors $\beta_n$ in H ($n = 1, \ldots, p$) needs to be computed. Let x be a test point with image $\Phi(x)$ in H. Then

$$\beta_n^T \Phi(x) = \sum_{i=1}^{m} \beta_i^{\,n}\, \Phi(x_i)^T \Phi(x) = \sum_{i=1}^{m} \beta_i^{\,n}\, K(x_i, x) \qquad (3.23)$$

9. In the above algorithm it has been assumed that the data set is centered, but it is difficult to obtain the mean of the mapped data in feature space H (Schölkopf, 2004). Therefore, it is problematic to center the mapped data in feature space. However, this can be handled by slightly modifying the equation for kernel PCA: the kernel matrix to be diagonalized is replaced by

$$\tilde{K} = K - 1_m K - K 1_m + 1_m K 1_m, \qquad \text{where } (1_m)_{ij} = \frac{1}{m} \ \ \forall\, i, j \qquad (3.24)$$

A compact sketch of the resulting KPCA procedure is given below.
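The sketch below follows Eqs. (3.22)-(3.24). The RBF kernel, the gamma parameter and the eigenvector normalization are illustrative assumptions rather than the settings used in the thesis experiments.

import numpy as np

def kpca(X, X_test, n_components=10, gamma=1.0):
    """Kernel PCA sketch. X: (m, d) training pixels; X_test: (t, d) test pixels."""
    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    m = X.shape[0]
    K = rbf(X, X)
    one_m = np.full((m, m), 1.0 / m)
    K_c = K - one_m @ K - K @ one_m + one_m @ K @ one_m      # centering, Eq. (3.24)
    eigvals, eigvecs = np.linalg.eigh(K_c)                   # solves m*lambda*alpha = K*alpha, Eq. (3.22)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))   # unit-length eigenvectors in feature space
    K_t = rbf(X_test, X)                                     # project test points, Eq. (3.23),
    one_t = np.full((K_t.shape[0], m), 1.0 / m)              # with the same centering applied
    K_tc = K_t - one_t @ K - K_t @ one_m + one_t @ K @ one_m
    return K_tc @ alphas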
Figure 3.5 provides an outline of the KPCA algorithm.

Figure 3.5: Outline of KPCA algorithm

3.2.4 Orthogonal subspace projection (OSP)

The idea of orthogonal subspace projection is to eliminate all unwanted or undesired spectral signatures (background) within a pixel, and then to use a matched filter to extract the desired spectral signature (endmember) present in that pixel.
3.2.4.1 Automated target generation process algorithm (ATGP)

In hyperspectral image analysis, a pixel may encompass many different materials; such pixels are called mixed pixels and contain multiple spectral signatures. Let a column vector $r_i$ represent the ith mixed pixel by the linear model

$$r_i = M \alpha_i + n_i \qquad (3.25)$$

where $r_i$ is an $l \times 1$ column vector representing the ith mixed pixel and l is the number of spectral bands. Each distinct material in the mixed pixel is called an endmember. Assume that there are p spectrally distinct endmembers in the ith mixed pixel. M is an $l \times p$ matrix made up of linearly independent columns, denoted by $(m_1, m_2, \ldots, m_j, \ldots, m_p)$, where $m_j$ is the spectral signature of the jth distinct material or endmember; the system is considered over-determined ($l > p$). Let $\alpha_i$ be a p-dimensional column vector $(\alpha_1, \alpha_2, \ldots, \alpha_j, \ldots, \alpha_p)^T$ whose jth element represents the fraction of the jth signature present in the ith mixed pixel. $n_i$ is an $l \times 1$ column vector representing white Gaussian noise with zero mean and covariance matrix $\sigma^2 I$, where I is an $l \times l$ identity matrix. In Eq. (3.25) the $r_i$ are thus a linear combination of the p endmembers with weight coefficients given by the fraction vector $\alpha_i$. The term $M\alpha_i$ can be rewritten to separate the desired spectral signature from the undesired signatures, that is, to separate targets from background. When searching for a single spectral signature this can be written as

$$M \alpha = d\, \alpha_p + U \gamma \qquad (3.26)$$

where d is the $l \times 1$ column vector $m_p$, the desired signature of interest, and $\alpha_p$ is a scalar, the fraction of the desired signature. The matrix U is composed of the remaining column vectors of M; these are the undesired spectral signatures, or background information, given by $U = (m_1, m_2, \ldots, m_j, \ldots, m_{p-1})$ with dimension $l \times (p-1)$, and $\gamma$ is a column vector containing the remaining $(p-1)$ components (fractions) of $\alpha$.
Suppose P is an operator that eliminates the effects of U, the undesired signatures. To do this, an orthogonal subspace operator has been developed that projects r onto a subspace orthogonal to the columns of U. This results in a vector that only contains energy associated with the target d and the noise n. The operator used is the $l \times l$ matrix

$$P = I - U (U^T U)^{-1} U^T \qquad (3.27)$$

The operator P maps d into a space orthogonal to the space spanned by the uninteresting signatures in U. Applying the operator P to the mixed pixel r of Eq. (3.25) gives

$$P r = \alpha_p P d + P U \gamma + P n \qquad (3.28)$$

It should be noticed that P operating on $U\gamma$ reduces the contribution of U to zero (close to zero in real data applications). Therefore, the above rearrangement gives

$$P r = \alpha_p P d + P n \qquad (3.29)$$

3.2.4.2 Signal-to-noise ratio (SNR) maximization

The second step in deriving the pixel classification operator is to find the $1 \times l$ operator $X^T$ that maximizes the SNR. Operating on Eq. (3.28) gives

$$X^T P r = \alpha_p X^T P d + X^T P U \gamma + X^T P n \qquad (3.30)$$

The operator $X^T$ acting on Pr produces a scalar (Ientilucci, 2001). The SNR is given by

$$\lambda = \frac{\alpha_p^2\, X^T P d\, d^T P^T X}{X^T P\, E[n n^T]\, P^T X} \qquad (3.31)$$

$$\lambda = \frac{\alpha_p^2}{\sigma^2} \left( \frac{X^T P d\, d^T P^T X}{X^T P P^T X} \right) \qquad (3.32)$$

where $E[\,]$ denotes the expected value. Maximization of this quotient is the generalized eigenvector problem

$$P d\, d^T P^T X = \lambda'\, P P^T X \qquad (3.33)$$
where $\lambda' = \lambda \left( \dfrac{\sigma^2}{\alpha_p^2} \right)$. The value of $X^T$ that maximizes $\lambda$ can be determined using the techniques outlined by Miller, Farison and Shin (1992) together with the idempotent and symmetric properties of the interference rejection operator. As it turns out, the value of $X^T$ that maximizes the SNR is

$$X^T = k\, d^T \qquad (3.34)$$

where k is an arbitrary scalar. Substituting the result of Eq. (3.34) into Eq. (3.30), it is seen that the overall classification operator for a desired hyperspectral signature in the presence of multiple undesired signatures and white noise is given by the $1 \times l$ vector

$$q^T = d^T P \qquad (3.35)$$

This operator first nulls the interfering signatures and then uses a matched filter for the desired signature to maximize the SNR. When the operator is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a scalar that is a measure of the presence of the signature of interest. The ultimate aim is to reduce the l images that make up the hyperspectral image cube to a single image in which pixels with high intensity indicate the presence of the desired signature. The operator can easily be extended to seek out k signatures of interest: the vector operator simply becomes a $k \times l$ matrix operator given by

$$Q = (q_1, q_2, \ldots, q_j, \ldots, q_k) \qquad (3.36)$$

When the operator of Eq. (3.36) is applied to all of the pixels in a hyperspectral scene, each $l \times 1$ pixel is reduced to a $k \times 1$ vector. Ultimately, the l-dimensional hyperspectral image reduces to a k-dimensional feature extracted image where pixels with high intensity indicate the presence of the desired signatures, with each band corresponding to one desired signature. A minimal implementation sketch of this operator is given below, followed by a worked example.
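The sketch below builds the operators of Eqs. (3.27), (3.35) and (3.36). The function names and the assumption that U has full column rank are for illustration only.

import numpy as np

def osp_operator(d, U):
    """Interference-rejection operator P (Eq. 3.27) and classification operator
    q^T = d^T P (Eq. 3.35) for a desired signature d and undesired signatures U."""
    l = d.shape[0]
    P = np.eye(l) - U @ np.linalg.inv(U.T @ U) @ U.T
    return P, d @ P                           # q as a length-l vector

def osp_features(image, M):
    """Apply Eq. (3.36): one OSP band per endmember. image: (n_pixels, l); M: (l, p)."""
    scores = []
    for j in range(M.shape[1]):
        d = M[:, j]
        U = np.delete(M, j, axis=1)           # the remaining columns are the background
        _, q = osp_operator(d, U)
        scores.append(image @ q)              # scalar score per pixel for this signature
    return np.stack(scores, axis=1)           # (n_pixels, p) feature-extracted image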
The above algorithm is illustrated with the following example. Let us start with three vectors or classes, each six elements (bands) long. The vectors are in reflectance units:

$$\text{Concrete} = \begin{bmatrix} 0.26 \\ 0.30 \\ 0.31 \\ 0.31 \\ 0.31 \\ 0.31 \end{bmatrix}, \qquad \text{Tree} = \begin{bmatrix} 0.07 \\ 0.07 \\ 0.11 \\ 0.54 \\ 0.55 \\ 0.54 \end{bmatrix}, \qquad \text{Water} = \begin{bmatrix} 0.07 \\ 0.13 \\ 0.19 \\ 0.25 \\ 0.30 \\ 0.34 \end{bmatrix}$$

Suppose the image consists of 100 pixels, ordered from left to right, and that the 40th pixel looks like

$$pixel_{40} = 0.08\,(\text{concrete}) + 0.75\,(\text{tree}) + 0.07\,(\text{water}) + \text{noise} \qquad (3.37)$$

Let us assume that the noise is zero. If all the pixel mixture fractions have been defined, a particular class spectrum can be chosen for extraction from the image. Suppose the concrete material has to be extracted throughout the image; the same procedure can be followed to extract the tree and water materials. Assume that $pixel_{40}$ is made up of some weighted linear combination of the endmembers,

$$pixel_{40} = M \alpha + \text{noise} \qquad (3.38)$$

Now $M\alpha$ can be broken up into the desired, $d\alpha_p$, and undesired, $U\gamma$, signatures, and the desired (d) and undesired (U) signatures are assigned to spectra. Let concrete be the vector d, and let tree and water be the column vectors of the matrix U. The fractions of mixing are unknown to us, but it is known that $pixel_{40}$ is made up of some combination of d and U:

$$d = \left[ \text{concrete} \right], \qquad U = \left[ \text{tree} \ \ \text{water} \right]$$

Now the effect of U needs to be reduced. To do this, a projection operator P is required that, when operated on U, will reduce its contribution to zero. To find concrete, d, $pixel_{40}$ is projected onto a subspace that is orthogonal to the columns of U using the operator P. In other words, P maps d into a space orthogonal to the space spanned by the undesired signatures while simultaneously minimizing the effects of U. If P is operated on U, which contains tree and water, it is seen that the effect of U is minimized:
$$P U = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \qquad (3.39)$$

Now let $r_1 = pixel_{40}$ and n = noise; then, from Eq. (3.29),

$$P r_1 = \alpha_p P d + P n \qquad (3.40)$$

An operator $X^T$ now needs to be found that maximizes the signal-to-noise ratio (SNR). The operator $X^T$ acting on $Pr_1$ produces a scalar. As stated before, the value of $X^T$ that maximizes the SNR is $X^T = k\,d^T$. This leads to the overall OSP operator of Eq. (3.35). In this way the matrix Q of Eq. (3.36) can be formed; the entire data vector can then be projected along the columns of Q and the OSP feature extracted image is formed.

3.3 Supervised classifiers

This section describes the mathematical background of the supervised classifiers. First, the Bayesian decision rule is described, followed by the decision rule for the Gaussian maximum likelihood classifier (GML). Afterwards, the k-nearest neighbor (KNN) and support vector machine (SVM) classification rules are described.

3.3.1 Bayesian decision rule

In pattern recognition, patterns need to be classified. There are plenty of decision rules available in the literature, but only Bayes decision theory is optimal (Riggi and Harmouche, 2004). It is based on the well-known Bayes theorem. Suppose there are K classes; let $f_k(\mathbf{x})$ be the distribution function of the kth class, $1 \le k \le K$, and $P(c_k)$ the prior probability of the kth class, such that $\sum_{k=1}^{K} P(c_k) = 1$. For any class k, the posterior probability of a pixel vector $\mathbf{x}$ is denoted by $p(c_k \,|\, \mathbf{x})$ and defined by (assuming all classes are mutually exclusive)

$$p(c_k \,|\, \mathbf{x}) = \frac{P(\mathbf{x} \,|\, c_k)\, P(c_k)}{\sum_{j=1}^{K} f_j(\mathbf{x})\, P(c_j)} \qquad (3.41)$$
Therefore, the Bayes decision rule is

$$\mathbf{x} \in c_i \ \text{ if } \ p(c_i \,|\, \mathbf{x}) = \max_k\, p(c_k \,|\, \mathbf{x}) \qquad (3.41a)$$

3.3.2 Gaussian maximum likelihood classification (GML)

The Gaussian maximum likelihood classifier assumes that the distribution of the data points is Gaussian (normally distributed) and classifies an unknown pixel based on the variance and covariance of the spectral response patterns. The classification is based on the probability density function associated with the training data. Pixels are assigned to the most likely class based on a comparison of the posterior probabilities that they belong to each of the signatures being considered. Under this assumption, the distribution of a category response pattern can be completely described by the mean vector and the covariance matrix. With these parameters, the statistical probability of a given pixel value being a member of a particular land cover class can be computed (Lillesand et al., 2002). GML classification achieves the minimum classification error under the assumption that the spectral data of each class are normally distributed. It considers not only the cluster centre but also its shape, size and orientation by calculating a statistical distance based on the mean values and covariance matrices of the clusters. The decision boundary function for GML classification is

$$g_k(\mathbf{x}) = -\frac{1}{2} \left[ \ln \left| \hat{\Sigma}_k \right| + (\mathbf{x} - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (\mathbf{x} - \hat{\mu}_k) \right] \qquad (3.42)$$

and the final Bayesian decision rule is

$$\mathbf{x} \in c_j \ \text{ if } \ g_j(\mathbf{x}) = \max_k\, g_k(\mathbf{x})$$

where $g_k(\mathbf{x})$ is the decision boundary function for the kth class. A minimal sketch of this classifier is given below.
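The sketch below applies the GML rule of Eq. (3.42) with equal prior probabilities assumed; function and variable names are illustrative, not the thesis code.

import numpy as np

def gml_train(X, y):
    """Estimate per-class mean and covariance from training pixels X (n, d) with labels y."""
    return {c: (X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)}

def gml_classify(X, stats):
    """Assign each pixel to the class maximizing g_k(x) of Eq. (3.42)."""
    classes = list(stats)
    G = np.empty((X.shape[0], len(classes)))
    for j, c in enumerate(classes):
        mu, cov = stats[c]
        diff = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # quadratic form (x - mu)^T Sigma^-1 (x - mu) evaluated row-wise
        G[:, j] = -0.5 * (logdet + np.einsum('ij,jk,ik->i', diff, inv, diff))
    return np.array(classes)[np.argmax(G, axis=1)]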
3.3.3 k-nearest neighbor classification

The KNN algorithm (Fix and Hodges, 1951) is a nonparametric classification technique which has proven to be effective in pattern recognition. However, its inherent limitations and disadvantages restrict its practical application; one shortcoming is lazy learning, which makes the traditional KNN time-consuming. In this thesis work the traditional KNN process has been applied (Fix and Hodges, 1951). The k-nearest neighbor classifier is commonly based on the Euclidean distance between a test pixel and the specified TP. The TP are vectors in a multidimensional feature space, each with a class label. In the classification phase, k is a user-defined constant, and an unlabelled vector, i.e. a test pixel, is classified by assigning the label which is most frequent among the k training samples nearest to that test pixel.

Figure 3.6: KNN classification scheme. The test pixel (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5, it is classified to the first class (3 squares vs. 2 triangles inside the outer circle). If k = 11, it is classified to the first class (6 squares vs. 5 triangles) (modified after Wikipedia, 2009).

Let x be an n-dimensional test pixel and $y_i$ ($i = 1, 2, \ldots, p$) the n-dimensional TP. The Euclidean distance between them is defined by

$$d_i(x, y_i) = \sqrt{ (x_{11} - y_{i1})^2 + (x_{12} - y_{i2})^2 + \cdots + (x_{1n} - y_{in})^2 } \qquad (3.43)$$

where $x = (x_{11}, x_{12}, \ldots, x_{1n})$, $y_i = (y_{i1}, y_{i2}, \ldots, y_{in})$, $D = \{d_1, d_2, \ldots, d_p\}$, and p is the number of TP.
The final KNN decision rule is:

$$\mathbf{x} \in c_j \ \text{ if at least } \begin{cases} \dfrac{k}{2} + 1, & k \text{ even} \\[4pt] \left\lceil \dfrac{k}{2} \right\rceil, & k \text{ odd} \end{cases} \ \text{ of the } k \text{ minimum elements of } D \text{ correspond to class } c_j \qquad (3.44)$$

In the case of a tie, the test pixel is assigned to the class $c_j$ whose mean vector is at the minimum distance from the test pixel. Here k ($1 \le k \le p$) is a user-defined parameter that specifies the number of nearest neighbors chosen for classification. The outline of the KNN classification algorithm is given in Figure 3.7; a minimal sketch follows the figure.

Figure 3.7: Outline of KNN algorithm
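The sketch below implements the majority vote of Eq. (3.44) with the tie broken by the distance to the class mean vector, as described above; k and the variable names are illustrative.

import numpy as np

def knn_classify(test, train, labels, k=5):
    """Classify each test pixel by majority vote among its k nearest TP."""
    means = {c: train[labels == c].mean(axis=0) for c in np.unique(labels)}
    out = []
    for x in test:
        d = np.linalg.norm(train - x, axis=1)            # Euclidean distances, Eq. (3.43)
        nearest = labels[np.argsort(d)[:k]]              # labels of the k nearest TP
        cls, votes = np.unique(nearest, return_counts=True)
        winners = cls[votes == votes.max()]
        if len(winners) == 1:
            out.append(winners[0])
        else:                                            # tie: nearest class mean wins
            out.append(min(winners, key=lambda c: np.linalg.norm(means[c] - x)))
    return np.array(out)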
3.3.4 Support vector machine (SVM)

The foundations of support vector machines (SVM) were developed by Vapnik (1995). The formulation embodies the structural risk minimization (SRM) principle, which has been shown to be superior (Gunn et al., 1997) to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks: SRM minimizes an upper bound on the expected risk, whereas ERM minimizes the error on the training data. SVMs were developed to solve the classification problem, but they have recently been extended to the domain of regression problems (Vapnik et al., 1997). SVM is basically a linear learning machine based on the principle of optimal separation of classes. The aim is to find a hyperplane which linearly separates the class of interest. The linear separating hyperplane is placed between the classes in such a way that it satisfies two conditions: (i) all the data vectors that belong to the same class are placed on the same side of the separating hyperplane, and (ii) the distance between the two closest data vectors in the two classes is maximized (Vapnik, 1982). The main aim of SVM is thus to define an optimum hyperplane between two classes which maximizes the margin between the two classes. For each class, the data vectors forming the boundary of the classes are called the support vectors (SV), and the hyperplane is called the decision surface (Pal, 2002).

3.3.4.1 Statistical learning theory

The goal of statistical learning theory (Vapnik, 1998) is to create a mathematical framework for learning from input training data with known classes and for predicting the outcome of data points with unknown identity. Two induction principles are used: the first is ERM, whose aim is to reduce the training error, and the second is SRM, whose goal is to minimize the upper bound on the expected error over the whole data set. The empirical risk differs from the expected risk in two ways (Haykin, 1999). First, it does not depend on the unknown cumulative distribution function. Secondly, it can be minimized with respect to the parameters used in the decision rule.

3.3.4.2 Vapnik-Chervonenkis dimension (VC-dimension)

The VC-dimension is a measure of the capacity of a set of classification functions. The VC-dimension, generally denoted by h, is an integer that represents the largest number of data points that can be separated by a set of functions $f_\alpha$ in all possible ways.
For example, for an arbitrary two-class classification problem, the VC-dimension is the maximum number of points k which can be separated into two classes without error in all possible $2^k$ ways (Varshney and Arora, 2004). For instance, oriented lines in the plane can shatter any three points in general position but not four, so their VC-dimension is 3.

3.3.4.3 Support vector machine algorithm with quadratic optimization method (SVM_QP)

The procedure for obtaining a separating hyperplane by SVM is explained for a simple linearly separable case of two classes which can be separated by a hyperplane; it can be extended to the multiclass classification problem. The procedure can also be extended, through the kernel method for SVM, to the case where a hyperplane cannot separate the two classes. Let there be n training samples obtained from two classes, represented as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in R^m$, m is the dimension of the data vector, and each sample belongs to either of the two classes labelled by $y \in \{-1, +1\}$. These samples are said to be linearly separable if there exists a hyperplane in m-dimensional space whose orientation is given by a vector w and whose location is determined by a scalar b, the offset of the hyperplane from the origin (Figure 3.8). If such a hyperplane exists, then the given set of training data points must satisfy the following inequalities:

$$w \cdot x_i + b \geq +1 \quad \forall\, i: y_i = +1 \qquad (3.45)$$

$$w \cdot x_i + b \leq -1 \quad \forall\, i: y_i = -1 \qquad (3.46)$$

Thus, the equation of the hyperplane is given by $w \cdot x_i + b = 0$.
Figure 3.8: Linear separating hyperplane for linearly separable data (modified after Gunn, 1998).

The inequalities in Eq. (3.45) and Eq. (3.46) can be combined into a single inequality:

$$y_i (w \cdot x_i + b) \geq 1 \qquad (3.47)$$

Thus, the decision rule for the linearly separable case can be defined in the following form:

$$x_i \in \operatorname{sign}(w \cdot x_i + b) \qquad (3.48)$$

where $\operatorname{sign}(\cdot)$ is the signum function, whose value is +1 for any element greater than or equal to zero and -1 if it is less than zero. The signum function can thus easily represent the two classes given by the labels +1 and -1. The separating hyperplane (Figure 3.8) will separate the two classes optimally when its margin from both classes is equal and maximum (Varshney, 2004), i.e. the hyperplane should be located exactly in the middle of the two classes.
The distance $D(x; w, b)$ is used to express the margin of separation for a point x from the hyperplane defined by w and b. It is given by

$$D(x; w, b) = \frac{\left| w \cdot x + b \right|}{\left\| w \right\|_2} \qquad (3.49)$$

where $\left\| \cdot \right\|_2$ denotes the second norm, which is equivalent to the Euclidean length of the vector for which it is computed, and $\left| \cdot \right|$ is the absolute value. Let d be the value of the margin between the two separating planes $w \cdot x + b = +1$ and $w \cdot x + b = -1$. The margin can be expressed as

$$d = \frac{w \cdot x + b + 1}{\left\| w \right\|_2} - \frac{w \cdot x + b - 1}{\left\| w \right\|_2} = \frac{2}{\left\| w \right\|_2} = \frac{2}{\sqrt{w^T w}} \qquad (3.49a)$$

To obtain the optimal hyperplane, the margin d should be maximized, i.e. $2 / \left\| w \right\|_2$ should be maximized, which is equivalent to minimizing the 2-norm of the vector w. Thus, the objective function $\Phi(w)$ for finding the best separating hyperplane reduces to

$$\Phi(w) = \frac{1}{2} w^T w \qquad (3.50)$$

A constrained optimization problem can be constructed for minimizing the objective function in Eq. (3.50) under the constraints given in Eq. (3.47). This kind of constrained optimization problem, with a convex objective function of w and linear constraints, is called a primal problem and can be solved using standard quadratic programming (QP) optimization techniques. The QP optimization technique can be implemented by replacing the inequalities with a simpler form by transforming the problem into a dual space representation using Lagrange multipliers $\lambda_i$ (Luenberger, 1984). The vector w can be defined in terms of the Lagrange multipliers $\lambda_i$ as shown below:
$$w = \sum_{i=1}^{n} \lambda_i y_i x_i, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.51)$$

The dual optimization problem obtained through the Lagrange multipliers $\lambda_i$ thus becomes

$$\max_{\lambda}\ L(w, b, \lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j \, (x_i \cdot x_j) \qquad (3.52)$$

subject to the constraints

$$\sum_{i=1}^{n} \lambda_i y_i = 0 \qquad (3.53)$$

$$\lambda_i \geq 0, \quad i = 1, 2, \ldots, n \qquad (3.54)$$

The solution of the optimization problem is obtained in terms of the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) optimality conditions (Taylor, 2000), some of the Lagrange multipliers will be zero; the multipliers with nonzero values correspond to the SVs. The result from an optimizer, also called the optimal solution, is a set of unique and independent multipliers $\lambda^o = (\lambda_1^o, \lambda_2^o, \ldots, \lambda_{n_s}^o)$, where $n_s$ is the number of support vectors found. Substituting these into Eq. (3.51) gives the orientation of the optimal separating hyperplane, $w^o$, as

$$w^o = \sum_{i=1}^{n_s} \lambda_i^o y_i x_i \qquad (3.55)$$

The offset from the origin, $b^o$, is determined from

$$b^o = -\frac{1}{2} \left[ w^o \cdot x_{+1}^o + w^o \cdot x_{-1}^o \right] \qquad (3.56)$$

where $x_{+1}^o$ and $x_{-1}^o$ are support vectors of the class labels +1 and -1 respectively. The following decision rule (obtained from Eq. (3.48)) is then applied to classify the data vectors into the two classes +1 and -1:

$$f(x) = \operatorname{sign}\!\left( \sum_{\text{support vectors}} \lambda_i^o\, y_i\, (x_i \cdot x) + b^o \right) \qquad (3.57)$$

Eq. (3.57) implies that

$$x \in \operatorname{sign}\!\left( \sum_{\text{support vectors}} \lambda_i^o\, y_i\, (x_i \cdot x) + b^o \right) \qquad (3.58)$$
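A minimal sketch of the dual problem of Eqs. (3.52)-(3.54) is given below. It uses scipy's SLSQP solver as a generic stand-in for the QP optimizer used in the thesis, and it assumes linearly separable training data so that at least one support vector exists in each class; all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def svm_qp_train(X, y, tol=1e-6):
    """Hard-margin SVM: solve the dual of Eqs. (3.52)-(3.54); y in {-1, +1}."""
    n = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T            # H_ij = y_i y_j (x_i . x_j)
    fun = lambda lam: 0.5 * lam @ H @ lam - lam.sum()    # minimize the negative of Eq. (3.52)
    jac = lambda lam: H @ lam - np.ones(n)
    res = minimize(fun, np.zeros(n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n,                              # Eq. (3.54)
                   constraints={'type': 'eq', 'fun': lambda lam: lam @ y})  # Eq. (3.53)
    lam = res.x
    sv = lam > tol                                       # nonzero multipliers are the SVs
    w = ((lam * y)[:, None] * X).sum(axis=0)             # Eq. (3.55)
    x_pos = X[sv & (y == 1)][0]                          # one SV from each class for Eq. (3.56)
    x_neg = X[sv & (y == -1)][0]
    b = -0.5 * (w @ x_pos + w @ x_neg)
    return w, b, lam

def svm_predict(X, w, b):
    return np.sign(X @ w + b)                            # decision rule, Eq. (3.57)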