Support vector classification
A support vector machine (SVM) is an example of a supervised, multivariate classification method.
SVMs are supervised in the sense that they are ‘trained’ to learn about differences between
the groups to be classified. The method has previously been applied to neuroimaging data
(Fan et al., 2005a; Fan et al., 2005b; Kawasaki et al., 2006; Lao et al., 2004; Mourao-Miranda
et al., 2005). The image data do not need to satisfy the assumptions of Gaussian random field
theory, so image smoothing is unnecessary.
SVMs are related to other multivariate methods such as canonical variate analysis, a method
successfully applied to FA-images of patients with Alzheimer’s disease (Teipel et al., 2007).
General account of SVMs
Here we give a short account of SVMs to convey the basics of the method to the non-technical
reader. In the context of machine learning, individual MR images are treated as points located
in a high dimensional space. Supplementary figure 1 illustrates this in a two-dimensional
space for simplicity: the circles and squares cannot be separated by their values along a single
dimension. Only a combination of two dimensions allows reliable separation.
Supplementary figure 1: Illustrating the concept of group separation in a high-dimensional
space and the concept of a decision boundary.
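The situation in Supplementary figure 1 can be reproduced with a handful of made-up points. The sketch below is illustrative only: the coordinates and the helper `separable_by_threshold` are hypothetical and are not taken from the study data. It checks that neither coordinate alone separates the two groups, whereas a combination of both (here simply their difference) does.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Made-up 2D points standing in for the circles and squares of the figure.
circles: List[Point] = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
squares: List[Point] = [(2.0, 1.0), (3.0, 2.0), (4.0, 3.0)]


def separable_by_threshold(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if one cut-off value puts all of `a` on one side and all of `b` on the other."""
    return max(a) < min(b) or max(b) < min(a)


# Each dimension on its own: the value ranges of the two groups overlap.
x_alone = separable_by_threshold([p[0] for p in circles], [p[0] for p in squares])
y_alone = separable_by_threshold([p[1] for p in circles], [p[1] for p in squares])

# A linear combination of both dimensions (x - y) separates the groups.
combined = separable_by_threshold(
    [x - y for x, y in circles], [x - y for x, y in squares]
)

print(x_alone, y_alone, combined)  # prints: False False True
```

The line x - y = 0 plays the role of the decision boundary in this toy example.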
In practical terms, a linear kernel matrix is created from the normalised FA-images. To this
end, the pairwise "dot product" of each scan with every other scan is computed. For a given pair
of scans, each voxel value in one scan is multiplied by the corresponding voxel value (i.e. the
voxel at the same point in the brain) in the other scan. The resulting products (as many as
there are voxels in each FA-image) are then summed to form one element of the kernel matrix.
Each row or column of the matrix therefore holds the dot products of one scan with all the
other scans, and the diagonal holds the dot product of each scan with itself. Intuitively, the
kernel matrix can be viewed as a measure of how similar the subjects' scans are to one another.
The voxels are effectively treated as the coordinates of a high dimensional space, so there are
normally as many dimensions as there are voxels in the FA-map. Each image is a point in this
space, and its location is determined by the intensity value of each voxel. The images do not
span the whole high dimensional space, but rather cluster in subspaces that contain very
similar images. This is one reason why normalisation of the images into a standard space is an
important pre-processing step. Good normalisation will tighten this grouping and reduce the
effective dimensionality.
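The construction of the kernel matrix described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: each image is assumed to have been flattened into a list of voxel values of equal length, the function name `linear_kernel_matrix` is hypothetical, and the three tiny "images" are made up.

```python
from typing import List, Sequence


def linear_kernel_matrix(images: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the matrix of pairwise dot products between flattened images."""
    n_voxels = len(images[0])
    if any(len(img) != n_voxels for img in images):
        raise ValueError("all images must have the same number of voxels")

    n = len(images)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Multiply corresponding voxels and sum the products.
            value = sum(a * b for a, b in zip(images[i], images[j]))
            kernel[i][j] = value
            kernel[j][i] = value  # the dot product is symmetric
    return kernel


# Three made-up "images" with four voxels each.
images = [
    [0.5, 0.25, 0.0, 1.0],
    [0.5, 0.5, 0.25, 0.0],
    [0.0, 1.0, 0.5, 0.5],
]
K = linear_kernel_matrix(images)
for row in K:
    print(row)
```

The result is an N x N matrix for N scans, regardless of how many voxels each scan contains, which is what keeps the computation tractable for whole-brain images.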
An SVM used for classification is an example of a linear discriminative model. The basic
model is a binary classifier: it divides the space in which the MR images are distributed into
two classes by finding a separating hyperplane. In a simple two-dimensional space this would
be a separating line or boundary (see Supplementary figure 1); in higher dimensional spaces it
is called a hyperplane. Fisher’s linear discriminant analysis and the linear perceptron can
both derive linear discriminant hyperplanes. The motivation behind an SVM, however, is
“structural risk minimization”: training aims to find the hyperplane that maximizes the
distance (the margin) between the two classes. Intuitively, the optimal separating hyperplane
(OSH) of an SVM is defined mostly by the data samples that lie close to the separating
boundary between the two classes, i.e., the samples that are most ambiguous. These training
samples are called the “support vectors”. Samples that lie further away from the separating
boundary are distinctly different and hence are not used to calculate the OSH. This suggests
that adding more samples to a training set may not improve the definition of the OSH if the
new samples lie far from it.
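For readers who want a formal statement, the maximum-margin idea can be written as an optimisation problem. What follows is the standard textbook hard-margin formulation, given here as a sketch; the symbols are introduced for this illustration and do not appear in the original text. Let x_i denote the i-th training image written as a vector of voxel values, y_i its label (-1 or +1), w the normal vector of the hyperplane and b its offset.

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1,
\qquad i = 1,\dots,N .

% The width of the margin between the two classes is 2 / \lVert \mathbf{w} \rVert,
% so minimising \lVert \mathbf{w} \rVert maximises the margin.

% At the solution the normal vector is a weighted sum of the training images,
\mathbf{w} = \sum_{i=1}^{N} \alpha_i\, y_i\, \mathbf{x}_i ,
\qquad \alpha_i \ge 0 ,

% and \alpha_i > 0 only for the support vectors, i.e. the samples with
% y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) = 1 that lie on the margin.
```

In practice a soft-margin variant, which tolerates some samples inside the margin, is commonly used; the role of the support vectors is the same.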
After training an SVM, the OSH defines the learned differences between groups (in our case
PSCs and controls). At this point, it is important to know how well this separation will
generalise, as it is possible that the OSH is specific only to the data used for training.
Therefore, a validation step is used to assess the accuracy of the classifier, i.e. how well it
generalizes to other data. A number of methods are available for this; one such method is
leave-one-out cross-validation. This procedure repeats SVM training iteratively, each time
leaving a single image out of the training set. After each training step, a prediction is made
for the excluded image and compared with the ground truth. By leaving out each of the
images in turn, it is possible to estimate the accuracy with which the classifier will
generalize to new data. It is important to note that no image is ever part of both the training
set and the test set within a given iteration of the validation procedure. This is further
illustrated in the textbooks cited at the end of this section.
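The leave-one-out loop itself is independent of the classifier and can be sketched as follows. Note the assumptions: to keep the example self-contained, a simple nearest-centroid classifier is swapped in for the SVM, and the function names and the toy data are hypothetical. Only the structure of the validation loop is meant to carry over.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], int]


def fit_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Stand-in classifier (NOT an SVM): predict the label of the closest class mean."""
    centroids: Dict[int, List[float]] = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def predict(x: Vector) -> int:
        def sq_dist(c: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, c))

        return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

    return predict


def leave_one_out_accuracy(
    X: List[Vector], y: List[int], fit: Callable[[List[Vector], List[int]], Predictor]
) -> float:
    """Train on all samples but one, test on the held-out sample, repeat for each sample."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]  # the held-out sample is never seen in training
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train)
        if predict(X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up, well-separated toy data: three samples per group.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
y = [-1, -1, -1, 1, 1, 1]
accuracy = leave_one_out_accuracy(X, y, fit_nearest_centroid)
print(accuracy)  # prints: 1.0
```

Because the classifier is retrained from scratch in every iteration, the prediction for each held-out image is made by a model that has never seen that image.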
In addition to testing whether a specific pattern of white matter changes exists, we were
interested in determining which pattern of voxels is most relevant for classification. During
the training process the SVM assigns a specific weight to every image, reflecting the
importance of that scan in separating our two groups; the weight is zero for non-contributing
images (those that are not support vectors). Each weight is multiplied by the label indicating
which group the image belongs to (e.g., –1 for PSC and +1 for controls), and each image is then
multiplied by this signed weight. The weighted images from both groups are then summed,
resulting in a value for each voxel that indicates how important it is for discrimination.
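The computation of the voxel-wise discrimination map can be sketched as a weighted sum of images. This is an illustration under stated assumptions: the function name `discrimination_map` is hypothetical, the images are tiny made-up vectors, and the per-image weights are invented for the example (in a real analysis they would come from the trained SVM).

```python
from typing import List, Sequence


def discrimination_map(
    images: Sequence[Sequence[float]],
    labels: Sequence[int],
    weights: Sequence[float],
) -> List[float]:
    """Sum the images after scaling each one by (weight * label)."""
    voxel_map = [0.0] * len(images[0])
    for image, label, weight in zip(images, labels, weights):
        signed_weight = weight * label
        if signed_weight == 0.0:
            continue  # non-contributing image (not a support vector)
        for v, value in enumerate(image):
            voxel_map[v] += signed_weight * value
    return voxel_map


# Made-up example: three images with three voxels each.
images = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
labels = [-1, 1, 1]        # e.g. -1 for PSC, +1 for controls
weights = [0.5, 0.5, 0.0]  # hypothetical per-image weights; the third image does not contribute
w = discrimination_map(images, labels, weights)
print(w)  # prints: [-0.5, -0.5, 0.5]
```

The sign of each voxel value indicates which group a higher intensity at that voxel favours, and the magnitude indicates how strongly the voxel contributes to the decision.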
The interested reader is referred to the following textbooks (Bishop, 2006; Vapnik, 1998).
Voxel based analysis of T1-weighted data
All T1-weighted images were analysed using SPM5 (www.fil.ion.ucl.ac.uk/spm/). Images
were segmented into grey matter and white matter and normalised to MNI space using a unified
approach developed by Ashburner and colleagues (Ashburner and Friston, 2005). This
technique employs prior tissue probability maps (TPMs) for each tissue class that code the
probability of each voxel belonging to a given tissue class. The intensity distribution of voxels
from each class is modelled as a mixture of Gaussians. After an initial affine normalisation
step the TPMs are then warped to fit individual T1 images. Parameters for bias correction,
tissue classification and spatial normalisation are iteratively estimated from the same
generative model. An additional step, usually referred to as modulation, is included to
compensate for the effect of spatial normalisation. This step involves multiplying the spatially
normalised segmented images by their relative volume before and after spatial normalisation
(Ashburner and Friston, 2000). After this step, the values of each voxel represent a measure of
the local volume of that tissue class. Finally, we smoothed the data using an isotropic
Gaussian smoothing kernel of 10 mm (full width at half maximum). This was done to render
the data more normally distributed and to account for the inexact nature of the normalisation
process. The two groups were compared with two-sample t-tests. We display the
results (Supplementary Fig. 2) at an exploratory threshold of p<0.01 (uncorrected).
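The arithmetic behind three of these steps can be sketched as follows. This is not the SPM5 implementation, only a hedged illustration: the function names, the per-voxel `relative_volume` values and the toy data are hypothetical, and the t statistic shown is the classical pooled-variance form, which is an assumption (the variance model used by SPM may differ).

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def modulate(tissue: Sequence[float], relative_volume: Sequence[float]) -> List[float]:
    """Scale each normalised tissue value by the local relative volume (assumed given per voxel)."""
    return [t * r for t, r in zip(tissue, relative_volume)]


def fwhm_to_sigma(fwhm_mm: float) -> float:
    """Convert the full width at half maximum of a Gaussian to its standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def two_sample_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Pooled-variance two-sample t statistic for the values of two groups at one voxel."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


# Modulation of two made-up voxels.
modulated = modulate([0.8, 0.5], [0.5, 2.0])

# Standard deviation of the 10 mm FWHM smoothing kernel, in mm.
sigma_mm = fwhm_to_sigma(10.0)

# Made-up grey matter values at a single voxel for the two groups.
controls = [0.6, 0.7, 0.8]
pscs = [0.4, 0.5, 0.6]
t_value = two_sample_t(controls, pscs)

print(modulated, round(sigma_mm, 2), round(t_value, 2))
```

A 10 mm FWHM kernel corresponds to a Gaussian standard deviation of roughly 4.25 mm. In a voxel-based analysis the t statistic is computed in this way at every voxel, which is why correction for multiple comparisons matters when interpreting the resulting map.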
Supplementary Fig. 2. Results when testing for areas with a greater local grey matter volume
in controls compared to PSC. Results are overlaid on a single subject’s image in MNI-space.
The striatum bilaterally as well as the adjacent insular cortex showed reduced local
grey matter volume in PSCs compared to controls. Relatively few cortical areas
differed between the groups even at this liberal exploratory threshold of p<0.01. Local grey
matter volume in the region indicated by the cross-hairs (x,y,z = -36, -18, 44 in MNI-space;
T-score=3.7; uncorrected p=0.001; FWE-corrected p=1.00) did not correlate with
subject-specific levels of voluntary-guided saccade impairment.
References
Ashburner J, Friston KJ. Voxel-based morphometry--the methods. Neuroimage 2000; 11:
Ashburner J, Friston KJ. Unified segmentation. Neuroimage 2005; 26: 839-51.
Bishop C. Pattern recognition and machine learning. New York: Springer, 2006.
Burges C. A tutorial on support vector machines for pattern recognition. Data Mining and
Knowledge Discovery 1998: 121-167.
Fan RE, Chen PH, Lin CJ. Working set selection using the second order information for
training SVM. Journal of Machine Learning Research 2005a; 6: 1889-1918.
Fan Y, Shen D, Davatzikos C. Classification of structural images via high-dimensional image
warping, robust feature extraction, and SVM. Med Image Comput Comput Assist
Interv Int Conf Med Image Comput Comput Assist Interv 2005b; 8: 1-8.
Kawasaki Y, Suzuki M, Kherif F, Takahashi T, Zhou SY, Nakamura K, et al. Multivariate
voxel-based morphometry successfully differentiates schizophrenia patients from
healthy controls. Neuroimage 2006.
Lao Z, Shen D, Xue Z, Karacali B, Resnick SM, Davatzikos C. Morphological classification
of brains via high-dimensional shape transformations and machine learning methods.
Neuroimage 2004; 21: 46-57.
Mourao-Miranda J, Bokde AL, Born C, Hampel H, Stetter M. Classifying brain states and
determining the discriminating activation patterns: Support Vector Machine on
functional MRI data. Neuroimage 2005; 28: 980-95.
Teipel SJ, Stahl R, Dietrich O, Schoenberg SO, Perneczky R, Bokde AL, et al. Multivariate
network analysis of fiber tract integrity in Alzheimer's disease. Neuroimage 2007; 34:
Tipping M. Sparse Bayesian learning and the relevance vector machine. Journal of Machine
Learning Research 2001; 1: 211-244.
Vapnik V. Statistical Learning Theory. New York: Wiley Interscience, 1998.