Faculty of Geosciences 
and Environment 
Support Vector Machines for Spatio-Temporal 
Avalanche Forecasting 
Giona Matasci 
Master of Science in Environmental Geosciences 
Supervisors: Prof. Mikhail Kanevski, Dr. Alexei Pozdnoukhov
Experts: Dr. Ross Purves, Devis Tuia
January 2009
Title page image: 
Aonach Mor cornices, source: 
saislochaber.blogspot.com
Abstract 
Statistically based methods for avalanche forecasting have been widely developed, in many regions subject to this kind of natural hazard, to detect avalanche days. Such techniques are often based on simple supervised classification methods like Nearest Neighbors and only focus on the temporal component of the avalanche activity. The purpose of this Master thesis is to build a reliable spatio-temporal forecasting model that is able to efficiently integrate spatial information about avalanche events. The application of machine learning algorithms for pattern recognition, namely Support Vector Machines, is demonstrated with a case study on a dataset from Lochaber, Scotland, UK. Encouraging results were obtained in this extension of the usual forecasting procedure.
The meteorological and snowpack factors globally describing avalanche likelihood in the mountain area have been combined with spatial features (derived from a Digital Elevation Model) related to the avalanche paths where the events have been observed. Thanks to a large database consisting of 17 years of daily condition observations matched with release occurrences, we could develop an effective decision-support tool to assess the avalanche danger at a fine spatial resolution (gullies, particular slopes, etc.).
Interesting results, expressed in terms of confusion matrices for the predictions on a test dataset (forecasts of gully avalanche activity) as well as avalanche danger maps, are presented in this research report. In addition, the behavior of the model in discriminating safe from risky situations under critical changes in the conditions affecting the snowpack proved consistent in a perceptive validation based on the analysis of several observed cases (a specified avalanche path on a given day). Moreover, the use of auxiliary SVM techniques made it possible to automatically highlight the most meaningful features to include in statistical models aimed at successfully predicting avalanche releases in time and space. Finally, using the same state-of-the-art learning machine as a starting point, elements of the sensitivity analysis of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided.
Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal approach, machine learning, supervised classification, kernel methods, Support Vector Machines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
Résumé
Statistical methods for avalanche forecasting have been widely developed in many regions subject to this type of natural hazard. These techniques are often based on simple supervised classification methods such as Nearest Neighbors and focus only on the temporal component of the avalanche danger. The aim of this Master thesis is to build a reliable spatio-temporal prediction model capable of efficiently integrating spatial information about avalanche events. The application of machine learning algorithms for pattern recognition, namely Support Vector Machines, demonstrated with a case study concerning the Lochaber region in Scotland, United Kingdom, yielded encouraging results in this extension of the usual forecasting procedures.
The meteorological and snowpack factors globally describing avalanche conditions have been combined with spatial information (derived from a Digital Terrain Model) related to the avalanche paths where the events were observed. Thanks to a large database consisting of 17 years of daily observations of avalanche conditions and of the associated releases, we obtained an effective decision-support tool to assess the avalanche danger with a good spatial resolution (gullies, specific slope types, etc.).
Interesting results in terms of confusion matrices for the predictions on a test dataset (forecasts of the avalanche activity of the different paths), as well as avalanche danger maps, are presented in this report. Furthermore, the behavior of the model in discriminating safe situations from risky ones, under a critical evolution of the conditions affecting the snowpack, proved very satisfactory after a perceptive validation based on the study of actually observed cases (a well-defined avalanche path on a given day). In addition, the use of auxiliary techniques related to SVMs made it possible to automatically highlight which variables are the most important to include in statistical models aimed at successfully predicting avalanches in time and space. Finally, still using the same high-performing supervised learning method as a starting point, elements on the sensitivity of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided.
Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal approach, machine learning, supervised classification, kernel methods, Support Vector Machines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
Acknowledgments 
First and foremost, I am grateful for the advice and support of both my supervisors during 
the whole Master program. 
I thank Prof. Mikhail Kanevski for having introduced me to the field of machine learning as well as its applications to environmental sciences, and for his interest in my research.
I would like to thank Dr. Alexei Pozdnoukhov, first, for his great availability and patience when supervising me and, second, for having guided me throughout this thesis with constructive suggestions about the topics to focus on. His great help when dealing either with the theoretical aspects of the methods used or with their concrete implementation will not be forgotten. Thank you Alexei!
Dr. Ross Purves is acknowledged for the interesting discussions about avalanche forecasting 
in Scotland and for the useful hints provided. 
Moreover, I also greatly appreciated the help and ideas given to me by Devis, Loris, Fréd and the rest of the geomatics group at IGAR during the work for my Master thesis.
A big and deep “grazie” is addressed to my family, in particular to my parents Franca 
and Sandro, for the support they provided me during these years spent at the university in 
Lausanne. 
All my friends scattered across Switzerland, as well as the "spécialisation 2" crew of the Master, deserve gratitude for the fun moments spent together during this period.
Last but not least, I am grateful to the “US” relatives, namely Louis and Caroline, for 
proofreading the English. 
...and all those I forgot, thank you!
Contents 
1 Introduction 1 
1.1 Objectives and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 
1.2 Prior work on data-driven statistical avalanche forecasting . . . . . . . . . . . 2 
1.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 
1.2.2 Prior work on avalanche forecasting in the Lochaber region . . . . . . 3 
2 Machine Learning 6 
2.1 Supervised learning vs. unsupervised learning . . . . . . . . . . . . . . . . . 6 
2.1.1 Nearest Neighbors for classification . . . . . . . . . . . . . . . . . . . . 7 
2.2 Statistical Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 
2.2.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . 9 
2.2.2 Structural Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . 9 
2.3 Model selection and model assessment . . . . . . . . . . . . . . . . . . . . . . 11 
3 Support Vector Machines for classification 12 
3.1 Large margin linear classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 
3.1.1 Optimal separating hyperplanes . . . . . . . . . . . . . . . . . . . . . . 12 
3.1.2 The optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . 14 
3.1.3 Support Vectors and their relevance . . . . . . . . . . . . . . . . . . . 15 
3.1.4 Soft margin adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 15 
3.2 Kernel expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.1 The principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.2 A concrete example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.3 Valid kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 
3.2.4 Details on the Gaussian RBF kernel . . . . . . . . . . . . . . . . . . . 19 
3.3 Parameters tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 
3.4 Binary classification quality measures . . . . . . . . . . . . . . . . . . . . . . 21 
4 Extensions of the SVMs-based approach 24 
4.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 
4.1.1 Methods overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 
4.1.2 SVM-Recursive Feature Elimination . . . . . . . . . . . . . . . . . . . 25 
4.2 Probabilistic SVM output interpretation . . . . . . . . . . . . . . . . . . . . . 26 
4.2.1 Interpretations for decision support . . . . . . . . . . . . . . . . . . . . 26 
4.2.2 The sigmoid transform . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 
4.2.3 Parameters tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 
4.3 Active Learning with SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 
4.3.1 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 
4.3.2 Overview of the existing techniques . . . . . . . . . . . . . . . . . . . . 28 
5 Avalanche forecasting as a spatio-temporal classification problem 30 
5.1 Avalanche data from Scotland: the Lochaber region case study . . . . . . . . 30 
5.2 Set up of the spatio-temporal classification problem . . . . . . . . . . . . . . . 33 
5.3 Choice and conception of the input features . . . . . . . . . . . . . . . . . . . 34 
6 Prediction of avalanche activity at individual paths 38 
6.1 SVM training and parameters tuning . . . . . . . . . . . . . . . . . . . . . . . 38 
6.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 
6.1.2 Model optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 
6.2 Predictions for years 2006-2007 . . . . . . . . . . . . . . . . . . . . . . . . . . 44 
6.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 
6.2.2 Comments and observations . . . . . . . . . . . . . . . . . . . . . . . . 46 
7 Avalanche danger mapping 50 
7.1 Avalanche danger assessment: probabilistic SVM output tuning . . . . . . . . 50 
7.2 Mapping on the prediction grid . . . . . . . . . . . . . . . . . . . . . . . . . . 51 
7.3 Gradient mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 
8 Extended analysis of avalanche data with SVMs-related methods 56 
8.1 Relevant features choice: RFE . . . . . . . . . . . . . . . . . . . . . . . . . . 56 
8.1.1 Set up of the automatic procedure . . . . . . . . . . . . . . . . . . . . 56 
8.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 
8.1.3 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 
8.2 Model behavior under changing conditions . . . . . . . . . . . . . . . . . . . . 60 
8.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 
8.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 
8.2.3 Results and interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 62 
8.3 Active Learning as an exploratory tool in avalanche monitoring . . . . . . . . 66 
8.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
8.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
9 Conclusions 69 
9.1 Main achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 
9.2 Further work on this topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 
A European Danger Scale 76 
B Avalanche danger maps 77
C MATLAB code: gradient mapping 79
Chapter 1 
Introduction 
1.1 Objectives and motivation 
The machine learning domain, presented in chapter 2, has provided many scientific research fields, especially in the last few years, with a solid framework based on a wide variety of techniques aimed at the analysis of datasets of increasing complexity and size.
In particular, the environmental sciences appear to be one of the subjects well matched to such methods. Among the broad variety of subfields related to geosciences, the latest progress in the automatic extraction of dependencies from data has found great application in the forecasting of natural hazards, a theme frequently discussed during the attended Master program. Predictive models founded on concepts drawn from machine learning are robust and very well suited for operational danger assessment purposes.
From this point of view, the topic of avalanche forecasting shows significant potential for promising developments. The statistical approach frequently used to evaluate the likelihood of snow releases on the slopes of a mountain (see the prior work reviewed in section 1.2) can be improved to obtain an extended and enhanced decision-support system helping avalanche forecasters in their daily job. However, the main purpose of this work is to explore the possible applications of several machine learning techniques in this research field, without focusing particularly on the issues affecting operational aspects of forecasting. The reasons behind such an approach are mainly that studies joining these two domains are in their early stages, and that my knowledge of the specificities of the avalanche forecasting process is not adequate compared to that of forecasters with years of experience.
Nonetheless, the scope of this work is to build a reliable predictive model aimed at giving an efficient spatial extension to the forecasting systems originally designed to produce predictions about global avalanche activity over a whole region. Therefore, the morphological characteristics of the mountain range terrain affecting local-scale weather and snowpack conditions will be taken into account by the presented learning machine.
The core of the analysis is centered on the well-known supervised classification method named Support Vector Machines (SVMs). This product of Statistical Learning Theory will be discussed in chapter 3. The performance of such a classifier when dealing with high-dimensional data will allow the incorporation of a wide range of features describing avalanching conditions at the level of single avalanche paths. The classification problem will be set up by matching these variables with the related actual activity of a given gully, giving rise either to an avalanche event or to a safe situation. This spatio-temporal approach to avalanche forecasting is described in chapter 5, while the results in terms of the classification quality of the predictions for the 2006 and 2007 winter seasons are reported in chapter 6.
While focusing on SVMs as the main root of the methodological part of the work, the objectives of the research also include developing some tools, based on the classical machine learning/SVM data-driven approaches described in chapter 4, used to highlight some properties of the avalanche hazard studied by taking into account the spatial variation of the phenomenon. The feasibility of mapping the avalanche danger over the region under study will be considered in chapter 7. Then, we attempt to identify the most useful features to involve in the classification task by assessing their real influence on the decisions taken by the model and on the evolution of the avalanche danger. Next, we investigate the actual sensitivity of the model to changing meteorological and snowpack conditions. Furthermore, some suggestions are given for the possible optimization of the information gathering procedure through improvements in the avalanche monitoring task. All these topics will be covered in chapter 8.
This thesis extends the previous work on this topic (see [29]) carried out by Dr. Alexei Pozdnoukhov during his post-doctoral fellowship at the Institute of Geomatics and Analysis of Risk (IGAR) of the University of Lausanne (information about the main research achievements is available at www.geokernels.org). The case study that will be treated concerns the Lochaber region, located in the Scottish Highlands, UK, which is subject to numerous avalanche events during the winter season. Avalanche data collected on the slopes of these mountain ranges were available thanks to the previous collaboration between IGAR and the sportScotland Avalanche Information Service (www.sais.gov.uk) and to the contribution of Dr. Ross Purves.
1.2 Prior work on data-driven statistical avalanche forecasting
1.2.1 Overview 
Avalanche forecasting is a crucial task for many winter resorts where numerous skiers, mountaineers and climbers are present every day. The procedure, which results in a report of avalanche conditions with the associated danger, is carried out manually by the forecasters of the region. These experts are in the field every day to understand the evolution of the different factors affecting avalanche releases. Information about snowpack conditions and stability, weather parameters and actual avalanche activity is collected by the observers on a daily basis.
Nevertheless, in some skiing venues, numerical models are available to support the decisions taken based on the experience of the forecasters. Some physical models exist to aid in the assessment of snowpack evolution (see [1] for the case of Switzerland) but, generally, statistical forecasting systems are much more commonly used.
models are devoted to the prediction of current avalanche activity by looking for similari-ties 
with conditions influencing releases recorded in the past (meteorological and snowpack 
factors essentially). 
The statistical models currently operationally used or tested on real avalanche data are 
producing temporal forecasts about global avalanche activity in a given region on a given 
day. Avalanche days and safe days are discriminated using several different statistically 
based techniques belonging to the supervised learning category (pattern recognition). These 
methods include discriminant analysis [13], regression trees and Nearest Neighbors [3]. 
The last technique mentioned is widely applied for operational forecasting in many different countries. For example, in Switzerland the NXD system (NXD2000 and NXD-REG, described in [14] and [2]), developed by the Swiss Federal Institute for Snow and Avalanche Research (SLF), is used at local and regional scales to help experts produce the final avalanche danger reports. These specialists receive as model output the 10 days most similar (among those included in the database of past observed conditions) to the current day's situation. By checking under which conditions and in which locations avalanches were observed on these days, they obtain concrete, helpful information to use in assessing the actual avalanche danger. The next subsection will illustrate the use of these nearest neighbors methods in Scotland.
1.2.2 Prior work on avalanche forecasting in the Lochaber region 
The case study that will be discussed throughout this thesis concerns avalanche forecasting, namely forecasting that includes the spatial component of the avalanche activity, in the Lochaber area, Scotland, United Kingdom. In this introductory part of the work I will present a short survey of the work done in this field using the same avalanche data.
Nearest Neighbors model Cornice 
Purves et al. in [30] describe the Nearest Neighbors model developed for the operational forecasting of avalanche activity in the Scottish mountainous region under study. In conjunction with local avalanche forecasters, the scientists involved in this project implemented a decision-support system called Cornice which provides useful information about past avalanching conditions, helpful in producing a reliable hazard report. The forecasts are made available in the afternoon (around 3 pm) and include a description of the situation experienced during the day as well as the expected development of avalanche activity over the next 24 hours.
The model takes as inputs different meteorological and snowpack variables influencing 
the release of avalanches in the region (a list of the available variables is given in table 5.1 
in section 5.1). A historical database starting in 1991 is then searched. The outputs consist 
of the values taken by the same input variables during the 10 most similar recorded days 
(Euclidean distance of equation (2.1) on page 7 as a dissimilarity measure). Additionally,
the spatial locations of the documented avalanche events occurring during these days are 
also shown on a geo-referenced map. Hence, both the causes, in terms of weather/snowpack 
conditions, and the consequences, in terms of possible avalanche events, are available to the 
forecasters.
The model developers did not use subjective weighting of the inputs based on forecasters' experience, but instead chose to implement an automated procedure to find the optimal weights. The optimization of the variables' relevance was carried out by means of genetic algorithms, using several fitness metrics to evaluate the ability of different sets of weights to correctly forecast avalanche and non-avalanche days. For both the optimization of the parameters and the verification (testing) of the model, on a given day, a forecast of avalanche activity is produced if 3 or more of the 10 nearest neighbors were avalanche days. If this threshold is not reached, the day under examination is forecast as safe.
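To make the decision rule concrete, here is a minimal Python sketch of a Cornice-style forecast. The arrays, weights and distance weighting are hypothetical placeholders, not the actual Cornice implementation.

```python
import numpy as np

def forecast_day(query, past_days, past_labels, weights, k=10, threshold=3):
    """Forecast an avalanche day if at least `threshold` of the k nearest
    past days (weighted Euclidean distance) were avalanche days."""
    diffs = (past_days - query) * weights
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k most similar days
    return int(past_labels[nearest].sum() >= threshold)

# Toy usage: 1000 past days described by 8 weighted variables
rng = np.random.default_rng(0)
past_days = rng.normal(size=(1000, 8))
past_labels = rng.integers(0, 2, size=1000)   # 1 = avalanche day, 0 = safe day
weights = np.ones(8)                          # weights found by the genetic algorithm
print(forecast_day(rng.normal(size=8), past_days, past_labels, weights))
```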
The batch testing of the model (assessing the generalization error by cross-validation) was carried out on 1323 days (in practice 1005, because of no-visibility days), covering the years 1991 to 2002, in order to evaluate the agreement of the model forecasts with the observations. The results can be summarized with binary confusion matrices (contingency tables) from which several categorical statistics can be computed (see section 3.4). The best prediction performances were obtained with an optimization via either the Hanssen and Kuipers discriminant or the unweighted average accuracy, leading to an Overall Accuracy of 0.83 and to a Hanssen and Kuipers discriminant value of 0.61. The models correctly forecasted slightly more than 200 avalanche days, with only approximately 60 misses and 115 false alarms.
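These scores can be reproduced approximately from the counts quoted above, assuming the remaining days are correct non-avalanche forecasts and taking the Hanssen and Kuipers discriminant in its usual form, hit rate minus false alarm rate (a sketch, not the authors' verification code):

```python
# Approximate counts from the Cornice batch test (1005 usable days)
hits, misses, false_alarms = 200, 60, 115
correct_negatives = 1005 - hits - misses - false_alarms   # ~630

overall_accuracy = (hits + correct_negatives) / 1005
# Hanssen and Kuipers discriminant = hit rate - false alarm rate
hk = hits / (hits + misses) - false_alarms / (false_alarms + correct_negatives)
print(round(overall_accuracy, 2), round(hk, 2))   # ~0.83 and ~0.61
```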
The Cornice application produced quantitative results considered very encouraging by the 
authors. However, its main utility is clearly recognized as a support for the forecasters in the 
information gathering and hypothesis testing process allowing avalanche danger assessment. 
Support Vector Machines model 
This temporal avalanche forecasting approach was revisited by Pozdnoukhov et al. in [29], who applied machine learning methods to the Lochaber dataset with the purpose of increasing the accuracy of the predictions. In this work, the high-performing supervised classifier called Support Vector Machine is used first to improve the discrimination ability in the temporal predictive task (avalanche days vs. non-avalanche days), and then applied to make a preliminary extension towards spatial avalanche danger forecasting.
The adopted methodology was centered on a purely data-driven approach, starting with the selection of the relevant features to be employed using the automated procedure called Recursive Feature Elimination (see section 4.1). An initial set of 44 variables, comprising combinations of the variables measured on the slopes (current day features, previous days features, expert features), was filtered by retaining the 20 most valuable non-redundant features for the classification task.
After SVM parameter optimization by cross-validation on the winters from 1991 to 2001,
a test of the model performance was carried out on the 712 days of observations in the 
period 2001-2007. The method showed a satisfactory ability to detect avalanche days: the 
Overall Accuracy reached 0.86 whilst the Hanssen and Kuipers discriminant scored 0.64. A 
comparison with nearest neighbors methods applied on the same dataset demonstrated a 
slight superiority of the SVM technique. 
Furthermore, a transform of the SVM decision function into a probability (see section 4.2 for details about the method) allowed a reliable interpretation of the outputs of the model in terms of the likelihood of an avalanche occurring on a given day (application to the 2003/2004 winter).
Given the well-known ability of this machine learning method to deal with high-dimensional 
data, an additional set of spatially varying features such as altitude, slope or aspect was added 
to the vector describing the avalanching conditions on a specified day. The purpose was to 
characterize the local situation at each avalanche path of the Lochaber region by providing 
the model with examples of about 700 avalanche events whose spatial attributes have been 
documented. The authors have then been able to extrapolate the avalanche activity indicator 
over the whole study area thanks to a digital elevation model (DEM). 
Such a spatio-temporal approach has been presented as an early result of a procedure needing refinements and further work aimed at assessing the validity of the results. This initial work, as well as some improvements (spatial distribution of some meteorological features such as wind fields, etc.) already put into practice by the cited researchers (see [28]), is taken as the starting point for this thesis.
Chapter 2 
Machine Learning 
The broad research field of machine learning, which has developed rapidly in the last decades, is often described as a subtopic of computer science whose underlying concepts and ideas derive from closely related domains such as statistics and artificial intelligence.
The notion of machine learning can be presented, in an overall view, as a collection of techniques that are able to "learn" from examples the dependencies existing in the data affecting a given predictive task (the tasks are described in section 2.1). The different methods are designed so that the learning procedure takes place in an automatic and data-driven way. This means that, in general, no human prior knowledge or assumptions concerning data probability distributions are used during the process. For a good foundation on the topic and for additional information, [6] is suggested.
The fields, and related real-world applications, addressed by these state-of-the-art techniques are countless. Early adopters include bioinformatics/biometry (biosequence analyses), chemistry (cheminformatics/chemometrics), medicine (diagnoses), data mining (financial data), web and text mining (text or webpage categorization), speech and hand-written character recognition, etc.
Nevertheless, the development of research in the area of environmental sciences took place only later, with applications in domains such as spatial interpolation, classification of remotely sensed images, etc. (see [17], [18]). In fact, the modeling of geo-spatial phenomena would benefit greatly from the operational use of the latest breakthroughs that have occurred within the machine learning community. Avalanche forecasting in particular, the topic of this thesis, is one of the geosciences domains for which machine learning methods show much promise [29].
2.1 Supervised learning vs. unsupervised learning 
Machine learning methods may be classified into the categories of supervised and unsupervised learning.
Supervised learning can be thought of as a process by which a learning machine is guided through a training procedure to learn the input/output relationships existing in the data set. These examples are called the training data. Each individual sample/example is described by an input vector x belonging to R^N, usually referred to as the input, and presents a related known output y. This means that each sample can be represented as a vector in an N-dimensional space (N variables). Depending on the type of the y value one can define the task as a regression problem or a classification problem (pattern recognition). In the first case the output associated with a given input is a real value y ∈ R. In the second case, with which this thesis will be dealing, output values are discrete, resulting in a binary classification task if y ∈ {−1, 1} or in an m-class classification task if y ∈ {1, 2, . . . , m}. The learning machine, after having seen all L training examples {(x_1, y_1), . . . , (x_L, y_L)}, then provides an estimate of the original function y = f(x) mapping the inputs to the output domain.
The other learning approach may be termed unsupervised learning. In this case the 
learning machine is not provided with the outputs y and the method goal is to extract 
information about the process which generated the data. The main types of this kind of 
learning are clustering (also known as cluster analysis) and density estimation. The first one 
listed is concerned with the grouping of the data points into clusters whose members have 
similar characteristics, without knowing their true class labels. Density estimation methods 
attempt to model the underlying probability distribution of a certain observed phenomenon. 
Combinations of the supervised and unsupervised domains are also possible resulting in 
semi-supervised learning, an approach where labeled and unlabeled examples are provided at 
the same time to the learning machine. A summary of these hybrid techniques, implemented 
to make use of all the available information in order to improve the predictive model, can be 
found in [5]. 
The present thesis is mainly dealing with the supervised approach for binary classification 
problems. The chosen learning system, and its associated tools, is known as Support Vector 
Machines (SVMs). The technique is part of the subfield of machine learning referred to as 
kernel methods [35]. This supervised classification method based on the so-called Support 
Vectors will be detailed in chapter 3. In [11] the reader will find a comprehensive description 
of other supervised learning techniques. These include Fisher’s Linear discriminant analysis, 
Logistic regression, Decision trees, Multi-Layer Perceptrons, Probabilistic Neural Networks, 
k-Nearest Neighbors, etc. The latter will be discussed in the next subsection (2.1.1) since it 
is a benchmark method widely used in avalanche forecasting. 
2.1.1 Nearest Neighbors for classification 
The technique called k-Nearest Neighbors (k-NN) is probably the most intuitive method to solve a classification problem. One can reasonably expect that similar inputs x, in other words examples described by variables taking analogous values, will in most cases possess the same output class label y. This leads to a decision about the class membership of a new point x based on its Euclidean distance (see equation (2.1)) to the training samples x_i. This dissimilarity measure between samples u and v is computed as

dist(u, v) = \sqrt{\sum_{d=1}^{N} (u_d - v_d)^2},    (2.1)

where d is the variable index.
In order to predict the class label y of the vector currently under consideration, a majority vote is set up between the k nearest examples (the k smallest distances) found in the N-dimensional input space.
With a fixed distance measure, the only parameter to tune to get the optimal accuracy in the class label assignments is the number k of neighbors to include in the decision vote. Essentially, choosing a low value of k corresponds to assuming that the data are not corrupted by noise (a structured dataset), so that a close correspondence can be established between the training vectors at our disposal and the new ones whose label y should be forecast. On the contrary, choosing a large k in most cases means that we believe the configuration of the training examples to be largely unstructured, making the input/output matching difficult. This gives rise to a decision process involving a larger set of neighboring examples, approaching a simple global majority vote as k tends to the number of training samples L.
The approach presented here provides good results, particularly for low-dimensional datasets. Due to this success, as well as its appealing logic, k-Nearest Neighbors is often used as a reference technique. On the other hand, when dealing with many variables, this algorithm suffers from the so-called curse of dimensionality. In a high-dimensional input space, new samples whose labels are to be predicted by looking at their neighborhood are often found to be almost equally far from all the training inputs, precluding any reliable prediction.
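For illustration, a minimal NumPy sketch of the k-NN decision rule described above, on toy data (not the operational implementation used in avalanche forecasting):

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=5):
    """Classify x_new by majority vote among its k nearest training samples
    (Euclidean distance of equation (2.1))."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    # Majority vote between the class labels of the k neighbors
    return 1 if (votes == 1).sum() > (votes == -1).sum() else -1

# Toy usage with labels in {-1, +1}
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)
print(knn_predict(np.array([0.5, 0.2, -0.1]), X_train, y_train, k=7))
```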
2.2 Statistical Learning Theory 
In the domain of machine learning, Statistical Learning Theory [39], also known as Vapnik-Chervonenkis theory and first developed by V. Vapnik in the 1960s and 1970s, provides a sound framework for so-called predictive learning. The main goal of this theory is the optimal assessment of a model according to a trade-off between its ability to honor the available information and its complexity.
As stated in section 2.1, a supervised learning model, at the end of the training period, retains a function performing the mapping y = f(x), typically called the decision function for a classification problem. This function is chosen from a set of functions F = {f(x, α), α ∈ Λ}, where α represents a vector of parameters selected from the set Λ. According to Vapnik's concepts, the criterion used to evaluate the goodness of the choice of a given function f(x, α), in other words its similarity to the unknown target function that depicts the actual input/output dependencies, is the following risk functional, called the expected risk:

R(\alpha) = \int Q(y, f(x, \alpha)) \, dP(x, y),    (2.2)

where Q(y, f(x, α)) is a task-defined loss function and P(x, y) is the unknown joint probability distribution of the examples. As can be intuitively understood, the risk should be as low as possible, so our goal is to minimize the expected average loss (2.2).
Reviewing the two main learning problems already mentioned (omitting clustering and density estimation), let us introduce the loss function most commonly used in pattern recognition:

Q(y, f) = \begin{cases} 0 & \text{if } f(x) = y \\ 1 & \text{otherwise.} \end{cases}    (2.3)
For such a loss function, the resulting expected risk is nothing but the probability of a 
classification error. 
In the domain of regression problems the aim is to minimize the differences between the 
actual output value y and the predicted one f(x) for every example. This is translated into 
mathematical terms, in most cases, by means of the squared loss function 
Q(y, f) = (y - f(x))^2.    (2.4)
2.2.1 Empirical Risk Minimization 
Once the principles allowing us to evaluate the performance of a learning machine have been defined, Statistical Learning Theory reminds us that, in fact, the distribution P(x, y) of equation (2.2) is unknown, so that the only known input/output pairs are those of the given finite set of examples. The first thought is to approximate the theoretical risk functional by an empirical one, simply computed on the training examples as

R_{emp}(\alpha) = \frac{1}{L} \sum_{i=1}^{L} Q(y_i, f(x_i, \alpha)),    (2.5)
where L is the number of training samples. A minimization of this functional, the Empirical Risk Minimization, is then carried out in order to select the best set of parameters α. However, such a choice is strongly dependent on the examples provided to the learning machine for training. As discussed in more detail in section 2.3, it is possible to partially circumvent this drawback by using a cross-validation methodology or by splitting the initial dataset into two parts (use of an independent set of data). Additionally, in the same section it will be explained that, when aiming to evaluate the overall performance of the learning machine, yet another set of examples is required.
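As a small illustration, the empirical risk (2.5) under the 0-1 loss (2.3) reduces to the fraction of misclassified training samples; a sketch with a hypothetical decision function:

```python
import numpy as np

def empirical_risk_01(predict, X, y):
    """Empirical risk (2.5) with the 0-1 loss (2.3): the fraction of
    training samples that the decision function misclassifies."""
    y_pred = np.array([predict(x) for x in X])
    return np.mean(y_pred != y)

# Example with a trivial decision function f(x) = sign(x[0])
X = np.array([[0.3, 1.2], [-0.7, 0.1], [1.5, -0.4], [-0.2, -0.9]])
y = np.array([1, -1, 1, 1])
print(empirical_risk_01(lambda x: 1 if x[0] >= 0 else -1, X, y))  # 0.25
```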
2.2.2 Structural Risk Minimization 
In the theoretical framework of Statistical Learning Theory, with the purpose of considering the ability of a model to extend the learnt relationships to unobserved new data, the notion of Structural Risk Minimization is introduced.
Essentially, the idea is to place an upper bound on the expected risk (2.2) which combines the empirical risk and a confidence interval such that

R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\,(\log(2L/h) + 1) - \log(\eta/4)}{L}},    (2.6)

where L is the number of training samples and h is the so-called Vapnik-Chervonenkis dimension (VC-dimension) of the set of functions used [39]. The resulting inequality, which holds with probability 1 − η, is a particular bound valid only for the classification case.
The quantity h deserves some further explanation because it is one of the main concepts of Vapnik's theory. For a binary classification problem, h can be interpreted as the maximum number of samples for which a class-consistent partitioning can be achieved using the given set of functions. A two-dimensional data set consisting of 3 vectors can always be separated with a linear function, no matter what the labeling of the points is. A difficulty occurs if the samples to shatter become 4: a chessboard-like setting forbids any valid linear separation. Finally, we can state that linear decision functions in R^N, hyperplanes of the form f(x) = w · x + b, possess a VC-dimension of N + 1. In comparison, a polynomial function of degree 2 applied in R^2 has a VC-dimension of 4 and, as a borderline case, for the function f(x) = b sin(wx) this quantity is equal to infinity (high frequency for a large ‖w‖, allowing the separation of every possible configuration of points).
Looking at equation (2.6), it may be seen that the expected risk is minimized when the confidence interval, the second term on the right side of the inequality, is kept small by a low h/L ratio. By the mentioned inequality, a function with a large VC-dimension h which perfectly fits a small number of data points L will result in a large expected risk, since there is overfitting. Such a complex model will likely lead to a considerable generalization error. Figure 2.1 illustrates how the bound on the risk varies depending on model complexity.
Figure 2.1: Bound on the risk varying according to the confidence interval and the empirical risk associated with sets of models of increasing complexity. After [39].
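To make the role of the confidence interval tangible, a short sketch evaluating the second term of (2.6) for a fixed sample size and increasing VC-dimension (the numbers are purely illustrative):

```python
import numpy as np

def vc_confidence(h, L, eta=0.05):
    """Confidence term of the bound (2.6) for VC-dimension h,
    L training samples and confidence level 1 - eta."""
    return np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(eta / 4)) / L)

L = 500
for h in (5, 50, 250):
    print(h, round(vc_confidence(h, L), 3))
# The term grows with h/L: richer function sets pay a larger capacity penalty.
```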
To summarize, the Structural Risk Minimization principle provides a theoretical framework for achieving the optimal trade-off between the classification accuracy on training data and the capacity of the set of functions selected. Later on, in subsection 3.1.4 of chapter 3, we will have a look at the concrete means the SVM algorithm provides to handle this kind of issue.
The next section illustrates the general procedure adopted when using a supervised learning approach.
2.3 Model selection and model assessment 
The preceding sections have discussed how Statistical Learning Theory allows the evaluation of the performance of a model with respect to its complexity. When one concretely applies a supervised classification algorithm, there are several practical considerations that need to be respected in order to use the method properly.
First, the model selection step is crucial. The fact that the empirical error (training error) is computed on the training examples given to the learning machine should be taken into consideration when choosing the optimal parameters. A model that closely or perfectly fits noisy or non-representative training data (see the example of figure 2.2) is said to overfit (as opposed to a too simple model, which gives rise to the situation called underfitting). Overfitting results in a poor generalization ability of the system when dealing with new data. The tuning of the parameters defining the model must therefore be carried out on an independent data set (different from the training one). A set of labeled examples called the validation set is extracted from the original data and held separate from the training subset in order to compute the classification quality measures (validation error, etc.). Predictions of class memberships are performed on the validation set ignoring the actual known class labels, so that the agreement between the true and predicted class assignments can then be checked. An optimization process allows the user to determine the best parameters for the classification task.
Figure 2.2: Example of an overfitting situation for a binary classification problem. The green discriminating boundary perfectly separates the red and blue points by overfitting this training data. The classifier shown in black allows some training errors but will then be able to predict the class labels of a new set of data in a more robust way.
Another split of the data is mandatory if one desires to assess the generalization error of the selected model (model assessment). An independent test set should be used, whenever possible, to assess the true performance of the model. The performance is in this way estimated on independent data, reproducing the future behavior in a new situation. In fact, it is not fair to report the performance obtained on the previously used validation set as a measure of the model's success, because the learning machine is favorably biased towards this data (parameters perfectly tuned for this set) [17].
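A sketch of this model selection/model assessment workflow with scikit-learn, on synthetic data (the classifier, parameter grid and split proportions are illustrative choices, not those adopted later in the thesis):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# Hold out an independent test set for the final model assessment
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Model selection: tune the parameters by cross-validation on the
# remaining data (an internal validation procedure)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X_trainval, y_trainval)

# Model assessment: report the error on data never used for tuning
print(grid.best_params_, grid.score(X_test, y_test))
```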
Chapter 3 
Support Vector Machines for 
classification 
This chapter focuses on the learning machine that is at the core of almost every step of the analyses performed in this thesis. The system implementing the training of a supervised classifier in an efficient and robust way is Support Vector Machines (SVMs). Moreover, SVMs adhere to the guidelines provided by the Statistical Learning Theory discussed in section 2.2 of the previous chapter. Section 3.1 examines how and why a linear decision function can optimally be used as a foundation for the classification task when applied in the high-dimensional space induced by the kernel expansions delineated in section 3.2.
3.1 Large margin linear classifier 
3.1.1 Optimal separating hyperplanes 
When dealing with a problem where different objects have to be divided into two categories by placing a discriminating boundary, the most intuitive option is to draw a separating line. This is exactly the principle applied by SVMs.
More generally, in an N-dimensional space, the line becomes a hyperplane f(x) = w · x + b. The input vector x ∈ R^N is multiplied by a weight vector w which needs to be optimized along with the scalar b. In 2D (2 variables x_1 and x_2 describing the examples) the resulting function gives the equation of a surface with coordinates (f(x), x_1, x_2). If a horizontal plane is defined at the height of the level curve f(x) = 0, linearly separating the data points, and if these vectors are labeled following the sign of the function f(x), then they are classified in the positive class if lying above the f(x) = 0 surface or, otherwise, in the negative class (below the horizontal plane).
In order to construct an optimal hyperplane for a linearly separable case, let us define some strict conditions for the class-labeling task it carries out. For the training dataset, the values of the decision function f(x) should respect

w \cdot x_i + b \ge +1, \quad \text{if } y_i = +1,
w \cdot x_i + b \le -1, \quad \text{if } y_i = -1.    (3.1)

A positive sample (y_i = +1) should therefore be associated with a decision function value greater than or equal to +1 and, on the other hand, a negative input (y_i = −1) should be given a value less than or equal to −1. These two parts of equation (3.1) can be merged into

y_i (w \cdot x_i + b) \ge 1.    (3.2)
This formulation tells us that there should not be any training vector lying in the region where the hyperplane takes values between +1 and −1, and that only a few points will lie exactly on the level curves of height +1 or −1.
As can be seen in figure 3.1, the samples located on these level curves are called support vectors (SVs) and the region between the positive one (f(x) = +1) and the negative one (f(x) = −1) is referred to as the margin, of width ρ. Obviously, the decision boundary between the two classes becomes the hyperplane f(x) = w · x + b = 0.
Figure 3.1: Geometrical representation (2D) of the location of the SVs and the resulting class margins around the separating hyperplane w · x + b = 0. Following [19].
The goal of a classifier is to generalize the rules learned from the training data to situations where new instances have to be classified. Thus, if one tries to place the separating hyperplane in such a way that most of the new data points will be found on the correct side of the class boundary, the solution consists in looking for the largest possible margin.
The small-margin hyperplane visible on the left side of figure 3.2 correctly splits the training points (solid colored marks) of the two classes (circles vs. crosses), but when testing examples (grey marks) are introduced it reveals a poor generalization ability (many misclassification errors). On the contrary, the large margin obtained on the right side is robust and is more likely to classify the new samples correctly.
The width of the margin ρ can be easily computed as

\rho = \frac{w}{\|w\|} \cdot (x^{+} - x^{-}) = \frac{w \cdot x^{+} - w \cdot x^{-}}{\|w\|} = \frac{(1 - b) - (-1 - b)}{\|w\|} = \frac{2}{\|w\|},    (3.3)
Figure 3.2: The introduction of the testing samples (in grey) leads to many classification errors when the margin is not optimized (left figure). Modified after [19].
where w is the vector defining the hyperplane, x^{+} is one of the positive class SVs (contributing to the margin definition) and x^{-} is a negative class SV.
As shown by equation (3.3), the goal is to minimize ‖w‖. This intuitive minimization problem is theoretically justified by the insights of Statistical Learning Theory [39], which states that the complexity h of the set of functions is bounded by

h \le \min(R^2 \|w\|^2, N) + 1,    (3.4)

where R is the radius of the smallest sphere enclosing all the training vectors belonging to R^N. Consequently, a large margin, implying a small ‖w‖, helps keep the capacity of the model low and thus efficient.
3.1.2 The optimization problem 
In order to accomplish the training of the machine, we are faced with an optimization problem: SVMs provide an efficient algorithm to maximize the margin (3.3) whilst respecting the constraints (3.2).
Taking advantage of the concepts of the constrained optimization paradigm (Lagrangian theory), developed by Lagrange at the end of the 18th century, and of the extensions provided in the 1950s by Kuhn and Tucker, the following results can be derived. After having introduced the Lagrange multipliers α_i ≥ 0 associated with the training inputs x_i, one can express the so-called primal formulation of the optimization problem (primal Lagrangian) as

L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{L} \alpha_i.    (3.5)
Since we are looking for the maximal margin (minimal ‖w‖), the task consists in minimizing (3.5) with respect to w and b. Because of the convexity of the function L_P, this is done by searching for the values at which the associated derivatives (3.6) vanish:

\frac{\partial L_P(w, b, \alpha)}{\partial b} = 0, \qquad \frac{\partial L_P(w, b, \alpha)}{\partial w} = 0.    (3.6)
The resulting conditions,

\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{L} \alpha_i y_i x_i,    (3.7)

can be substituted into the primal form to get the dual formulation of the problem,

L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j.    (3.8)
At this point one finds the parameters α_i by maximizing (3.8) with respect to these same α_i, subject to the constraints \sum_{i=1}^{L} \alpha_i y_i = 0 and α_i ≥ 0, i = 1, . . . , L. The cited task actually consists of a quadratic programming problem (quadratic objective function with linear constraints). The solution of the optimization problem allows the final SVM decision function to be formulated as

f(x) = \sum_{i=1}^{L} y_i \alpha_i \, x \cdot x_i + b.    (3.9)
The predicted class label (+1 or −1) is simply assigned following the sign of (3.9) when dealing with a binary classification task. If the input vectors belong to more than 2 classes, the solution consists of combining several binary classifiers with either a one-vs-all or a one-vs-one approach. This multi-class extension of SVMs is accurately described in [34].
A comprehensive and clear description of the optimization problem and its resolution, summarized in this section, can be found in [34] and [9].
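As an illustration of these quantities, the following scikit-learn sketch trains a linear SVM on synthetic, nearly separable data and inspects the support vectors, the signed weights y_i α_i of equation (3.9) and the bias b (an off-the-shelf solver, not the implementation used in the thesis):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, almost linearly separable binary problem with labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

print(clf.support_vectors_.shape)   # only a few training points are SVs
print(clf.dual_coef_)               # the signed weights y_i * alpha_i of eq. (3.9)
print(clf.intercept_)               # the bias term b
# The sign of the decision function gives the predicted class label
print(np.sign(clf.decision_function(X[:3])))
```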
3.1.3 Support Vectors and their relevance 
The main outputs of the training procedure of the SVM are the values α_i. Looking at equation (3.9), one can see that these coefficients are the weights given to each training vector x_i. However, only a small proportion of them receive a non-zero α_i. Thus, only a subset of the initial training set is truly influential in the evaluation of the decision function.
These informative points are called support vectors and, conforming to the situation depicted in figure 3.1, they lie on the margin (positive or negative side according to their label y_i). For the support vectors, the inequality (3.2) turns into y_i(w · x_i + b) = 1. Given that such a subset is the only fraction of the data that participates in the prediction, the same result would be achieved if all the other points were withdrawn from the training set before training the system.
3.1.4 Soft margin adaptation 
In subsection 3.1.1, figure 3.1 shows a linearly separable situation where the two classes do not overlap: the training examples are described by inputs that can be partitioned by a hyperplane. Clearly this is an ideal situation one will rarely be dealing with. In reality, data are usually noisy, so that it is impossible to avoid training errors when drawing a separating line.
These considerations lead to a slightly different formulation of the large margin classifier, the soft margin classifier. The "hard" margins presented with (3.2) on page 13 are "softened" by means of the slack variables ξ_i. The intuition consists of letting noisy training samples (lying outside the class level curve +1 or −1) fulfill the requirements as

y_i (w \cdot x_i + b) \ge 1 - \xi_i.    (3.10)

In this way, positive (negative) vectors can be associated with a decision function value which does not have to be strictly larger (smaller) than +1 (−1). For example, a sample lying on the wrong side of the decision boundary w · x + b = 0 will be given a ξ_i > 1 so that it will then be treated as a coherent class member. Figure 3.3 shows for which inputs the slack variables have to be introduced.
Figure 3.3: Slack variables ξ_i are assigned to noisy samples lying outside their class margin. Following [19].
In order to keep a low empirical error (2.5) one should, of course, force the algorithm to assign non-zero ξ_i values to as few of the training samples as possible. Therefore, in the optimization process, the first term of the initial functional (3.5) that has to be minimized is substituted by

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i.    (3.11)

The left term in (3.11) is the one the procedure had to minimize for finding the largest possible hard margin. The added right term, which also has to be minimized, accounts for the number and severity of misclassification errors in the training set.
The weighting constant C (cost) allows the user to control this kind of error during the training phase and conveys the confidence the user has in the data. With a large value of C, implying the belief that the dataset is not noisy, every misclassified example is heavily penalized, leading to a very small training error. The drawback is that such a great importance conferred to the training data will give rise to the overfitting phenomenon, due to the complexity of the applied model. From this point of view, the inverse of C can then be interpreted as a regularization constant. Furthermore, the parameter C turns out to be, in the quadratic programming problem, the upper bound for the α_i, so that 0 ≤ α_i ≤ C, ∀i.
So, the minimization of the first term of (3.11) corresponds to lowering the upper bound on the VC-dimension (see equation (3.4)), which controls the confidence interval described in equation (2.6) on page 9. The second term of this functional mainly controls the empirical error which appears in (2.6). Finally, both terms contribute to keeping the expected risk low, since the second term of (3.11) also favors the use of a simple model (small h) if one chooses a low value of C.
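A brief scikit-learn sketch of this trade-off on synthetic noisy data (values purely illustrative): a small C tolerates more margin violations and retains more support vectors, while a large C drives the training error down at the risk of overfitting.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy two-class data with overlapping clusters and labels in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.5, size=(150, 2)), rng.normal(1, 1.5, size=(150, 2))])
y = np.array([-1] * 150 + [1] * 150)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    train_err = np.mean(clf.predict(X) != y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} SVs, "
          f"training error = {train_err:.2f}")
```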
3.2 Kernel expansion 
3.2.1 The principle 
Up to this point, we have seen how a linear decision function can be optimally applied to classify examples with binary labels. When dealing with challenging data sets where the input/output relationships are non-linear, we need a cleverer way to discriminate the two classes.
The key idea is to map the dataset into a space of higher dimension and then perform the well-known linear separation on the transformed data, rather than applying complex decision functions directly to the initial data set. This is possible since we have seen in equation (3.9) on page 15 that, for the linear case, the decision about the class membership of a new sample x depends only on a dot product between this input vector and all the training samples x_i. Thus, the intuition, called the kernel trick, is to substitute the dot product with a kernel function K(·, ·) involving the same two vectors, so that the final decision function changes to

f(x) = \sum_{i=1}^{L} y_i \alpha_i K(x, x_i) + b.    (3.12)

This is the final formulation of the decision function for a classification task carried out with SVMs.
The function K(·, ·), for simplicity referred to as the kernel, carries out the mapping to the higher-dimensional space, not directly by generating the longer coordinate vector out of the two samples, but in an implicit way. The result of the dot product involving the mapped vectors, φ(x) and φ(x_i), is equal to the output of the kernel computed with the low-dimensional vectors as inputs:

x \cdot x_i \;\mapsto\; \varphi(x) \cdot \varphi(x_i) = K(x, x_i).    (3.13)

Using the machine learning vocabulary, we refer to the original space as the input space, whilst we name the kernel-induced one the feature space.
3.2.2 A concrete example 
As a demonstration, one can compute the polynomial kernel of degree 2, defined as
K(x, xi) = (x · xi + 1)², for a pair of inputs belonging to R², u = (u1, u2) and v = (v1, v2).
One finds out that, as shown by equalities (3.14), the application of such a kernel on u
and v results in the same sum of terms that one would have obtained from a simple dot product
between two high-dimensional mappings of the original vectors. The mapping that we refer
to is the following:

φ(u) : (u1, u2) ↦ (u1², u2², √2 u1u2, √2 u1, √2 u2, 1),

resulting in a feature space of 6 dimensions, i.e. φ(u) ∈ R⁶.

K(u, v) = (u · v + 1)²    (3.14)
        = u1²v1² + u2²v2² + 2u1v1u2v2 + 2u1v1 + 2u2v2 + 1
        = (u1², u2², √2 u1u2, √2 u1, √2 u2, 1) · (v1², v2², √2 v1v2, √2 v1, √2 v2, 1)
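The identity (3.14) can also be verified numerically. The short Python sketch below (illustrative only) compares the degree-2 polynomial kernel with the explicit 6-dimensional mapping φ for two arbitrary vectors.

import numpy as np

def poly2_kernel(u, v):
    """Degree-2 polynomial kernel (u . v + 1)^2."""
    return (np.dot(u, v) + 1.0) ** 2

def phi(u):
    """Explicit 6-dimensional mapping associated with the degree-2 polynomial kernel."""
    u1, u2 = u
    return np.array([u1**2, u2**2, np.sqrt(2)*u1*u2, np.sqrt(2)*u1, np.sqrt(2)*u2, 1.0])

u, v = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
# Both expressions give the same value (42.25 for this pair of vectors).
assert np.isclose(poly2_kernel(u, v), np.dot(phi(u), phi(v)))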
3.2.3 Valid kernel functions 
The rapid developments in the field of kernel methods have brought a wide range of different 
kernel functions that can be successfully applied. However, it is important to recall that not 
every function involving two vectors constitutes a kernel. In fact, valid kernels have to fulfill 
the so-called Mercer’s conditions (see [9]). 
In a few words, these constraints must be met for a selected function K(x, xi) to act as 
a kernel associated with the desired feature space (output equal to the dot product of the 
mapped vectors). Strictly speaking, this means that the kernel matrix K = (K(xi, xj))_{i,j=1,...,n}
has to be symmetric and positive semidefinite (i.e. possess non-negative eigenvalues). Matrix K,
also known as the Gram matrix, has as elements the outputs of the kernel function for every 
pair of input vectors (xi, xj). 
Additionally, user-defined kernel functions can be created by multiplying or adding valid
kernels, since the resulting functions also respect Mercer’s conditions. If K1(·, ·) and K2(·, ·)
are kernels, then
• aK1(·, ·) + bK2(·, ·), for a, b ≥ 0
• K1(·, ·) K2(·, ·)
are valid kernels as well (proof available in [35]). These properties allow us to construct some 
composite kernels which will then be useful to improve classification performance (see [4]). 
Here we present a list of the most frequently used applicable kernel functions: 
• Linear kernel: 
K(x, xi) = x · xi (3.15) 
• Polynomial kernel: 
K(x, xi) = (x · xi + 1)p , p ∈ N (3.16) 
• Gaussian RBF kernel: 
K(x, xi) = e− 
(x−xi)2 
22 ,  ∈ R+ (3.17) 
The first item, the linear kernel, corresponds to the situation where the kernel trick has 
not been applied, while the second one illustrates the general form (degree p as option) of
the polynomial kernel brought into play in subsection 3.2.2. The last kernel mentioned, the 
Gaussian Radial Basis Function kernel, will be discussed in more detail in the next subsection. 
It is interesting to point out that the choice of the kernel type also allows the user to 
control the complexity of the model (bound on risk (2.6) presented on page 9) since the VC-dimension 
h also varies according to the feature space into which the inputs are mapped. In 
fact, the linear separation performed by the SVM algorithm is executed in the N-dimensional 
feature space resulting in a value of h = N + 1 (see subsection 2.2.2). 
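For reference, the kernels (3.15)–(3.17) can be implemented as follows; the last lines sketch a numerical check of Mercer’s conditions (symmetry and non-negative eigenvalues of the Gram matrix) on a random sample. This is an illustrative Python/NumPy sketch, not the code used in this work.

import numpy as np

def linear_kernel(x, xi):
    return np.dot(x, xi)

def polynomial_kernel(x, xi, p=2):
    return (np.dot(x, xi) + 1.0) ** p

def gaussian_rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

# Mercer check on a random sample: the Gram matrix must be symmetric and
# positive semidefinite (non-negative eigenvalues, up to numerical round-off).
X = np.random.RandomState(1).randn(20, 3)
K = np.array([[gaussian_rbf_kernel(a, b) for b in X] for a in X])
assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) > -1e-10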
3.2.4 Details on the Gaussian RBF kernel 
Among the available kernel functions, a user’s choice often falls on the well-known Gaussian 
RBF kernel due to the simple geometrical interpretation it offers. As one can see from 
formula (3.17), the numerator of the argument of the exponential function is nothing but a 
dissimilarity measure between vector x and vector xi. In fact, (x − xi)² = ‖x − xi‖² is the
squared Euclidean distance between the examples, computed in the input space.
By taking the exponential of its negative value, one assigns a large value only to close
samples. One observes an exponential decrease starting from a maximum of 1, the output
value obtained when evaluating two identical vectors. Since the outputs K(·, ·) appear in (3.12),
they weight the training samples xi in the sum over all the L labeled instances.
The labels yi (values of +1 or −1) associated with the inputs will then have different influences
on the final decision function yielding a class membership for the new data point x.
Moreover, the parameter σ, appearing in the denominator, controls the bandwidth of the
Gaussian surface centered on vector x, the object of the prediction. Figure 3.4 shows how
the weights vary according to the kernel width σ, illustrating the smoothing effect of a large
value. In fact, a small bandwidth lets only the training vectors xi close to x in the input
space contribute significantly to the final decision function.
Figure 3.4: The Gaussian RBF kernel function K(x, xi) with x = (0, 0) and xi = (xi,1, xi,2)
for a varying xi is plotted for bandwidths σ = 0.5 and σ = 1.
A peculiarity of this kernel is that, contrary to the other two presented here, the similarity
between the input vector x and the training inputs xi is measured as a Euclidean distance
and not in terms of an angle in the input space. The latter is the case when one evaluates
the dot product (linear or polynomial kernels), which can geometrically be interpreted through
the cosine of the angle between the two vectors.
3.3 Parameters tuning 
As stated in subsection 2.2.1, the parameters (also referred to as hyper-parameters) defining
a good model have to be chosen by assessing the quality of the predictions on an independent
data set.
Often, when approaching such a task, a cross-validation approach is chosen. This procedure,
named leave-k-out cross-validation, consists in training the model on all the points of the
training set except for a subset formed by k randomly chosen vectors. A prediction of the
output is then carried out for these points, allowing, in the case of classification, a comparison
with the true labels. The procedure is repeated until each training vector has been
provisionally removed from the main set (which is partitioned n/k times).
However, this procedure is computationally intensive when working with SVMs. In fact, the
classifier has to be retrained each time a new subset of points is left out. This makes the
approach poorly suited to large data sets.
Consequently, as pointed out in section 2.3, the initial training set may be split into two parts
so that a validation set can be used to evaluate the predictions based on the learned
input/output relationships. A popular validation set/training set partition is 25%/75% of
the original training data.
The hyper-parameters of an SVM model that have to be tuned are the cost C and the kernel
parameters (σ for the Gaussian RBF kernel, p for the polynomial kernel, etc.). Because no direct
analytic function links the variations of these parameters to the changes in the chosen
performance measure, a grid search approach must be adopted. In the case of the Gaussian RBF
kernel, this means that a measure like the classification error (the range of available performance
measures is described in section 3.4) is computed for a set of different values of C and
σ spanning a user-defined space. We then look for the values optimizing the classification
performance (lowest validation error, highest accuracy, etc.). A minimal validation error
should correspond to a low percentage of SVs. Indeed, too many SVs identified after the
training procedure is a warning sign of overfitting caused by a too complex model
(for example, a too small bandwidth σ).
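The grid search described above can be sketched as follows (Python with scikit-learn and hypothetical arrays X and y; not the actual code of this thesis). The training data are split 75%/25%, an RBF SVM is trained for each (C, σ) pair and the pair yielding the lowest validation error is retained; note that scikit-learn parameterizes the RBF kernel with gamma = 1/(2σ²).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical feature matrix and binary labels (+1/-1).
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = np.where(rng.randn(300) > 0, 1, -1)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

best = None
for C in [0.1, 1, 10, 100]:
    for sigma in [0.1, 0.5, 1, 2, 5]:
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X_tr, y_tr)
        val_error = 1.0 - clf.score(X_val, y_val)     # validation misclassification rate
        sv_fraction = len(clf.support_) / len(X_tr)   # many SVs may indicate overfitting
        if best is None or val_error < best[0]:
            best = (val_error, C, sigma, sv_fraction)

print("lowest validation error %.3f at C=%g, sigma=%g (SV fraction %.2f)" % best)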
In some particular cases where class counts are unbalanced (usually many more negative 
examples than positive ones), it is possible that the SVM decision function threshold f(x) = 0 
(see subsection 3.1.2) is not the optimal one. In such situations, a threshold tuning can be 
carried out as well. This results in a significant improvement of the classifier performance, 
in terms of the selected performance measure score. However, if satisfactory results are 
not obtained in this manner, an additional effort is required in order to deal with such a 
nonstandard situation. A well-suited procedure to apply in these cases is presented in [22]: 
the authors propose a modification of the cost function of the SVM.
                                 Predicted (Forecast)
                                 Class +1 (Yes)   Class −1 (No)       Row totals
Actual       Class +1 (Yes)      hits             misses              observed yes
(Observed)   Class −1 (No)       false alarms     correct negatives   observed no
             Column totals       forecast yes     forecast no         total

Table 3.1: Confusion matrix for binary predictions related to the forecasting of an event.
3.4 Binary classification quality measures 
In assessing the classification performance of a supervised learning model, particularly when 
dealing with a binary classification task, a broad range of categorical statistics is available 
(see [10]). This section will describe the main measures currently used for the case where 
model predictions are related to the forecasting of an event (occurrence vs. non-occurrence). 
Prediction results may be organized into the 2-by-2 confusion matrix illustrated by table 3.1.
Given that an event may either be observed or not, and then either forecast or not by 
the model, 4 possible situations can be encountered: the observed event can be correctly 
forecast (hit or true positive) or not detected (miss or false negative), while a non-event can 
be incorrectly forecast (false alarm or false positive) or correctly not notified (correct negative 
or true negative). 
The following ratios are then often used: 
True Positive rate (hit rate) = hits / (hits + misses)    (3.18)

False Positive rate (false alarm rate) = false alarms / (false alarms + correct negatives)    (3.19)
As overall model success measures we can find: 
Overall Accuracy = (hits + correct negatives) / total    (3.20)

Hanssen and Kuipers discriminant = TP rate − FP rate    (3.21)

Heidke Skill Score = (hits + correct negatives − exp. correct) / (total − exp. correct)    (3.22)

Bias = (hits + false alarms) / (hits + misses) = forecast yes / observed yes,    (3.23)
where “exp. correct” is the expected number of correct forecasts due to random chance.
This value is computed, under the assumption of independence between actual and predicted
classes, from the marginal totals as

exp. correct = (forecast yes · observed yes + forecast no · observed no) / total    (3.24)
The first measure, Overall Accuracy (OA, range: 0 → 1, perfect score: 1), reports the
fraction of correct predictions over the total number of points, suggesting whether the model’s
overall performance is reliable. It becomes a misleading statistic if correct negatives (many
non-events) are predominant, since classifying every instance into class −1 already leads to
good scores.
The Hanssen and Kuipers discriminant (HK, range: −1 → 1, perfect score: 1) subtracts the
false alarm rate from the hit rate, indicating the capacity of the forecasting system to
discriminate between events and non-events. When non-events are the norm, this measure is
very suitable because the number of false alarms has less influence on the assessment; a higher
importance is instead given to missed events (which appear in the denominator of (3.18)).
This is particularly relevant when the two types of errors have different costs (e.g. avalanche
forecasting), where false alarms are usually less damaging than misses.
The Heidke Skill Score (HSS, range: −∞ → 1, perfect score: 1) measures the fraction of
correct predictions after discounting those forecasts which would be correct due purely to
random chance.
The last measure, the bias (range: 0 → ∞, perfect score: 1), does not indicate classification
success as such, but informs about over- or under-forecasting, with values tending to ∞ and
0, respectively.
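The statistics (3.18)–(3.24) are straightforward to compute from the four cells of the confusion matrix, as in the following illustrative Python sketch:

def binary_scores(hits, misses, false_alarms, correct_negatives):
    """Compute the measures (3.18)-(3.24) from the confusion matrix cells."""
    total = hits + misses + false_alarms + correct_negatives
    observed_yes, observed_no = hits + misses, false_alarms + correct_negatives
    forecast_yes, forecast_no = hits + false_alarms, misses + correct_negatives

    tp_rate = hits / observed_yes                       # (3.18) hit rate
    fp_rate = false_alarms / observed_no                # (3.19) false alarm rate
    oa = (hits + correct_negatives) / total             # (3.20) Overall Accuracy
    hk = tp_rate - fp_rate                              # (3.21) Hanssen and Kuipers
    exp_correct = (forecast_yes * observed_yes +
                   forecast_no * observed_no) / total   # (3.24)
    hss = (hits + correct_negatives - exp_correct) / (total - exp_correct)  # (3.22)
    bias = forecast_yes / observed_yes                  # (3.23)
    return dict(TP_rate=tp_rate, FP_rate=fp_rate, OA=oa, HK=hk, HSS=hss, Bias=bias)

# Example: 30 hits, 10 misses, 50 false alarms, 910 correct negatives.
print(binary_scores(30, 10, 50, 910))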
When the selected classifier makes class membership decisions depending on scores that 
can be interpreted as the degree to which an example is reasonably a class member (SVMs, 
neural networks, etc.), some interesting graphs involving the cited performance measures can 
additionally be plotted. In fact, the binary classification is executed according to a defined
threshold t, resulting in a positive class label if the score is above the threshold (f(x) > t),
or in a negative one if the value is below it (f(x) < t).
The first insight is to graphically see how the model success measure changes when the
class boundary varies, usually by plotting the curve constructed from the points (t, measure).
Moreover, as thoroughly detailed in [12], a Receiver Operating Characteristics (ROC) curve
can be built. Such a plot is a 2-dimensional graph with the FP rate on the horizontal axis and
the TP rate on the vertical axis. It efficiently represents the tradeoff between the costs and benefits
of the actual classification. In these terms, if we compute the two mentioned rates for the 
classifications obtained with thresholds varying from their minimal to maximal values, we 
will be able to plot a point (FP rate,TP rate) associated with each selected threshold. 
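This construction can be sketched as follows (illustrative Python; the decision values and labels are hypothetical): the threshold t is swept over the observed scores and one (FP rate, TP rate) point is collected per threshold.

import numpy as np

def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs for thresholds swept over the decision values."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= t
        tp = np.sum(predicted_pos & (labels == 1))
        fp = np.sum(predicted_pos & (labels == -1))
        points.append((fp / np.sum(labels == -1), tp / np.sum(labels == 1)))
    return points

# scores: SVM decision values f(x); labels: true classes (+1/-1), both hypothetical.
rng = np.random.RandomState(0)
labels = np.where(rng.rand(200) < 0.3, 1, -1)
scores = labels * 0.8 + rng.randn(200)   # noisy scores correlated with the labels
print(roc_points(scores, labels)[:5])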
Figure 3.5 shows 2 possible curves in the ROC space. The curve labeled “B” is associated 
with a much better performing model compared to the model that produced curve “A”. The 
reason is that, no matter which threshold is retained at meaningful FP rates, the resulting 
curve is more to the “northwest”, meaning that classifier “B” is producing higher TP rates 
combined with lower FP rates than model “A”. As a matter of comparison, the line joining 
points (0, 0) and (1, 1) corresponds to the strategy of randomly guessing a class label for 
every given instance to classify (if one tries to get more hits by forecasting more positive 
labels it also increases the number of false alarms). 
When looking for the best possible classification, a point in the ROC space, we might take 
Hanssen and Kuipers discriminant as a model success measure since this statistic is nothing
Figure 3.5: ROC curves associated to 2 different classifiers (“A” and “B”). After [12]. 
but the difference between the vertical and horizontal axis coordinates, yielding the highest 
value for point (0, 1). However, when comparing two systems in an overall sense, the “area 
under the curve” measure is a better indicator of the average performance of the classifier 
over all possible threshold choices (see [12]).
Chapter 4

Extensions of the SVMs-based approach
4.1 Feature selection 
Feature selection methods provide the classifier with a smaller subset of variables created 
out of the initial set so that it can work in a lower dimensional input space with only the 
relevant features. This often causes an improvement of the classification accuracy since noisy 
and redundant features are filtered out. Moreover, the application of this kind of algorithm 
provides the analyst with meaningful information about the real influence or utility of each 
input feature used in the classification problem. 
4.1.1 Methods overview 
Many methods have been proposed to select the best features or to reduce the input space
dimensionality. They have been reviewed in [15]. The techniques can be divided into categories,
according to the manner in which they deal with the variables. 
Methods such as Principal Component Analysis linearly combine the original features 
to create new ones. The result is a set of uncorrelated orthogonal variables carrying a 
decreasing amount of information (variance). The user may then select only the largest 
variance components for the classification task, which aids in avoiding overfitting. However, 
no individual feature or features can be ignored since they are all included in the creation of 
the new set. 
The second category contains techniques that consider each initial feature independently, 
without caring about the mutual information between them. Feature ranking with correlation 
coefficients, a simple method described in [16], belongs to this kind of approach. 
Finally one finds the best performing methods, which take into account simultaneously 
all the input variables during the ranking/selection process. This simultaneous consideration 
of input variables results in a selection that is much more appropriate when the chosen 
classifier is a “multivariate” one (SVMs, Fisher’s linear discriminant, etc.). One such method 
is Recursive Feature Elimination, which is explored in more detail in the next subsection. 
4.1.2 SVM-Recursive Feature Elimination 
In [16] Guyon et al. discuss the use of feature ranking coefficients (provided by each discussed
method) as weights in the linear decision function f(x) = w · x + b, where w is the vector of
feature weights, x is the input and b is a bias value. The inverse reasoning can be applied
as well: the weights multiplying the related inputs can be used as coefficients reporting
the relevance of each feature. This consideration is exactly the motivation behind the
Recursive Feature Elimination (RFE) procedure combined with an SVM classifier. The RFE
technique belongs to the broader category of methods named wrappers (which select the best
features according to an assessment criterion related to the classifier), as opposed to those
named filters (which select the best features according to a criterion independent of the
classifier).
The RFE algorithm details differ when using a linear SVM or a non-linear one. In the
following part we first treat the linear case, while the generalization to an SVM classifier
using a kernel expansion is discussed as the last topic of this subsection.
The linearly separable case 
The algorithm for the linear case can be summarized as follows:

Inputs: training samples with known class labels (xi, yi)
repeat until every feature k has been removed
– train the linear SVM and compute the weighting vector w = Σi αi yi xi
  (one component per variable)
– obtain the ranking criterion for each feature k as ck = (wk)²
– find the feature with the lowest c value
– remove that feature’s values from the training data
– update the final ranking list
end repeat
Output: ranked features list (first removed → least relevant)
The interpretation of this procedure is that at every step of the algorithm, after having
trained the SVM, the least influential feature is removed. It is worth remarking that a ranking
list is already obtained after the first iteration by sorting the coefficients ck in decreasing
order. However, the interest of this feature selection method is that, by running the whole RFE
procedure, an optimal subset of complementary features is found, which may not be the most
individually relevant one [16].
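A minimal sketch of the linear SVM-RFE loop is given below (Python with scikit-learn, hypothetical data X and y; an illustration rather than the thesis implementation). For a linear SVM, scikit-learn exposes the weight vector w directly as the attribute coef_.

import numpy as np
from sklearn.svm import SVC

def linear_svm_rfe(X, y, C=1.0):
    """Return feature indices in removal order (first removed -> least relevant)."""
    remaining = list(range(X.shape[1]))
    removal_order = []
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()                 # w = sum_i alpha_i y_i x_i
        weakest = int(np.argmin(w ** 2))      # ranking criterion c_k = (w_k)^2
        removal_order.append(remaining.pop(weakest))
    return removal_order

# Hypothetical data: feature 0 is informative, the others are noise.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.where(X[:, 0] + 0.1 * rng.randn(200) > 0, 1, -1)
print(linear_svm_rfe(X, y))   # feature 0 should be removed last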
Generalization to the non-linear case 
When dealing with a non-linear SVM it is impossible to directly compute the components of 
vector w because the sample xi, included in a simple dot product in equation (3.9), becomes 
here, on the contrary, the input of the kernel function of (3.12). Therefore, the method 
consists of looking for the smallest change in the square of the length of vector w when 
removing feature k. This value, denoted W²(α), is not computed directly as the norm
of w, but as

W²(α) = ‖w‖² = Σ_{i,j} αi αj yi yj K(xi, xj) = αᵀ H α,    (4.1)

where the α’s (forming the column vector α) are the weights of each training point found after
the optimization task, K(xi, xj) is the kernel output (a scalar) reporting the similarity between
the training samples xi and xj, and H is the matrix with elements yi yj K(xi, xj), an
extension of the Gram matrix defined in subsection 3.2.3.
As proposed by Guyon et al., at each iteration, the feature to withdraw according to the
final ranking criterion is selected as

f = argmin_k | W²(α) − W²_(−k)(α) |,    (4.2)

where the notation (−k) denotes that the candidate feature k has not been included in the
computation of (4.1). Since the norm of the weighting vector w defines the margin (see
equation (3.3) on page 13), we select the variable whose removal least changes the distance
between the strict class boundaries f(x) = −1 and f(x) = +1.
For computational convenience, at every iteration, when the variable to remove is selected
from all those available, the α’s are left unchanged and only matrix H is recomputed, with
every candidate feature ignored in turn. Moreover, this matrix is computed using only the
support vectors, since only for these examples is α > 0.
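The ranking criterion (4.1)–(4.2) for the non-linear case can be sketched as follows (illustrative Python/NumPy; alpha_sv, y_sv and X_sv are assumed to be the weights, labels and inputs of the support vectors of a previously trained Gaussian RBF SVM).

import numpy as np

def rbf_kernel_matrix(X, sigma, skip=None):
    """Gram matrix of the Gaussian RBF kernel, optionally ignoring feature `skip`."""
    cols = [j for j in range(X.shape[1]) if j != skip]
    Xs = X[:, cols]
    sq = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def w_squared(alpha_sv, y_sv, X_sv, sigma, skip=None):
    """W^2(alpha) = alpha^T H alpha with H_ij = y_i y_j K(x_i, x_j), eq. (4.1)."""
    H = np.outer(y_sv, y_sv) * rbf_kernel_matrix(X_sv, sigma, skip)
    return alpha_sv @ H @ alpha_sv

def feature_to_remove(alpha_sv, y_sv, X_sv, sigma):
    """Eq. (4.2): the feature whose removal changes W^2(alpha) the least."""
    base = w_squared(alpha_sv, y_sv, X_sv, sigma)
    changes = [abs(base - w_squared(alpha_sv, y_sv, X_sv, sigma, skip=k))
               for k in range(X_sv.shape[1])]
    return int(np.argmin(changes))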
The adjustments needed to extend the binary SVM-RFE presented here to the multi-class
case can be found in [36].
4.2 Probabilistic SVM output interpretation 
4.2.1 Interpretations for decision support 
A good model should provide a decision support system with values that can be interpreted 
in a meaningful way by users, so that appropriate measures may be taken. The classifier 
presented in this chapter is constructed in such a way that the class membership for a new 
instance is chosen according to the values taken by the final decision function (3.12). These 
values can be suitably transformed by post-processing to yield an a posteriori probability
(turning a categorical forecast into a probabilistic one). Such probabilities are interpreted as
the class membership likelihood of a given example.
The method that endows an SVM model with a probabilistic output is presented in detail
by Platt in [27]. The following subsections only review the points of this theory which have
been used in this thesis.
4.2.2 The sigmoid transform 
Applied to such a case, Bayes’ rule allows us to write the posterior probability P(y = 1|f(x)) 
for a sample x to belong to class +1 as 
P(y = 1|f) = p(f|y = 1) P(y = 1) / p(f),    (4.3)

where f is the associated decision function value, p(f) = Σ_{l=−1,+1} p(f|y = l) P(y = l) is its
a priori probability, p(f|y = l) is the class-conditional probability of observing the value f and
P(y = l) is the prior probability of class l. All of these probabilities can be empirically computed
from histogram estimates of the class-conditional densities. This methodology is preferred
to a parametric fit of the latter because the popular Gaussian assumption is often violated.
If a scatterplot of P(y = 1|f) versus f is drawn, one obtains a graphical visualization 
(see figure 4.1) of the positive class membership probabilities conditional to each observed 
SVM output (decision function f). The goal is to fit an analytically described curve to these 
plotted points so that when dealing with a new value f, associated to a new sample, we will 
be able to predict its class +1 likelihood. In particular, it turns out that a sigmoid function 
of the form

P(y = 1|f) = 1 / (1 + exp(Af + B))    (4.4)

is, in most cases, very well suited for modeling such a relationship. A and B are the free
parameters to tune, with A ∈ R− (to ensure monotonicity) and B ∈ R.
Figure 4.1: In this example the plus signs indicating posterior class +1 probabilities are 
extremely well fitted by the tuned sigmoid function. Modified after [27]. 
4.2.3 Parameters tuning 
The method proposed in [27] consists of minimizing the negative log likelihood of the data 
set. For every vector its decision function fi is associated to its actual transformed class label
ti = (yi + 1)/2 (ti = 0 or 1). We thus aim at minimizing

−Σi [ ti log(pi) + (1 − ti) log(1 − pi) ],    (4.5)

where pi = 1 / (1 + exp(Afi + B)).
One can think of this minimization as a procedure that looks for the best function
(defined by parameters A and B) to model the posterior probability of a sample belonging to
its actual class, so that the value pi approaches 0 when it is a negative class point (ti = 0) and
approaches 1 when it is a positive class one (ti = 1). The optimization algorithm proposed
in the cited paper is derived from the well-known Levenberg-Marquardt algorithm.
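A minimal sketch of this fit is given below (illustrative Python using SciPy’s general-purpose minimizer rather than the Levenberg-Marquardt variant of [27]; f_val and y_val stand for hypothetical decision values and labels of a validation set).

import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f_val, y_val):
    """Fit A, B of P(y=1|f) = 1/(1+exp(A f + B)) by minimizing the NLL (4.5)."""
    t = (y_val + 1) / 2.0                      # transformed targets: 0 or 1

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f_val + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)       # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    res = minimize(nll, x0=np.array([-1.0, 0.0]))
    return res.x                               # fitted (A, B)

# Hypothetical validation decision values and labels.
rng = np.random.RandomState(0)
y_val = np.where(rng.rand(300) < 0.5, 1, -1)
f_val = y_val * 1.5 + rng.randn(300)
A, B = fit_platt_sigmoid(f_val, y_val)
print(A, B)   # A should come out negative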
Parameter A controls the slope of the sigmoid whilst B controls its location along the 
horizontal axis. If B is equal to 0, there will be an exact matching between the 0.5 posterior 
probability and the f = 0 decision function threshold (sign of f providing the vector labeling 
decision). 
Furthermore, it is important to note that in order to avoid overfitting, the tuning of the 
parameters should be carried out on a dataset other than the training set used to produce 
the predictions: the validation set. 
4.3 Active Learning with SVMs 
4.3.1 Principles 
In the domain of supervised learning, the interest in active learning techniques is due
to their ability to provide the classifier with a good, informative subset of training examples.
The goal is to let the machine learn the input/output relationships leading to a satisfactory
classification performance from the smallest possible number of labeled input vectors.
Initially, the classifier disposes of a Labeled set of training samples (xi, yi) referred to as DL.
At the same time, an Unlabeled set of examples (candidate samples ˜x whose class membership 
y is ignored) DUL is available. At each step of the active learning algorithm, the learning 
machine should then be able to select from DUL, without knowing the associated class labels, 
the data point ˆx whose addition to DL, after the identification of the true class label ˆy, will 
lead to the most significant improvement of classification performance (after retraining on 
DL ∪ (ˆx, ˆy)). 
Examples of applications of such data-driven techniques in the domain of environmental 
sciences can be found for the optimization of a monitoring network (soil pollution, radioactivity,
etc.) [20], for the reduction of effort in the collection of ground truth data in remote
sensing [37], etc. 
4.3.2 Overview of the existing techniques 
The active learning field has been developing rapidly in recent years, and new methods are
regularly proposed within the scientific community. In this section, the
main SVMs based algorithms are presented. They differ essentially in their sample selection 
criteria. 
The first approach, described in [26], is quite intuitive and consists in looking for the 
unlabeled examples belonging to DUL located in the proximity of the hyperplane separating 
the classes. The value of the SVM decision function f for every candidate x̃ is computed
using the current training set DL and then the one with the lowest value of |f(x)| or f(x)²,
denoted x̂, is added to DL. This vector is in fact very likely to become a Support Vector,
thus affecting the classification procedure. 
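This selection rule can be sketched as follows (illustrative Python with scikit-learn; the labeled set DL and the unlabeled set DUL are hypothetical).

import numpy as np
from sklearn.svm import SVC

def select_closest_to_boundary(clf, X_unlabeled):
    """Pick the unlabeled sample with the smallest |f(x)| (closest to the hyperplane)."""
    f = clf.decision_function(X_unlabeled)
    return int(np.argmin(np.abs(f)))

# Hypothetical sets D_L (labeled) and D_UL (unlabeled).
rng = np.random.RandomState(0)
X_L = rng.randn(100, 3)
y_L = np.where(X_L[:, 0] > 0, 1, -1)
X_UL = rng.randn(500, 3)

clf = SVC(kernel="rbf", C=10, gamma=0.5).fit(X_L, y_L)
idx = select_closest_to_boundary(clf, X_UL)
print("query sample index:", idx)   # this point would be labeled and added to D_L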
Entropy-based query by bagging, a method first proposed in [38] then comprehensively 
discussed in [37], takes advantage of the notion of entropy to search for the vectors of DUL 
whose class membership prediction is the most uncertain (i.e. located closest to the decision
boundary, where f(x) tends to 0). Such samples are very informative and will contribute
significantly to the setup of the SVM model. Several SVMs are trained on subsets obtained
by bootstrapping from DL and class labels for every candidate ˜x are predicted. The best 
candidate, ˆx, is selected as the one with the highest entropy computed for the resulting class 
membership probabilities. Other similar methods utilize entropy based indicators, such as 
the one proposed by Rajan et al. in [31] making use of the Kullback-Leibler divergence, to 
select the most valuable examples to include in the model. 
The last technique briefly illustrated here is thoroughly described in [19]. Kanevski et al. 
suggest the use of the following algorithm. Successively assign to the candidates ˜x the +1 
and −1 class labels y and independently add them to DL. The SVM is trained on the newly 
created set and the weights α+ and α−, received by the candidate for either labeling, are
stored. At this point, a sample importance measure is computed for each candidate xi as

0,                      if (αi+ = 0, αi− = C) or (αi+ = C, αi− = 0)
(αi+ + αi−) / (2C),     otherwise.    (4.6)
One can interpret the first case, where one of the two weights is null and the other is
equal to C, as the situation where not only does the example lie far from the margin region
−1 < f(x) < +1 but it also lies on the wrong side of the decision boundary (a misclassified,
atypical example) for one of the labelings. Hence, such vectors are not points of interest. In the
remaining cases we assign a relevance which is a C-scaled mean value of the α’s. This indicator
reports the actual average influence of a given sample in the weighted sum defining the SVM
output f(x).
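The importance measure (4.6) can be sketched as follows (illustrative Python; alpha_plus and alpha_minus stand for the weights obtained for a candidate after the two provisional labelings described above).

def candidate_importance(alpha_plus, alpha_minus, C, tol=1e-9):
    """Importance of a candidate sample following eq. (4.6)."""
    at_bound = lambda a, b: abs(a) < tol and abs(b - C) < tol
    if at_bound(alpha_plus, alpha_minus) or at_bound(alpha_minus, alpha_plus):
        return 0.0                 # far from the margin and misclassified for one labeling
    return (alpha_plus + alpha_minus) / (2.0 * C)

print(candidate_importance(0.0, 10.0, C=10.0))   # uninteresting candidate -> 0.0
print(candidate_importance(3.0, 7.0, C=10.0))    # informative candidate  -> 0.5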
Chapter 5

Avalanche forecasting as a spatio-temporal classification problem

5.1 Avalanche data from Scotland: the Lochaber region case study
The case study at the core of this thesis, illustrated in the next chapters, deals with
the integration of spatial information into a temporal avalanche forecasting model based on 
SVMs (see [29]) for the Lochaber region, Scotland (map in figure 5.1). Therefore, the goal is 
to produce spatially varying avalanche forecasts at the level of single avalanche paths existing 
in this renowned mountaineering area. The latter is one of the 5 ski venues in Scotland for 
which forecasting is carried out on a daily basis during the winter season. The region includes 
Ben Nevis, the highest mountain in the UK, with a summit at 1344 m above sea level.
Additional detailed information about the avalanche forecasting in Scotland can be found on 
the official website of the sportScotland Avalanche Information Service, www.sais.gov.uk. 
In the considered area, a nearest neighbor model called Cornice is used operationally 
to assist in forecasting the avalanche days [30]. This system has been briefly presented in 
subsection 1.2.2, where the prior work carried out on the Lochaber dataset has been reviewed. 
Current weather and snowpack conditions are described by a set of 9 variables that are 
measured or estimated by local forecasters on the slopes or that are registered by an automatic 
weather station (AWS). The list of the available variables is presented in table 5.1, along with 
the class/category each variable falls under. As described in [24], the 3
classes of factors influencing an avalanche release in order of decreasing influence are: Class 
I - stability factors, Class II - snowpack factors, and Class III - meteorological factors. 
None of these variables belong to the group of factors that is the most related to avalanches, 
Class I. Stability factors are measured in some cases via stability tests, ski-triggering, etc. 
by the avalanche experts when executing snow profiles at the pit site. However, it is
Figure 5.1: Northern part of UK: the Lochaber region is shown labeled with a red balloon. 
Variable            Class   Description
Snow index          III     Ordinal index of the precipitation as fresh snow on a day, assessed by the forecasters in the field
Rain at 900 m       III     Binary variable indicating if rain is falling at 900 m, the altitude of the AWS (“1” value, “0” otherwise)
Snow drift          II      Binary variable taking the value “1” when experts observe snow drifting during the observation period (“0” otherwise)
Air temp            III     Midday air temperature at the AWS measured in ◦C
Wind speed          III     24 hours vector mean speed from the AWS reported in m/s
Wind direction      III     24 hours vector mean wind direction from the AWS reported in ◦
Cloud cover         III     Cloud cover as a percentage of the sky
Foot penetration    II      Penetration of the foot in the snow measured in cm at the pit site by forecasters
Snow temperature    II      Snow temperature at a depth of 10 cm at the pit site measured in ◦C

Table 5.1: List of the 9 meteorological and snowpack variables recorded daily in the winter season.
difficult to include in the model the information from these tests in terms of variables that
can be automatically compared when building the model. The computation of Euclidean 
distances between all the possible pairs of vectors is not possible if the involved variables are 
non-numerical or if for some days data are missing. 
The pit site where the different snowpack factors are measured is chosen by the forecasters
at a different location every day, based on their experience. Such a testing place,
usually located on the critical slopes of one of the gullies, is assumed to be representative of 
the average conditions that can be found in the entire region. 
The 9 variables have been recorded every day of the winter season (roughly 4 months per 
year) since 1991, as well as the avalanche events observed in the region. Such occurrences 
are documented with a description of the release type (natural or triggered by mountaineers, 
dry or wet snow, cornice triggered, etc.), notes about injuries or specific conditions related to 
the event and spatial information about location (easting and northing), altitude, slope and
aspect. Only the location coordinates were known for every case, so in order to uniformly 
characterize the events, we had to resort to a Digital Elevation Model (DEM) with 10 meters 
resolution to obtain the complete set of spatial inputs needed (elevation, slope, aspect). This 
procedure was also adopted because of typing errors and some subjective, imprecise judgments
in the records.
The hillshade showing the relief is derived from the elevation grid and is presented in figure
5.2 along with the locations of the recorded avalanche events falling on the DEM surface. Data
from the 1991–2007 period were available for this study: information about 688 avalanche
events that occurred in 47 different avalanche paths was used. The subset of avalanche paths
located in the area covered by the DEM grid (40 gullies) is reported in figure 5.3.
Figure 5.2: Locations of the 593 documented avalanche cases that occurred in the DEM-covered area.
Out of the 593 avalanche events falling on the DEM surface, 224 (37.8%) have been 
observed in the Ben Nevis sector (cluster of points in the south-western part of map 5.2), 
347 (58.5%) occurred in the Aonach Mor range (eastern part of the Lochaber region) while 
22 of them (3.7%) took place on the slopes of the range of Carn Morg Dearg summit (center 
of the map). 
It is, however, essential to remark that these events are mainly documented by avalanche
experts of the region during their daily outdoor activity and by climbers or mountaineers
assumed to be reliable witnesses of the release (online recording forms on www.sais.gov.uk).
Therefore, when working with these avalanche reports one has to keep in mind that the list
of events is by no means comprehensive. On bad visibility days, spotting
a release is difficult and since snowfalls are quite often related to such conditions, it is very 
likely that many avalanches have taken place without being observed, either by forecasters 
or by mountaineers.
Figure 5.3: Locations of the 40 gullies (avalanche paths) in the DEM-covered area.
Furthermore, the reporting of avalanches is done much more thoroughly in the Aonach 
Mor range because of the easy accessibility of the slopes. In fact there are several ski runs 
with associated lifts belonging to the Nevis Range resort. 
5.2 Set up of the spatio-temporal classification problem 
As we saw in the introductory section 1.2, the temporal forecasting carried out in the 
Lochaber region is executed by considering days with observed avalanche activity as positive
examples (class +1) and safe days with no observed avalanches as negative samples
(class −1). Thus, the instances being classified are the days of the winter season. 
The described set up has then to be extended to the case where one is interested not 
only in correctly forecasting avalanche days but also in predicting the locations of the events. 
Initially, we assign to the positive class the vectors characterizing (spatial location, weather 
and snowpack conditions) the observed avalanche events. Details about how the features 
were built are presented in the next section 5.3. 
To complete the binary classification problem, a negative class is needed as well. The 
chosen intuitive approach is to let the class with the −1 label be composed of all the 47 gullies
(actually the 40 covered by DEM information) that could give rise to an avalanche release
on a safe day. Therefore, for every day of the winter season when all the variables listed in 
section 5.1 could be measured and the visibility allowed avalanche observations but no event 
was actually documented in the region, we computed all the features describing the local 
conditions at each avalanche path. These spatially variable features were then combined 
with the global ones related to the current safe day, concerning the whole mountain domain.
In this way, a broad list of negative instances was produced to be given to the learning 
machine. The purpose of this was to let the classifier train on a set of critical situations 
which were close to the “safe/event” decision boundary and likely to cross it under slightly 
different weather conditions. 
Finally, this results in a binary classification problem where the vectors to discriminate are 
daily avalanche activities of the dangerous paths (gullies) located in the forecast area. This 
becomes a very unbalanced classification task because, as shown by table 5.2, the negative 
inputs our model will be given considerably outnumber the positive ones (by a factor of 
approximately 68 to 1). 
Class +1: Avalanche events     667
Class −1: Safe gullies         45240 (= 40 gullies · 1131 safe days)

Table 5.2: Dataset positive and negative classes resulting in an unbalanced classification
problem.
5.3 Choice and conception of the input features 
In order to get the desired spatio-temporal forecast, the series of daily measurements of
meteorological conditions related to snowpack stability described in section 5.1 has to be
combined in a sensible way with the spatial description of the terrain morphology available
via the DEM of the region under study. The latter, with its relatively high resolution of 10
meters, provides detailed information about the elevation, slope and aspect of the paths where
the avalanche events could happen. This results in a “spatialized” set of local condition
features whose values change according to the location of the avalanche release point.
Additionally, for some of the temporal variables, information about avalanching conditions
recorded in the previous days was also included (2 preceding days at most, because of the
rapidly changing weather conditions). Therefore, the features created, also taking into
account the advice of the avalanche experts of the Lochaber region, are designed to account
for the relevant factors influencing avalanche activity.
The final input vector comprised 39 features: 22 spatio-temporal features (describing local
conditions at a given gully or at the release zone) and 17 temporal features with global
validity (the same for all gullies). The complete list, with a brief description of the meaning
of each variable, is presented in table 5.3. For a subset of features (names tagged with *),
additional details about how a given feature has been created are provided hereafter.
The first type of variables requiring some further explanation are the ones involving the
sine and cosine transforms. These features take as input either the wind direction or the
aspect. Since these kinds of variables report a direction measured in degrees ranging from 0
to 360, clockwise starting from north, it is clear that they cannot be compared directly by
a subtraction when looking for dissimilarities (e.g. with the Gaussian RBF kernel). For example,
two slopes with very low and very large values will both be north-facing slopes. We get around
this peculiarity by taking a sine transform which will project the directions on the “horizontal”
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci

More Related Content

Similar to MasterThesisFinal_09_01_2009_GionaMatasci

Innovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based PersonInnovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based Person
Austin Jensen
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
Pedro Ernesto Alonso
 
Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...
butest
 
Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]
Joachim Nkendeys
 
CGuerreroReport_IRPI
CGuerreroReport_IRPICGuerreroReport_IRPI
Master In Information And Communication Technology.doc
Master In Information And Communication Technology.docMaster In Information And Communication Technology.doc
Master In Information And Communication Technology.doc
Dịch vụ viết đề tài trọn gói 0934.573.149
 
MSc_thesis_OlegZero
MSc_thesis_OlegZeroMSc_thesis_OlegZero
MSc_thesis_OlegZero
Oleg Żero
 
SeanLawlor_Masters_Thesis
SeanLawlor_Masters_ThesisSeanLawlor_Masters_Thesis
SeanLawlor_Masters_Thesis
snowboardfreak63
 
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping systemEllum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Cameron Ellum
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
Lars Brusletto
 
eur22904en.pdf
eur22904en.pdfeur22904en.pdf
eur22904en.pdf
Carina Lifschitz
 
Marshall-MScThesis-2001
Marshall-MScThesis-2001Marshall-MScThesis-2001
Marshall-MScThesis-2001
Joshua Marshall
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
Sumon Sdb
 
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODSNOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
Canh Le
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
HamdaAnees
 
DISS2013
DISS2013DISS2013
Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015
Shahed Anwar
 
outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
ssusere02009
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
Felipe Diniz
 
PhD dissertation
PhD dissertationPhD dissertation
PhD dissertation
Alexandre Colmant
 

Similar to MasterThesisFinal_09_01_2009_GionaMatasci (20)

Innovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based PersonInnovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based Person
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...
 
Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]
 
CGuerreroReport_IRPI
CGuerreroReport_IRPICGuerreroReport_IRPI
CGuerreroReport_IRPI
 
Master In Information And Communication Technology.doc
Master In Information And Communication Technology.docMaster In Information And Communication Technology.doc
Master In Information And Communication Technology.doc
 
MSc_thesis_OlegZero
MSc_thesis_OlegZeroMSc_thesis_OlegZero
MSc_thesis_OlegZero
 
SeanLawlor_Masters_Thesis
SeanLawlor_Masters_ThesisSeanLawlor_Masters_Thesis
SeanLawlor_Masters_Thesis
 
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping systemEllum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping system
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
 
eur22904en.pdf
eur22904en.pdfeur22904en.pdf
eur22904en.pdf
 
Marshall-MScThesis-2001
Marshall-MScThesis-2001Marshall-MScThesis-2001
Marshall-MScThesis-2001
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
 
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODSNOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
DISS2013
DISS2013DISS2013
DISS2013
 
Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015
 
outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
 
PhD dissertation
PhD dissertationPhD dissertation
PhD dissertation
 

MasterThesisFinal_09_01_2009_GionaMatasci

  • 1. Faculty of Geosciences and Environment Support Vector Machines for Spatio-Temporal Avalanche Forecasting Giona Matasci Master of Science in Environmental Geosciences Supervisors: Experts: Prof. Mikhail Kanevski Dr. Ross Purves Dr. Alexei Pozdnoukhov Devis Tuia January 2009
  • 2.
  • 3. Title page image: Aonach Mor cornices, source: saislochaber.blogspot.com
  • 4.
  • 5. i Abstract Statistically based methods for avalanche forecasting have been widely developed in many regions subject to this kind of natural hazard to detect avalanche days. Such techniques are often based on simple supervised classification methods like Nearest Neighbors and only focus on the temporal component of the avalanche activity. The purpose of this Master thesis is to build a reliable spatio-temporal forecasting model that is able to efficiently integrate spatial information about avalanche events. The application of machine learning algorithms for patter recognition, namely Support Vector Machines, is demonstrated with a case study on a dataset from Lochaber, Scotland, UK. Encouraging results were obtained in this extension of the usual forecasting procedure. The meteorological and snowpack factors globally describing avalanche likelihood in the mountain area have been combined with spatial features (issued from a Digital Elevation Model) related to the avalanche paths where the events have been observed. Hence, thanks to a huge database consisting of 17 years of daily condition observations matched with release occurrences, we could develop an excellent decision-support tool to assess the avalanche danger with a considerable spatial resolution (gullies, particular slopes, etc.). Interesting results, expressed in terms of confusion matrices related to the predictions on a test dataset (forecasts of gullies avalanche activity) as well as avalanche danger maps, are presented in this research report. Besides, the behavior of the model in discriminating safe/risky situations when dealing with critical changing conditions affecting the snowpack is proven to be consistent after a perceptive validation based on the analysis of some observed cases (a specified avalanche path on a given day). Moreover, the use of SVMs auxiliary techniques allowed to automatically highlight the most meaningful features to include in sta-tistical models aimed at successfully predicting avalanche releases in time and space. Finally, always taking the same state-of-the-art learning machine as starting point, elements of the sensitivity of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided. Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal ap-proach, machine learning, supervised classification, kernel methods, Support Vector Ma-chines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
  • 6. ii R´esum´e Les m´ethodes statistiques de pr´evision d’avalanches ont ´et´e largement d´evelopp´ees dans de nombreuses r´egions sujettes `a ce type de danger naturel. Ces techniques sont souvent fond´ees sur de simples m´ethodes de classification supervis´ee comme celles des Plus Proches Voisins (Nearest Neighbors) et se concentrent seulement sur la composante temporelle du danger d’avalanches. Le but de ce travail de Master est de construire un fiable mod`ele de pr´ediction au niveau spatio-temporel capable ainsi d’int´egrer efficacement des informations spatiales sur les ´episodes d’avalanche. L’application d’algorithmes d’apprentissage automatique (machine learning) pour la reconnaissance des formes, `a savoir celui des S´eparateurs `a Vaste Marge (Support Vector Machines), comme il a ´et´e d´emontr´e avec un cas d’´etude concernent la r´egion de Lochaber en ´ Ecosse, Royaume-Uni, a r´ev´el´e des r´esultats encourageants dans cette extension des proc´edures habituelles de pr´evision. Les facteurs m´et´eorologiques et ceux li´es au manteau neigeux d´ecrivant globalement les conditions d’avalanche ont ´et´e combin´es avec des informations spatiales (sorties d’un Mod`ele Num´erique de Terrain) li´es aux couloirs d’avalanche o`u les ´ev´enements ont ´et´e observ´es. Ainsi, grˆace `a une vaste base de donn´ees constitu´ee de 17 ann´ees d’observations quotidiennes des situations d’avalanche et d´eclenchements associ´ees, nous avons pu obtenir un excellent outil d’aide `a la d´ecision pour ´evaluer le danger d’avalanche avec une bonne r´esolution spatiale (ravines, types de pentes sp´ecifiques, etc.). Des int´eressants r´esultats en termes de matrices de confusion li´es aux pr´edictions sur un ensemble de donn´ees de test (pr´evisions de l’activit´e avalancheuse des diff´erents couloirs), ainsi que des cartes de danger d’avalanche sont pr´esent´es dans ce rapport. En outre, le comportement du mod`ele lors de la discrimination des situations sˆures de celles `a risque, dans le cadre d’une ´evolution critique des conditions affectant le manteau neigeux, s’est av´er´e ˆetre tr`es satisfaisant. Cela apr`es une validation perceptive bas´ee sur l’´etude de cas r´eellement observ´es (un couloir d’avalanche bien d´efini en un jour donn´e). En outre, le recours `a des techniques auxiliaires li´ees aux SVMs a permis de mettre en ´evidence automatiquement quelles sont les variables les plus importantes `a inclure dans les mod`eles statistiques visant `a pr´edire avec succ`es les avalanches dans le temps et dans l’espace. Enfin, toujours en utilisant la mˆeme performante m´ethode d’apprentissage supervis´e comme point de d´epart, des ´el´ements sur la sensibilit´e du mod`ele et des suggestions concernant une ´eventuelle am´elioration de la proc´edure de contrˆole des avalanches sont ´egalement fournis. Mots cl´es: pr´evision d’avalanches statistique, dangers naturels, approche spatiale et temporelle, apprentissage automatique, classification supervis´ee, m´ethodes `a noyaux, S´eparateurs `a Vaste Marge, Plus Proches Voisins, s´election des variables, apprentissage actif, analyse de la sensibilit´e, couloir d’avalanche, cartographie SIG, r´egion de Lochaber
  • 7. iii Acknowledgments First and foremost, I am grateful for the advice and support of both my supervisors during the whole Master program. I thank Prof. Mikhail Kanevski for having introduced me to the field of machine learn-ing as well as its applications to environmental sciences and for his interest in my research. I would like to thank Dr. Alexei Pozdnoukhov, first, for his huge availability and patience when supervising me, then, for having guided me with throughout this thesis with construc-tive suggestions about the topics to focus on. His great help when dealing either with the theoretical aspects of the methods used or with their concrete implementation will not be forgotten. Thank you Alexei! Dr. Ross Purves is acknowledged for the interesting discussions about avalanche forecasting in Scotland and for the useful hints provided. Moreover, I also appreciated a lot the aid and ideas given to me by Devis, Loris, Fr´ed and the rest of the team of the geomatics group at IGAR during the work for my Master thesis. A big and deep “grazie” is addressed to my family, in particular to my parents Franca and Sandro, for the support they provided me during these years spent at the university in Lausanne. All my friends scattered in Switzerland as well as the “sp´ecialisation 2” crew of the Master deserve gratitude for the funny moments spent together during this period. Last but not least, I am grateful to the “US” relatives, namely Louis and Caroline, for proofreading the English. ...and all those I forgot, thank you!
Contents

1 Introduction
  1.1 Objectives and motivation
  1.2 Prior work on data-driven statistical avalanche forecasting
    1.2.1 Overview
    1.2.2 Prior work on avalanche forecasting in the Lochaber region

2 Machine Learning
  2.1 Supervised learning vs. unsupervised learning
    2.1.1 Nearest Neighbors for classification
  2.2 Statistical Learning Theory
    2.2.1 Empirical Risk Minimization
    2.2.2 Structural Risk Minimization
  2.3 Model selection and model assessment

3 Support Vector Machines for classification
  3.1 Large margin linear classifier
    3.1.1 Optimal separating hyperplanes
    3.1.2 The optimization problem
    3.1.3 Support Vectors and their relevance
    3.1.4 Soft margin adaptation
  3.2 Kernel expansion
    3.2.1 The principle
    3.2.2 A concrete example
    3.2.3 Valid kernel functions
    3.2.4 Details on the Gaussian RBF kernel
  3.3 Parameters tuning
  3.4 Binary classification quality measures

4 Extensions of the SVMs-based approach
  4.1 Feature selection
    4.1.1 Methods overview
    4.1.2 SVM-Recursive Feature Elimination
  4.2 Probabilistic SVM output interpretation
    4.2.1 Interpretations for decision support
    4.2.2 The sigmoid transform
    4.2.3 Parameters tuning
  4.3 Active Learning with SVMs
    4.3.1 Principles
    4.3.2 Overview of the existing techniques

5 Avalanche forecasting as a spatio-temporal classification problem
  5.1 Avalanche data from Scotland: the Lochaber region case study
  5.2 Set up of the spatio-temporal classification problem
  5.3 Choice and conception of the input features

6 Prediction of avalanche activity at individual paths
  6.1 SVM training and parameters tuning
    6.1.1 Preprocessing
    6.1.2 Model optimization
  6.2 Predictions for years 2006-2007
    6.2.1 Results
    6.2.2 Comments and observations

7 Avalanche danger mapping
  7.1 Avalanche danger assessment: probabilistic SVM output tuning
  7.2 Mapping on the prediction grid
  7.3 Gradient mapping

8 Extended analysis of avalanche data with SVMs-related methods
  8.1 Relevant features choice: RFE
    8.1.1 Set up of the automatic procedure
    8.1.2 Results
    8.1.3 Interpretation
  8.2 Model behavior under changing conditions
    8.2.1 Motivation
    8.2.2 Methodology
    8.2.3 Results and interpretation
  8.3 Active Learning as an exploratory tool in avalanche monitoring
    8.3.1 Methodology
    8.3.2 Results

9 Conclusions
  9.1 Main achievements
  9.2 Further work on this topic

A European Danger Scale
B Avalanche danger maps
C MATLAB code: gradient mapping
Chapter 1
Introduction

1.1 Objectives and motivation

The machine learning domain, presented in chapter 2, provides many scientific research fields, especially in the last few years, with a solid framework based on a wide variety of techniques aimed at the analysis of datasets of increasing complexity and size.

Particularly, the environmental sciences area appears to be one of the well-matched subjects where such methods can be applied. In fact, among the broad variety of subfields related to geosciences, the latest progress in the automatic extraction of dependencies from data has found a valuable application in the forecasting of natural hazards, a theme frequently discussed during the attended Master program. Predictive models founded on concepts issued from machine learning are robust and very well suited for operational danger assessment purposes. From this point of view, the topic of avalanche forecasting shows significant potential for promising developments. The statistical approach frequently used to evaluate the likelihood of snow releases on the slopes of a mountain (see the prior work reported in section 1.2) can be improved to obtain an extended and enhanced decision-support system helping avalanche forecasters in their daily job.

However, the main purpose of this work is to explore the possible applications of several machine learning techniques in this research field, without focusing particularly on the issues affecting operational aspects of forecasting. The reasons behind such an approach are mainly related to the fact that studies joining these two domains are in their early stages, and to the realization that my knowledge of the specificities of the avalanche forecasting process is not adequate compared to that of forecasters with years of experience.

Nonetheless, the scope of this work is to build a reliable predictive model aimed at giving an efficient spatial extension of the forecasting systems originally designed to produce predictions about global avalanche activity over a whole region. Therefore, the morphological characteristics of the mountain range terrain affecting local scale weather and snowpack conditions will be taken into account by the presented learning machine.

The core of the analysis is centered on the well-known supervised classification method named Support Vector Machines (SVMs). This product of Statistical Learning Theory will be discussed in chapter 3. The performance of such a classifier when dealing with high-dimensional
data will allow the incorporation of a wide range of features describing avalanching conditions at the level of single avalanche paths. The classification problem will be set up by matching these variables with the related actual activity of a given gully, giving rise either to an avalanche event or to a safe situation. This spatio-temporal approach to avalanche forecasting is described in chapter 5, while the results in terms of the classification quality of the predictions for the 2006 and 2007 winter seasons are reported in chapter 6.

While focusing on SVMs as the main root of the methodological part of the work, the objectives of the research also consist in developing some tools, based on the classical machine learning/SVMs data-driven approaches described in chapter 4, used to highlight some properties of the studied avalanche hazard by taking into account the spatial variation of the phenomenon. The feasibility of a mapping of the avalanche danger over the region under study will be considered in chapter 7. Then, we attempt to identify the most useful features to involve in the classification task by assessing their real influence on the decisions taken by the model and on the evolution of the avalanche danger. Next, we investigate the actual sensitivity of the model to changing meteorological and snowpack conditions. Furthermore, some suggestions are given for the possible optimization of the information gathering procedure through improvements in the avalanche monitoring task. All these topics will be covered in chapter 8.

This thesis extends the previous work on this topic (see [29]) carried out by Dr. Alexei Pozdnoukhov during his post-doctoral fellowship at the Institute of Geomatics and Analysis of Risk (IGAR) of the University of Lausanne (information about the main research achievements on www.geokernels.org). The case study that will be treated concerns the region called Lochaber, located in the northern part of Scotland, UK, which is subject to numerous avalanche events during the winter season. Avalanche data collected on the slopes of these mountain ranges were available because of the previous collaboration between IGAR and the sportScotland Avalanche Information Service (www.sais.gov.uk), thanks to the contribution of Dr. Ross Purves.

1.2 Prior work on data-driven statistical avalanche forecasting

1.2.1 Overview

Avalanche forecasting is a crucial task for many winter resorts where many skiers, mountaineers and climbers are present every day. The procedure, which results in a report of avalanche conditions with associated danger, is carried out manually by the forecasters of the region. These experts are in the field every day to understand the evolution of the different factors affecting avalanche releases. Information about snowpack conditions and stability, weather parameters and actual avalanche activity is collected by the observers on a daily basis.

Nevertheless, in some skiing venues, numerical models are available to support the decisions taken based on the experience of the forecasters. Some physical models exist to aid in the assessment of snowpack evolution (see [1] for the case of Switzerland) but, generally, statistically based forecasting systems are much more commonly used. These
models are devoted to the prediction of current avalanche activity by looking for similarities with conditions influencing releases recorded in the past (meteorological and snowpack factors, essentially).

The statistical models currently used operationally or tested on real avalanche data produce temporal forecasts about global avalanche activity in a given region on a given day. Avalanche days and safe days are discriminated using several different statistically based techniques belonging to the supervised learning category (pattern recognition). These methods include discriminant analysis [13], regression trees and Nearest Neighbors [3].

The last technique mentioned is widely applied for operational forecasting in many different countries. For example, in Switzerland the NXD system (NXD2000 and NXD-REG, described in [14] and [2]), developed by the Swiss Federal Institute for Snow and Avalanche Research (SLF), is used at a local and regional scale to help experts produce final avalanche danger reports. These specialists receive as model output the 10 most similar days (included in the database of past observed conditions) to the current day's situation. By checking under which conditions and in which locations avalanches have been observed on these days, they are given concrete, helpful information to use in assessing the actual avalanche danger. The next subsection will illustrate the use of these nearest neighbors methods in Scotland.

1.2.2 Prior work on avalanche forecasting in the Lochaber region

The case study that will be discussed throughout this thesis concerns avalanche forecasting, namely forecasting that includes the spatial component of the avalanche activity, in the Lochaber area, Scotland, United Kingdom. In this introductory part of the work I will present a short survey of work done in this field using the same avalanche data.

Nearest Neighbors model Cornice

Purves et al. in [30] describe the Nearest Neighbors model developed for the operational forecasting of avalanche activity in the Scottish mountainous region under study. In conjunction with local avalanche forecasters, the scientists involved in this project implemented a decision-support system called Cornice, which provides useful information about past avalanching conditions, helpful in producing a reliable hazard report. The forecasts are made available in the afternoon (around 3 pm) and include a description of the situation experienced during the day as well as the expected development of avalanche activity over the next 24 hours.

The model takes as inputs different meteorological and snowpack variables influencing the release of avalanches in the region (a list of the available variables is given in table 5.1 in section 5.1). A historical database starting in 1991 is then searched. The outputs consist of the values taken by the same input variables during the 10 most similar recorded days (using the Euclidean distance of equation (2.1) as a dissimilarity measure). Additionally, the spatial locations of the documented avalanche events occurring during these days are also shown on a geo-referenced map. Hence, both the causes, in terms of weather/snowpack conditions, and the consequences, in terms of possible avalanche events, are available to the forecasters.
The model developers did not use subjective weighting of the inputs based on forecasters' experience, but instead chose to implement an automated procedure to find the optimal weights. The optimization of the variables' relevance has been carried out by means of genetic algorithms using several fitness metrics to evaluate the ability of different sets of weights to correctly forecast avalanche and non-avalanche days. For both the optimization of the parameters and the verification (testing) of the model, on a given day, a forecast of avalanche activity is produced if 3 or more of the 10 nearest neighbors were avalanche days. If this threshold is not reached, the day under examination is forecast as safe.

The batch testing of the model (assessing the generalization error by cross-validation) has been carried out on 1323 days (actually 1005, because of no-visibility days), covering the years from 1991 to 2002, in order to evaluate the agreement of the model forecasts with the observations. The results can be summarized with binary confusion matrices (contingency tables) for which several categorical statistics can be computed (see section 3.4). The best prediction performances were obtained with an optimization via either the Hanssen and Kuipers discriminant or the Unweighted average accuracy, leading to an Overall Accuracy of 0.83 and to a Hanssen and Kuipers discriminant value of 0.61. The models correctly forecast slightly more than 200 avalanche days, with only approximately 60 misses and 115 false alarms.

The Cornice application produced quantitative results considered very encouraging by the authors. However, its main utility is clearly recognized as a support for the forecasters in the information gathering and hypothesis testing process allowing avalanche danger assessment.

Support Vector Machines model

This temporal avalanche forecasting approach has been revisited by Pozdnoukhov et al. in [29], who applied machine learning methods to the Lochaber dataset with the purpose of increasing the accuracy of the predictions. In this work, the high-performing supervised classifier called Support Vector Machine is used firstly to improve the discrimination ability in the temporal predictive task (avalanche days vs. non-avalanche days), and then applied as a preliminary extension to spatial avalanche danger forecasting.

The adopted methodology was centered on a purely data-driven approach starting with the selection of the relevant features to be employed, using the automated procedure called Recursive Feature Elimination (see section 4.1). An initial set of 44 variables, comprised of combinations of the variables measured on the slopes (current day features, previous days features, expert features), was filtered by retaining the 20 most valuable non-redundant features for the classification task. After SVM parameters optimization by cross-validation on the winters from 1991 to 2001, a test of the model performance was carried out on the 712 days of observations in the period 2001-2007. The method showed a satisfactory ability to detect avalanche days: the Overall Accuracy reached 0.86 whilst the Hanssen and Kuipers discriminant scored 0.64. A comparison with nearest neighbors methods applied on the same dataset demonstrated a slight superiority of the SVM technique.
Furthermore, a transform of the SVM decision function into a probability (see section 4.2 for details about the method) allowed a reliable interpretation of the outputs of the model in terms of the likelihood of an avalanche occurring on a given day (application to 2003/2004
winter). Given the well-known ability of this machine learning method to deal with high-dimensional data, an additional set of spatially varying features such as altitude, slope or aspect was added to the vector describing the avalanching conditions on a specified day. The purpose was to characterize the local situation at each avalanche path of the Lochaber region by providing the model with examples of about 700 avalanche events whose spatial attributes have been documented. The authors have then been able to extrapolate the avalanche activity indicator over the whole study area thanks to a digital elevation model (DEM). Such a spatio-temporal approach has been presented as an early result of a procedure needing refinements and further work aimed at the assessment of the validity of the results.

This initial work, as well as some improvements (spatial distribution of some meteorological features such as wind fields, etc.) already put into practice by the cited researchers (see [28]), is taken as a starting point for this thesis.
Chapter 2
Machine Learning

The broad research field of machine learning, rapidly developing in the last decades, is often described as a subtopic of computer science whose core concepts and ideas derive from closely related domains such as statistics and artificial intelligence. Broadly speaking, machine learning can be presented as a collection of techniques that are able to “learn” from examples the dependencies existing in the data of a given predictive task (the tasks are described in section 2.1). The different methods are designed so that the learning procedure takes place in an automatic and data-driven way. This means that, in general, no human prior knowledge or assumptions concerning data probability distributions are used during the process. For a good foundation on the topic and for additional information, [6] is suggested.

The fields, with related real-world applications, concerned by these state-of-the-art techniques are countless. Those involved earliest include bioinformatics/biometry (biosequence analyses), chemistry (cheminformatics/chemometrics), medicine (diagnoses), data mining (financial data), web and text mining (text or webpage categorization), speech and hand-written character recognition, etc. Nevertheless, the development of research in the area of environmental sciences took place only later on, with applications in domains such as spatial interpolation, remotely sensed image classification, etc. (see [17], [18]). In fact, geo-spatial phenomena modeling would benefit greatly from the operational use of the latest breakthroughs that have occurred within the machine learning community. Avalanche forecasting in particular, the topic of this thesis, is one of the geosciences domains for which machine learning methods show much promise [29].

2.1 Supervised learning vs. unsupervised learning

Machine learning methods may be classified into the categories of supervised and unsupervised learning.

Supervised learning can be thought of as a process by which a learning machine is guided through a training procedure to learn the input/output relationships existing in the data set. These examples are called the training data. Each individual sample/example is described by an input vector x belonging to ℝ^N, usually referred to as the input, and presents
a related known output y. This means that each sample can be represented as a vector in an N-dimensional space (N variables). Depending on the type of the y value, one can define the task as a regression problem or a classification problem (pattern recognition). In the first case the output associated with a given input is a real value y ∈ ℝ. In the second case, with which this thesis will be dealing, the output values are discrete, resulting in a binary classification task if y ∈ {−1, 1} or in a multi-class classification task with m classes if y ∈ {1, 2, . . . , m}. The learning machine, after having seen all L training examples {(x_1, y_1), . . . , (x_L, y_L)}, then provides an estimate of the original function y = f(x) mapping the inputs to the output domain.

The other learning approach may be termed unsupervised learning. In this case the learning machine is not provided with the outputs y, and the goal of the method is to extract information about the process which generated the data. The main types of this kind of learning are clustering (also known as cluster analysis) and density estimation. The first one is concerned with the grouping of the data points into clusters whose members have similar characteristics, without knowing their true class labels. Density estimation methods attempt to model the underlying probability distribution of a certain observed phenomenon. Combinations of the supervised and unsupervised domains are also possible, resulting in semi-supervised learning, an approach where labeled and unlabeled examples are provided at the same time to the learning machine. A summary of these hybrid techniques, implemented to make use of all the available information in order to improve the predictive model, can be found in [5].

The present thesis mainly deals with the supervised approach for binary classification problems. The chosen learning system, and its associated tools, is known as Support Vector Machines (SVMs). The technique is part of the subfield of machine learning referred to as kernel methods [35]. This supervised classification method based on the so-called Support Vectors will be detailed in chapter 3. In [11] the reader will find a comprehensive description of other supervised learning techniques. These include Fisher's Linear discriminant analysis, Logistic regression, Decision trees, Multi-Layer Perceptrons, Probabilistic Neural Networks, k-Nearest Neighbors, etc. The latter will be discussed in the next subsection (2.1.1) since it is a benchmark method widely used in avalanche forecasting.

2.1.1 Nearest Neighbors for classification

The technique called k-Nearest Neighbors (k-NN) is probably the most intuitive method to solve a classification problem. One can reasonably think that similar inputs x, in other words examples described by variables taking analogous values, will possess, in most cases, the same output class label y. This leads to a decision about the class membership of a new point x based on its Euclidean distance (see equation (2.1)) to the training samples x_i. The cited dissimilarity measure between samples u and v is computed as

dist(u, v) = \sqrt{ \sum_{d=1}^{N} (u_d - v_d)^2 },   (2.1)

where d is the variable index.
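To make this rule concrete, the following is a minimal MATLAB sketch of a k-NN classifier built on the Euclidean distance (2.1) and on the majority vote described in the next paragraph; the function and variable names are illustrative and do not come from the code used in this thesis.

function ynew = knn_classify(Xtrain, ytrain, xnew, k)
% Minimal k-NN classifier for binary labels y in {-1,+1}.
% Xtrain : L x N matrix of training inputs
% ytrain : L x 1 vector of class labels
% xnew   : 1 x N input vector to classify
% k      : number of neighbors taking part in the vote

% Euclidean distances of equation (2.1) to all training samples
d = sqrt(sum(bsxfun(@minus, Xtrain, xnew).^2, 2));

% labels of the k closest training samples
[~, idx] = sort(d, 'ascend');
nearest  = ytrain(idx(1:k));

% majority vote: sign of the sum of the neighboring labels
ynew = sign(sum(nearest));
if ynew == 0          % tie: fall back on the single nearest neighbor
    ynew = nearest(1);
end
end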
In order to predict the class label y of the vector currently under consideration, a majority vote is set up among the k nearest examples (k smallest distances) found in the N-dimensional input space. With a fixed distance measure, the only parameter to tune to get the optimal accuracy in the class label assignments is the number k of neighbors to include in the decision vote.

Essentially, choosing a low value of k corresponds to assuming that the data are not corrupted by noise (a structured dataset), so that a close correspondence can be established between the training vectors at our disposal and the new ones whose label y should be forecast. On the contrary, choosing a large k in most cases means that we believe that the configuration of the training examples is largely unstructured, leading to a tricky input/output matching. This gives rise to a decision process involving a larger set of neighboring examples, approaching a simple global majority vote when k tends to the number of training samples L.

The approach presented here provides good results particularly for low-dimensional datasets. Due to this success, as well as its appealing logic, k-Nearest Neighbors is often used as a reference technique. On the other hand, when dealing with many variables, this algorithm suffers from the so-called curse of dimensionality. In a high-dimensional input space, new samples whose labels are to be predicted by looking at their neighborhood are often found to be equally far from all the training inputs, precluding any reliable prediction.

2.2 Statistical Learning Theory

In the domain of machine learning, Statistical Learning Theory [39], also known as Vapnik-Chervonenkis theory, first developed by V. Vapnik in the 1960s and 1970s, provides a good framework for so-called predictive learning. The main goal of this theory is the optimal assessment of a model according to a trade-off between its ability to honor the available information and its complexity.

As stated in section 2.1, a supervised learning model, at the end of the training period, retains a function executing the mapping y = f(x), typically called the decision function for a classification problem. This function should be chosen from a set of functions F = {f(x, α), α ∈ Λ}, where α represents a vector of parameters selected from the set Λ. According to Vapnik's concepts, the criterion used to evaluate the goodness of the choice of a given function f(x, α), in other words its similarity to the unknown target function that depicts the actual input/output dependencies, is the following risk functional, called the expected risk:

R(α) = \int Q(y, f(x, α)) \, dP(x, y),   (2.2)

where Q(y, f(x, α)) is a task-defined loss function and P(x, y) is the unknown joint probability distribution of the examples. As can be intuitively understood, the risk should be as low as possible, so our goal is to minimize the expected average loss (2.2).

Reviewing the two main learning problems already mentioned (omitting clustering and density estimation), let us introduce the loss function most commonly used in pattern
recognition:

Q(y, f) = 0 if f(x) = y, and Q(y, f) = 1 otherwise.   (2.3)

For such a loss function, the resulting expected risk is nothing but the probability of a classification error.

In the domain of regression problems the aim is to minimize the differences between the actual output value y and the predicted one f(x) for every example. This is translated into mathematical terms, in most cases, by means of the squared loss function

Q(y, f) = (y − f(x))^2.   (2.4)

2.2.1 Empirical Risk Minimization

Once the principles allowing us to evaluate the performance of a learning machine have been defined, Statistical Learning Theory reminds us that, in fact, the distribution P(x, y) of equation (2.2) is unknown, so that the only known input/output pairs are those of the given finite set of examples. The first thought is to approximate the theoretical risk functional by an empirical one, simply computed on the training examples as

R_{emp}(α) = \frac{1}{L} \sum_{i=1}^{L} Q(y_i, f(x_i, α)),   (2.5)

where L is the number of training samples. A minimization of this function, the Empirical Risk Minimization, is then carried out in order to select the best set of parameters α. However, such a choice is strongly dependent on the examples provided to the learning machine for training. As discussed in more detail in section 2.3, it is possible to partially circumvent this drawback by using a cross-validation methodology or by splitting the initial dataset into 2 parts (use of an independent set of data). Additionally, the same section will explain that, when aiming at evaluating the overall performance of the learning machine, yet another set of examples is required.

2.2.2 Structural Risk Minimization

In the theoretical framework of Statistical Learning Theory, with the purpose of considering the ability of a model to extend the learnt relationships to unobserved new data, the notion of Structural Risk Minimization is introduced. Essentially, the idea is to place an upper bound on the expected risk (2.2) which varies according to the empirical risk and a defined confidence interval, such that

R(α) ≤ R_{emp}(α) + \sqrt{ \frac{ h \left( \log(2L/h) + 1 \right) − \log(η/4) }{ L } },   (2.6)

where L is the number of training samples and h is the so-called Vapnik-Chervonenkis dimension (VC-dimension) of the set of functions used [39]. The resulting inequality, which holds with probability 1 − η, reports a particular bound valid only for the classification case.
The quantity h deserves some further explanation because it is one of the main concepts of Vapnik's theory. For a binary classification problem, h can be interpreted as the maximum number of samples for which a class-consistent partitioning can always be achieved using the given set of functions. A two-dimensional data set consisting of 3 vectors can always be separated with a linear function, no matter what the labeling of the points is. A difficulty occurs if there are 4 samples to shatter: a chessboard-like setting will forbid any valid linear separation. Finally, we can state that linear decision functions in ℝ^N, hyperplanes of the form f(x) = w · x + b, possess a VC-dimension of N + 1. In comparison, a polynomial function of degree 2 applied in ℝ^2 has a VC-dimension of 4 and, as a borderline case, for the function f(x) = b sin(w x) this quantity is equal to infinity (the high frequency obtained for a large ‖w‖ allows the separation of every possible configuration of points).

Looking at equation (2.6), it may be seen that the expected risk is minimized when the confidence interval, the second term on the right-hand side of the inequality, is kept small by a low h/L ratio. By the mentioned inequality, a function with a large VC-dimension h which perfectly fits a small number of data points L will result in a large expected risk, since there is overfitting. Such a complex model will likely lead to an important generalization error. Figure 2.1 illustrates how the bound on the risk varies depending on model complexity.

Figure 2.1: Bound on the risk varying according to the confidence interval and the empirical risk associated with sets of models of increasing complexity. After [39].

To summarize, the Structural Risk Minimization principle provides a theoretical framework for achieving the optimal trade-off between the classification accuracy on training data and the capacity of the set of functions selected. Later on, in subsection 3.1.4 of chapter 3, we will look at the concrete means the SVM algorithm supplies to handle this kind of issue. The next section illustrates the general procedure adopted when using a supervised learning approach.
2.3 Model selection and model assessment

The preceding sections have discussed how Statistical Learning Theory allows the evaluation of the performance of a model with respect to its complexity. When one is concretely applying a supervised learning classification algorithm there are several practical considerations that need to be respected in order to properly use the method.

First, the model selection step is crucial. The fact that the empirical error (training error) is computed on the training examples given to the learning machine should be taken into consideration when choosing the optimal parameters. A model that closely or perfectly fits noisy or non-representative training data (see the example of figure 2.2) is said to overfit (as opposed to a too simple model, which gives rise to the situation called underfitting). Overfitting results in a poor generalization ability of the system when dealing with new data. It is required that the tuning of the parameters defining the model is carried out on an independent data set (different from the training one). A set of labeled examples called the validation set is extracted from the original data and held separate from the training subset in order to compute the classification quality measures (validation error, etc.). Predictions of class memberships are performed on the validation set ignoring the actual known class labels, so that the agreement between the true and predicted class assignments can then be checked. An optimization process allows the user to determine the best parameters for the classification task.

Figure 2.2: Example of an overfitting situation for a binary classification problem. The green discriminating boundary perfectly separates red and blue points by overfitting this training data. The classifier shown in black allows some training errors but will then be able to predict the class labels of a new set of data in a more robust way.

Another split of the data is mandatory if one desires to assess the generalization error of the selected model (model assessment). An independent test set should be used, whenever possible, to assess the true performance of the model. In this way, the performance is estimated on independent data, reproducing the future behavior in a new situation. In fact, it is not fair to report the performance obtained on the previously used validation set as a model success measure, because the learning machine is favorably biased towards this data (parameters perfectly tuned for this set) [17].
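As a minimal illustration of this splitting strategy, the following MATLAB sketch first reserves an independent test set and then carves a validation set out of the remaining data. The 20% test fraction is an arbitrary illustrative choice, as are the variable names; the 25%/75% validation/training partition is the one mentioned again in section 3.3.

% X : L x N matrix of inputs, y : L x 1 vector of labels (illustrative names)
L   = size(X, 1);
idx = randperm(L);                      % random shuffling of the samples

nTest  = round(0.20 * L);               % e.g. 20% kept aside for the final model assessment
iTest  = idx(1:nTest);
iRest  = idx(nTest+1:end);

nVal   = round(0.25 * numel(iRest));    % 25%/75% validation/training split of the rest
iVal   = iRest(1:nVal);
iTrain = iRest(nVal+1:end);

Xtrain = X(iTrain, :);  ytrain = y(iTrain);   % used to train the classifier
Xval   = X(iVal, :);    yval   = y(iVal);     % used to tune the hyper-parameters
Xtest  = X(iTest, :);   ytest  = y(iTest);    % used only once, to estimate the generalization error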
Chapter 3
Support Vector Machines for classification

This chapter will focus on the learning machine that is at the core of almost every step of the analyses performed in this thesis. The system implementing in an efficient and robust way the training of a supervised classifier is Support Vector Machines (SVMs). Moreover, SVMs adhere to the guidelines provided by the Statistical Learning Theory discussed in section 2.2 of the previous chapter. Section 3.1 will examine how and why a linear decision function can optimally be used as a foundation for the classification task when applied in the high-dimensional space induced by the kernel expansions delineated in section 3.2.

3.1 Large margin linear classifier

3.1.1 Optimal separating hyperplanes

When dealing with a problem where different objects have to be divided into two categories by placing a discriminating boundary, the most intuitive option is to draw a separating line. This is exactly the principle applied by SVMs. More generally, in an N-dimensional space, the line becomes a hyperplane f(x) = w · x + b. The input vector x ∈ ℝ^N is multiplied by a weight vector w which needs to be optimized along with the scalar b. In 2D (2 variables x_1 and x_2 describing the examples) the resulting function gives the equation of a plane with coordinates (f(x), x_1, x_2). If a horizontal plane is defined at the height of the level curve f(x) = 0, linearly separating the data points, and if these vectors are labeled following the sign of the function f(x), they are classified either in the positive class, if lying above the f(x) = 0 surface, or, otherwise, in the negative class (below the horizontal plane).

In order to construct an optimal hyperplane for a linearly separable case, let us define some strict conditions for the class-labeling task it carries out. For the training dataset, the
values of the decision function f(x) should respect

w · x_i + b ≥ +1, if y_i = +1
w · x_i + b ≤ −1, if y_i = −1.   (3.1)

A positive sample (y_i = +1) should therefore be associated with a decision function value greater than or equal to 1 and, on the other hand, a negative input (y_i = −1) should be given a value less than or equal to −1. These two parts of equation (3.1) can be merged into

y_i (w · x_i + b) ≥ 1.   (3.2)

This formulation tells us that there should not be any training vector lying in the region where the hyperplane takes values between +1 and −1, and that only a few points will lie exactly on the level curves of height +1 or −1. As can be seen in figure 3.1, the samples located on the level curves are called support vectors (SVs) and the region between the positive one (f(x) = +1) and the negative one (f(x) = −1) is referred to as the margin, of width ρ. Obviously, the decision boundary between the two classes becomes the hyperplane f(x) = w · x + b = 0.

Figure 3.1: Geometrical representation (2D) of the location of the SVs and the consequent class margin placements. Following [19].

The goal of a classifier is to generalize the rules learned from the training data to situations where new instances have to be classified. Thus, if one tries to place the separating hyperplane in such a way that most of the new data points will be found on the correct side of the class boundary, the solution consists in looking for the largest possible margin. The small margin hyperplane visible on the left side of figure 3.2 correctly splits the training points (solid colored marks) of the two classes (circles vs. crosses), but when testing examples (grey marks) are introduced it reveals a poor generalization ability (many misclassification errors). On the contrary, the large margin obtained on the right side is robust and is more likely to classify the new samples correctly. The width of the margin can easily be computed as

ρ = \frac{w}{‖w‖} · (x_+ − x_−) = \frac{w · x_+ − w · x_−}{‖w‖} = \frac{(1 − b) − (−1 − b)}{‖w‖} = \frac{2}{‖w‖},   (3.3)
where w is the vector defining the hyperplane, x_+ is one of the positive class SVs (contributing to the margin definition) and x_− is a negative class SV.

Figure 3.2: The introduction of the testing samples (in grey) leads to many classification errors when the margin is not optimized (left figure). Modified after [19].

As shown by equation (3.3), the goal is to minimize ‖w‖. This intuitive minimization problem is theoretically justified by the insights of Statistical Learning Theory [39]. It is stated that the complexity h of the set of functions is bounded by

h ≤ min(R^2 ‖w‖^2, N) + 1,   (3.4)

where R is the radius of the smallest sphere enclosing all the training vectors belonging to ℝ^N. Consequently, a large margin, implying a small ‖w‖, helps keep the capacity of the model low, pertinent and thus efficient.

3.1.2 The optimization problem

In order to accomplish the training of the machine, we are faced with an optimization problem. SVMs provide an efficient algorithm to maximize the margin (3.3) whilst respecting the constraints (3.2). Taking advantage of the concepts of the constrained optimization paradigm (Lagrangian theory), developed by Lagrange at the end of the 18th century, and the extensions provided in the 1950s by Kuhn and Tucker, the following results can be derived. After having introduced the Lagrange multipliers α_i ≥ 0 associated with the training inputs x_i, one can express the so-called primal formulation of the optimization problem (primal Lagrangian) as

L_P = \frac{1}{2} ‖w‖^2 − \sum_{i=1}^{L} α_i y_i (w · x_i + b) + \sum_{i=1}^{L} α_i.   (3.5)

Since we are looking for the maximal margin (minimal ‖w‖), the task consists in minimizing (3.5) with respect to w and b. Because of the convexity of the function L_P, this is done by searching for the values at which the associated derivatives (3.6) vanish:

\frac{∂L_P(w, b, α)}{∂b} = 0,   \frac{∂L_P(w, b, α)}{∂w} = 0.   (3.6)
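Carrying out this differentiation explicitly on (3.5) gives the two conditions that are substituted into the primal form just below:

\frac{∂L_P}{∂b} = −\sum_{i=1}^{L} α_i y_i = 0  ⟹  \sum_{i=1}^{L} α_i y_i = 0,

\frac{∂L_P}{∂w} = w − \sum_{i=1}^{L} α_i y_i x_i = 0  ⟹  w = \sum_{i=1}^{L} α_i y_i x_i.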
The resulting conditions

\sum_{i=1}^{L} α_i y_i = 0,   w = \sum_{i=1}^{L} α_i y_i x_i,   (3.7)

can be substituted into the primal form to get the dual formulation of the problem

L_D = \sum_{i=1}^{L} α_i − \frac{1}{2} \sum_{i,j=1}^{L} α_i α_j y_i y_j \, x_i · x_j.   (3.8)

At this point one finds the parameters α_i by maximizing (3.8) with respect to these same α_i, subject to the constraints \sum_{i=1}^{L} α_i y_i = 0 and α_i ≥ 0, i = 1, . . . , L. The cited task actually consists of a quadratic programming problem (quadratic objective function with linear constraints). The solution of the optimization problem allows the final SVM decision function to be formulated as

f(x) = \sum_{i=1}^{L} y_i α_i \, x · x_i + b.   (3.9)

The predicted class label (+1 or −1) is simply assigned following the sign of (3.9) when dealing with a binary classification task. If the input vectors belong to more than 2 classes, the solution consists of combining several binary classifiers with either a one-vs-all approach or a one-vs-one approach. This multi-class extension of SVMs is accurately described in [34]. A comprehensive and clear description of the optimization problem and its resolution, summarized in this section, can be found in [34] and [9].

3.1.3 Support Vectors and their relevance

The main outputs of the training procedure of the SVM are the values α_i. Looking at equation (3.9), one can see that these coefficients are the weights given to each training vector x_i. However, only a small proportion of them receive a non-zero α_i. Thus, only a subset of the initial training set is truly influential in the evaluation of the decision function. These informative points are called support vectors and, conforming to the situation depicted by figure 3.1, they lie on the margin (on the positive or negative side according to their label y_i). For the support vectors, the inequality (3.2) turns into the equality y_i (w · x_i + b) = 1. Given that such a subset is the only fraction of the data that participates in the prediction, the same result would be achieved if all the rest of the points were withdrawn from the training set before training the system.

3.1.4 Soft margin adaptation

In subsection 3.1.1, figure 3.1 shows a linearly separable situation where the two classes are not overlapping: the training examples are described by inputs that can be partitioned by a hyperplane. Clearly this is an ideal situation one will rarely be dealing with. In reality, data are usually noisy, so that it is impossible to avoid training errors when drawing a separating line. These considerations lead to a slightly different formulation of the large margin classifier, the soft margin classifier. The “hard” margins presented with (3.2) are “softened”
by means of the slack variables ξ_i. The intuition consists of letting noisy training samples (lying outside their class level curve +1 or −1) fulfill the requirements as

y_i (w · x_i + b) ≥ 1 − ξ_i.   (3.10)

In this way, positive (negative) vectors can be associated with a decision function which does not have to be strictly larger (smaller) than 1 (−1). For example, a sample lying on the wrong side of the decision boundary w · x + b = 0 will be given a ξ_i > 1 so that it will then be treated as a coherent class member. Figure 3.3 shows for which inputs the slack variables have to be introduced.

Figure 3.3: Slack variables ξ_i are assigned to noisy samples lying outside their class margin. Following [19].

In order to keep a low empirical error (2.5) one should, of course, force the algorithm to assign non-zero ξ_i values to as few of the training samples as possible. Therefore, in the optimization process, the first term of the initial functional (3.5) that has to be minimized is substituted by

\frac{1}{2} ‖w‖^2 + C \sum_{i=1}^{L} ξ_i.   (3.11)

The left term in (3.11) is the one the procedure had to minimize to find the largest possible hard margin. The added right term, which also has to be minimized, accounts for the number and relevance of the misclassification errors in the training set.

The weighting constant C (cost) allows the user to control this kind of error during the training phase and conveys the confidence the user has in the data. With a large value of C, implying the belief that the dataset is not noisy, every misclassified example is heavily penalized, leading to a very small training error. The drawback is that such a great importance conferred to the training data will give rise to the overfitting phenomenon, due to the complexity of the applied model. From this point of view, the inverse of C can then be interpreted as a regularization constant. Furthermore, concerning the quadratic programming problem, the parameter C turns out to be the upper bound for the α_i, so that 0 ≤ α_i ≤ C, ∀i.

So, the minimization of the first term of (3.11) corresponds to lowering the upper bound on the VC-dimension (see equation (3.4)) controlling the confidence interval described in
equation (2.6). The second term of this functional mainly controls the empirical error which appears in (2.6). Finally, both terms contribute to keeping the expected risk low, since the second term of (3.11) also suggests the use of a simple model (small h) if one chooses a low value of C.

3.2 Kernel expansion

3.2.1 The principle

Up to this point, we have seen how a linear decision function can be optimally applied to classify our examples with binary labels. When dealing with challenging data sets where the input/output relationships are non-linear, we need a cleverer way to discriminate the two classes. The key idea is to map the dataset into a space of higher dimension and then perform the well-known linear separation on the transformed data, rather than applying complex decision functions directly to the initial data set. This is possible since we have seen in equation (3.9) that, for the linear case, the decision about the class membership of a new sample x depends only on a dot product between this input vector and the training samples x_i. Thus, the intuition, called the kernel trick, is to substitute the dot product with a kernel function K(·, ·) involving the same two vectors, so that the final decision function changes to

f(x) = \sum_{i=1}^{L} y_i α_i K(x, x_i) + b.   (3.12)

This is the final formulation of the decision function for a classification task carried out with SVMs. The function K(·, ·), for simplicity referred to as the kernel, carries out the mapping to the higher dimensional space, not directly by generating the longer coordinate vector out of the two samples, but in an implicit way. The result of the dot product involving the mapped vectors, φ(x) and φ(x_i), is equal to the output of the kernel computed with the low-dimensional vectors as inputs:

x · x_i ⟼ φ(x) · φ(x_i) = K(x, x_i).   (3.13)

Using the machine learning vocabulary, we refer to the original space as the input space, whilst we name the kernel-induced one the feature space.

3.2.2 A concrete example

As a demonstration, one can try to compute the polynomial kernel of degree 2, defined as K(x, x_i) = (x · x_i + 1)^2, for a pair of inputs belonging to ℝ^2, u = (u_1, u_2) and v = (v_1, v_2). One finds out that, as shown by the equalities in (3.14), applying such a kernel to u and v results in the same sum of terms that one would have obtained with a simple dot product between two high-dimensional mappings of the original vectors. The mapping that we refer
to is the following: φ(u) : (u_1, u_2) ⟼ (u_1^2, u_2^2, \sqrt{2} u_1 u_2, \sqrt{2} u_1, \sqrt{2} u_2, 1), resulting in a feature space of 6 dimensions, i.e. φ(u) ∈ ℝ^6.

K(u, v) = (u · v + 1)^2   (3.14)
        = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2 + 1
        = (u_1^2, u_2^2, \sqrt{2} u_1 u_2, \sqrt{2} u_1, \sqrt{2} u_2, 1) · (v_1^2, v_2^2, \sqrt{2} v_1 v_2, \sqrt{2} v_1, \sqrt{2} v_2, 1)

3.2.3 Valid kernel functions

The rapid developments in the field of kernel methods have brought a wide range of different kernel functions that can be successfully applied. However, it is important to recall that not every function involving two vectors constitutes a kernel. In fact, valid kernels have to fulfill the so-called Mercer's conditions (see [9]). In a few words, these constraints must be met for a selected function K(x, x_i) to act as a kernel associated with the desired feature space (output equal to the dot product of the mapped vectors). Strictly speaking, this means that the kernel matrix K = (K(x_i, x_j))_{i,j=1}^{n} has to be symmetric and positive semidefinite (possess non-negative eigenvalues). The matrix K, also known as the Gram matrix, has as elements the outputs of the kernel function for every pair of input vectors (x_i, x_j).

Additionally, user-defined kernel functions can be created by multiplying or adding valid kernels, since the resulting functions also respect Mercer's conditions. If K_1(·, ·) and K_2(·, ·) are kernels, then

• a K_1(·, ·) + b K_2(·, ·) for a, b ≥ 0
• K_1(·, ·) K_2(·, ·)

are valid kernels as well (proof available in [35]). These properties allow us to construct composite kernels which can then be useful to improve classification performance (see [4]).

Here we present a list of the most frequently used kernel functions:

• Linear kernel:
  K(x, x_i) = x · x_i   (3.15)

• Polynomial kernel:
  K(x, x_i) = (x · x_i + 1)^p, p ∈ ℕ   (3.16)

• Gaussian RBF kernel:
  K(x, x_i) = \exp\left( −\frac{(x − x_i)^2}{2σ^2} \right), σ ∈ ℝ^+   (3.17)

The first item, the linear kernel, corresponds to the situation where the kernel trick has not been applied, while the second one illustrates the general form (with the degree p as an option) of the polynomial kernel brought into play in subsection 3.2.2. The last kernel mentioned, the Gaussian Radial Basis Function kernel, will be discussed in more detail in the next subsection.

It is interesting to point out that the choice of the kernel type also allows the user to control the complexity of the model (bound on the risk (2.6)), since the VC-dimension h also varies according to the feature space into which the inputs are mapped. In fact, the linear separation performed by the SVM algorithm is executed in the N-dimensional feature space, resulting in a value of h = N + 1 (see subsection 2.2.2).
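To make the pieces above concrete, the following is a minimal MATLAB sketch of how the kernelized soft-margin dual — the objective (3.8) with the dot product replaced by the Gaussian RBF kernel (3.17), under the box constraint 0 ≤ α_i ≤ C of subsection 3.1.4 — can be solved numerically. It assumes the Optimization Toolbox function quadprog is available; all variable names (Xtrain, ytrain, Xnew, sigma, C) are illustrative, and this is a sketch rather than the implementation used in this thesis.

% Minimal sketch: train a soft-margin SVM with the Gaussian RBF kernel by
% solving the dual as a quadratic program (requires quadprog).
L = size(Xtrain, 1);

% Gram matrix of the Gaussian RBF kernel (3.17)
sqd = sum(Xtrain.^2, 2);
D2  = bsxfun(@plus, sqd, sqd') - 2 * (Xtrain * Xtrain');   % squared Euclidean distances
K   = exp(-D2 / (2 * sigma^2));

% numerical check of Mercer's condition: the Gram matrix must be positive semidefinite
assert(min(eig((K + K') / 2)) > -1e-8);

% Dual problem: maximize sum(alpha) - 1/2 * alpha' * H * alpha, with
% H_ij = y_i * y_j * K(x_i, x_j), subject to sum(alpha_i * y_i) = 0 and 0 <= alpha_i <= C,
% i.e. minimize 1/2 * alpha' * H * alpha - sum(alpha) for quadprog.
H     = (ytrain * ytrain') .* K;
f     = -ones(L, 1);
alpha = quadprog(H, f, [], [], ytrain', 0, zeros(L, 1), C * ones(L, 1));

% support vectors carry a non-zero alpha; b is obtained from an unbounded SV
sv = alpha > 1e-6;
on = find(sv & (alpha < C - 1e-6), 1);     % assumes at least one SV strictly inside the box
b  = ytrain(on) - sum(alpha(sv) .* ytrain(sv) .* K(sv, on));

% decision function (3.12) evaluated on new inputs Xnew (M x N)
Dn2   = bsxfun(@plus, sum(Xnew.^2, 2), sqd') - 2 * (Xnew * Xtrain');
Knew  = exp(-Dn2 / (2 * sigma^2));
fnew  = Knew(:, sv) * (alpha(sv) .* ytrain(sv)) + b;
ypred = sign(fnew);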
3.2.4 Details on the Gaussian RBF kernel

Among the available kernel functions, a user's choice often falls on the well-known Gaussian RBF kernel due to the simple geometrical interpretation it offers.

As one can see from formula (3.17), the numerator of the argument of the exponential function is nothing but a dissimilarity measure between the vector x and the vector x_i. In fact, (x − x_i)^2 = ‖x − x_i‖^2 is the squared Euclidean distance between the examples computed in the input space. By taking the exponential of its negative value, one assigns a large value only to close samples. One will notice an exponential decrease starting from a summit of 1, the output value when evaluating two identical vectors. Since the outputs K(·, ·) are included in (3.12), they weight the training samples x_i in the sum over all the L labeled instances. The labels y_i (values of +1 or −1) associated with the inputs will then have different influences in the final decision function yielding a class membership for the new data point x.

Moreover, the parameter σ, appearing in the denominator, controls the bandwidth of the Gaussian surface centered on the vector x, the object of the prediction. Figure 3.4 shows how the weights vary according to the kernel width σ, illustrating the smoothing effect of a large value. In fact, a small bandwidth lets only training vectors x_i close to x in the input space contribute significantly to the final decision function.

Figure 3.4: The Gaussian RBF kernel function K(x, x_i) with x = (0, 0) and x_i = (x_{i,1}, x_{i,2}) for a varying x_i, plotted for bandwidths σ = 0.5 and σ = 1.

A peculiarity of this kernel is that, contrary to the other two presented here, the similarity between the input vector x and the training inputs x_i is measured as a Euclidean distance
and not in terms of an angle in the input space. The latter is the case when one evaluates the dot product (linear or polynomial kernels), which can geometrically be interpreted as the cosine of the angle between the 2 vectors.

3.3 Parameters tuning

As stated in subsection 2.2.1, the parameters (usually also referred to as hyper-parameters) defining a good model have to be chosen through the assessment of the quality of the predictions on an independent data set.

Often, when approaching such a task, a cross-validation approach is chosen. This procedure, precisely named leave-k-out cross-validation, consists in training the model on all the points of the training set except for a subset formed by k randomly chosen vectors. A prediction of the output is then carried out for these points, allowing, in the case of classification, a comparison of the labels. The procedure is repeated until each training vector has been provisionally removed from the main set (partitioned L/k times).

However, this procedure requires a computationally intensive effort when working with SVMs. In fact, such an algorithm usually requires the classifier to be retrained each time a new subset of points is left out. This makes the approach poorly suited to large data sets. Consequently, as pointed out in section 2.3, the initial training set may be split in two parts so that a validation set can then be used to evaluate the predictions based on the learned input/output relationships. A popular validation set/training set partition is 25%/75% of the original training data.

The hyper-parameters of an SVM model that have to be tuned are the cost C and the kernel parameters (σ for the Gaussian RBF kernel, p for the polynomial kernel, etc.). Because no direct analytic function links the variations of these parameters to the changes in the chosen performance measure, a grid search approach must be chosen. In the case of the Gaussian RBF kernel, this means that a measure like the classification error (the wide range of performance measures is described in section 3.4) is computed for a set of different values of C and σ spanning a user-defined space. We then look for the values optimizing the classification performance (lowest validation error, highest accuracy, etc.). A minimal validation error should correspond to a low percentage of SVs. Indeed, too many SVs being identified after the training procedure is a warning sign of overfitting caused by an overly complex model (a too small bandwidth σ is one example).

In some particular cases where class counts are unbalanced (usually many more negative examples than positive ones), it is possible that the SVM decision function threshold f(x) = 0 (see subsection 3.1.2) is not the optimal one. In such situations, a threshold tuning can be carried out as well. This can result in a significant improvement of the classifier performance in terms of the selected performance measure score. However, if satisfactory results are not obtained in this manner, an additional effort is required in order to deal with such a nonstandard situation. A well-suited procedure to apply in these cases is presented in [22]: the authors propose a modification of the cost function of the SVM.
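A minimal sketch of such a grid search over (C, σ) on a held-out validation set is given below; train_rbf_svm and predict_rbf_svm are hypothetical helper functions (for instance wrapping the quadprog-based training sketched in section 3.2), and the grid ranges are purely illustrative.

% Grid search over (C, sigma) using a validation set (see section 2.3 for the split).
Cgrid     = 10.^(-1:3);          % illustrative logarithmic grids
sigmaGrid = 2.^(-2:4);

bestErr = Inf;
for C = Cgrid
    for sigma = sigmaGrid
        model = train_rbf_svm(Xtrain, ytrain, C, sigma);    % hypothetical helper
        ypred = predict_rbf_svm(model, Xval);                % hypothetical helper
        err   = mean(ypred ~= yval);                         % validation error
        if err < bestErr
            bestErr = err;  bestC = C;  bestSigma = sigma;
        end
    end
end
fprintf('best C = %g, best sigma = %g, validation error = %.3f\n', bestC, bestSigma, bestErr);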
3.4 Binary classification quality measures

In assessing the classification performance of a supervised learning model, particularly when dealing with a binary classification task, a broad range of categorical statistics is available (see [10]). This section will describe the main measures currently used for the case where the model predictions are related to the forecasting of an event (occurrence vs. non-occurrence). Prediction results may be organized into the 2-by-2 confusion matrix illustrated by table 3.1.

                              Predicted (Forecast)
                              Class +1 (Yes)   Class −1 (No)       row totals
Actual      Class +1 (Yes)    hits             misses              observed yes
(Observed)  Class −1 (No)     false alarms     correct negatives   observed no
            column totals     forecast yes     forecast no         total

Table 3.1: Confusion matrix for binary predictions related to the forecasting of an event.

Given that an event may either be observed or not, and then either forecast or not by the model, 4 possible situations can be encountered: the observed event can be correctly forecast (hit or true positive) or not detected (miss or false negative), while a non-event can be incorrectly forecast (false alarm or false positive) or correctly not notified (correct negative or true negative).

The following ratios are then often used:

True Positive rate (hit rate) = \frac{hits}{hits + misses}   (3.18)

False Positive rate (false alarm rate) = \frac{false alarms}{false alarms + correct negatives}   (3.19)

As overall model success measures we can find:

Overall Accuracy = \frac{hits + correct negatives}{total}   (3.20)

Hanssen and Kuipers discriminant = TP rate − FP rate   (3.21)

Heidke Skill Score = \frac{hits + correct negatives − exp. correct}{total − exp. correct}   (3.22)

Bias = \frac{hits + false alarms}{hits + misses} = \frac{forecast yes}{observed yes},   (3.23)

where “exp. correct” is the expected number of correct forecasts due to random chance. This value is computed, under the assumption of independence between the actual and predicted classes, from the marginal sums as

exp. correct = \frac{forecast yes · observed yes + forecast no · observed no}{total}   (3.24)
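These statistics follow directly from the four cells of table 3.1; the short MATLAB sketch below computes them (the function and field names are illustrative).

function s = forecast_scores(hits, misses, falseAlarms, correctNegatives)
% Categorical statistics of section 3.4, computed from the cells of table 3.1.
total       = hits + misses + falseAlarms + correctNegatives;
observedYes = hits + misses;
observedNo  = falseAlarms + correctNegatives;
forecastYes = hits + falseAlarms;
forecastNo  = misses + correctNegatives;

s.TPrate = hits / observedYes;                      % hit rate (3.18)
s.FPrate = falseAlarms / observedNo;                % false alarm rate (3.19)
s.OA     = (hits + correctNegatives) / total;       % Overall Accuracy (3.20)
s.HK     = s.TPrate - s.FPrate;                     % Hanssen and Kuipers discriminant (3.21)

expCorrect = (forecastYes * observedYes + forecastNo * observedNo) / total;   % (3.24)
s.HSS    = (hits + correctNegatives - expCorrect) / (total - expCorrect);     % Heidke Skill Score (3.22)
s.Bias   = forecastYes / observedYes;               % Bias (3.23)
end

For instance, with round numbers close to those reported for the Cornice batch test in subsection 1.2.2 (about 200 hits, 60 misses and 115 false alarms, hence roughly 630 correct negatives out of 1005 days), forecast_scores(200, 60, 115, 630) returns OA ≈ 0.83 and HK ≈ 0.61, consistent with the scores quoted there.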
The first measure, the Overall Accuracy (OA, range: 0 → 1, perfect score: 1), reports the number of correct predictions over the total number of points, suggesting whether the model's overall performance is reliable. It becomes a poor statistic if correct negatives (many non-events) are predominant, since classifying every instance in the class −1 will then lead to good scores.

The Hanssen and Kuipers discriminant (HK, range: −1 → 1, perfect score: 1) subtracts the false alarm rate from the hit rate, indicating the capacity of the current forecasting system for discriminating between events and non-events. Therefore, when non-events are the norm, this measure is very suitable because the number of false alarms has less of an influence on the model performance assessment. Instead, a higher importance is given to missed events (appearing in the denominator of (3.18)). This takes on additional importance in cases where the 2 types of errors have different costs (e.g. avalanche forecasting): false alarms are usually less damaging than misses.

The Heidke Skill Score (HSS, range: −∞ → 1, perfect score: 1) measures the fraction of correct predictions after removing those forecasts which would be correct purely due to random chance. The last measure, the Bias (range: 0 → ∞, perfect score: 1), does not really indicate the classification success but informs about over- or under-forecasting, with values tending to ∞ for over-forecasting and to 0 for under-forecasting.

When the selected classifier makes class membership decisions depending on scores that can be interpreted as the degree to which an example is reasonably a class member (SVMs, neural networks, etc.), some interesting graphs involving the cited performance measures can additionally be plotted. In fact, the binary classification is executed according to a defined threshold, resulting in a positive class label if the score is above the threshold t (f(x) > t), or in a negative one if the value is lower than t (f(x) < t). The first insight is to graphically see how the model success measure changes when the class boundary varies, usually by plotting the curve constructed with the points (t, measure).

Moreover, as thoroughly detailed in [12], a Receiver Operating Characteristics (ROC) curve can be built. Such a plot is a 2-dimensional graph with the FP rate on the horizontal axis and the TP rate on the vertical axis. It efficiently represents the trade-off between the costs and benefits of the actual classification. In these terms, if we compute the two mentioned rates for the classifications obtained with thresholds varying from their minimal to maximal values, we will be able to plot a point (FP rate, TP rate) associated with each selected threshold. Figure 3.5 shows 2 possible curves in the ROC space. The curve labeled “B” is associated with a much better performing model compared to the model that produced curve “A”. The reason is that, no matter which threshold is retained at meaningful FP rates, the resulting curve lies more to the “northwest”, meaning that classifier “B” produces higher TP rates combined with lower FP rates than model “A”. As a matter of comparison, the line joining the points (0, 0) and (1, 1) corresponds to the strategy of randomly guessing a class label for every given instance to classify (if one tries to get more hits by forecasting more positive labels, the number of false alarms also increases).
Figure 3.5: ROC curves associated with 2 different classifiers (“A” and “B”). After [12].

When looking for the best possible classification, i.e. a point in the ROC space, we might take the Hanssen and Kuipers discriminant as a model success measure, since this statistic is nothing but the difference between the vertical and horizontal axis coordinates, yielding the highest value for the point (0, 1). However, when comparing two systems in an overall sense, the “area under the curve” measure is a better indicator of the average performance of the classifier over all possible threshold choices (see [12]).
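A minimal sketch of this construction is given below: the decision scores f(x) are swept with a decreasing threshold, each threshold yields one (FP rate, TP rate) point, and the area under the resulting curve is approximated with the trapezoidal rule. The helper name and the toy scores are illustrative only.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold t over the observed scores f(x) and
    return the (FP rate, TP rate) pairs that trace the ROC curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)                   # +1 / -1 class labels
    thresholds = np.sort(np.unique(scores))[::-1]
    points = []
    for t in thresholds:
        predicted_pos = scores >= t
        hits = np.sum(predicted_pos & (labels == 1))
        misses = np.sum(~predicted_pos & (labels == 1))
        false_alarms = np.sum(predicted_pos & (labels == -1))
        correct_neg = np.sum(~predicted_pos & (labels == -1))
        points.append((false_alarms / (false_alarms + correct_neg),   # FP rate
                       hits / (hits + misses)))                       # TP rate
    return points

# Toy example: approximate the area under the curve with the trapezoidal rule.
pts = sorted(roc_points([0.9, 0.4, -0.2, -0.8, 0.1], [1, 1, -1, -1, 1]))
fpr, tpr = zip(*pts)
auc = np.trapz(tpr, fpr)
```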
Chapter 4

Extensions of the SVMs-based approach

4.1 Feature selection

Feature selection methods provide the classifier with a smaller subset of variables created out of the initial set, so that it can work in a lower-dimensional input space with only the relevant features. This often improves the classification accuracy, since noisy and redundant features are filtered out. Moreover, the application of this kind of algorithm provides the analyst with meaningful information about the real influence or utility of each input feature used in the classification problem.

4.1.1 Methods overview

Many methods have been proposed to select the best features or to reduce the input space dimensionality. They have been reviewed in [15]. The techniques can be divided into categories according to the manner in which they deal with the variables.

Methods such as Principal Component Analysis linearly combine the original features to create new ones. The result is a set of uncorrelated orthogonal variables carrying a decreasing amount of information (variance). The user may then select only the largest-variance components for the classification task, which helps to avoid overfitting. However, no individual feature can be discarded, since they are all included in the creation of the new set.

The second category contains techniques that consider each initial feature independently, without taking into account the mutual information between them. Feature ranking with correlation coefficients, a simple method described in [16], belongs to this kind of approach; a minimal sketch of such a univariate ranking is given at the end of this overview.

Finally, one finds the best performing methods, which take all the input variables into account simultaneously during the ranking/selection process. This simultaneous consideration of the input variables results in a selection that is much more appropriate when the chosen classifier is a “multivariate” one (SVMs, Fisher’s linear discriminant, etc.). One such method is Recursive Feature Elimination, which is explored in more detail in the next subsection.
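The following sketch illustrates the univariate idea mentioned above: each feature is scored by its absolute Pearson correlation with the class labels, independently of all the others. The exact coefficient used in [16] may differ; the function name is an assumption.

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank features by the absolute Pearson correlation between each single
    feature and the class labels, treating every variable independently."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, k], y)[0, 1]) for k in range(X.shape[1])]
    return np.argsort(scores)[::-1]      # most correlated feature first
```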
4.1.2 SVM-Recursive Feature Elimination

In [16], Guyon et al. discuss the use of the feature ranking coefficients (provided by each discussed method) as weights in the linear decision function f(x) = \mathbf{w} \cdot \mathbf{x} + b, where \mathbf{w} is the vector of feature weights, \mathbf{x} is the input and b is a bias value. The inverse reasoning can be applied as well: the variable weights multiplying the related inputs can be used as coefficients reporting the relevance of each feature. This latter consideration is exactly the motivation that justifies the Recursive Feature Elimination (RFE) procedure combined with an SVM classifier. The RFE technique belongs to the broader category of methods named wrappers (which select the best features according to an assessment criterion related to the classifier), as opposed to those named filters (which select the best features according to a criterion independent of the classifier).

The details of the RFE algorithm differ when using a linear SVM or a non-linear one. In the following we will first treat the linear case, while the generalization to an SVM classifier using a kernel expansion will be discussed as the last topic of this subsection.

The linearly separable case

The algorithm for the linear case can be summarized as follows:

Inputs: training samples with known class labels (x_i, y_i)
repeat until every feature k has been removed
  – train the linear SVM and compute the weight vector \mathbf{w} = \sum_i \alpha_i y_i x_i (one component per variable)
  – obtain the ranking criterion for each feature k as c_k = (w_k)^2
  – find the feature with the lowest value of c
  – remove the corresponding feature values from the training data
  – update the final ranking list
end repeat
Output: ranked feature list (first removed → least relevant)

The interpretation of this procedure is that at every step of the algorithm, after having trained the SVM, the least influential feature is removed. It is worth remarking that a ranking list is already obtained after the first iteration by sorting the coefficients c_k in decreasing order. Nevertheless, the interest of this feature selection method is that, by running the whole RFE procedure, an optimal subset of complementary features is found, which may not contain the most individually relevant ones [16].
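A minimal sketch of this linear loop is given below, assuming a scikit-learn linear SVM is available (any linear SVM exposing its weight vector would do); the function name is illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available; any linear SVM with a coef_ attribute works

def svm_rfe_linear(X, y, C=1.0):
    """Rank features by recursively removing the one with the smallest (w_k)^2,
    following the linear SVM-RFE loop described above."""
    remaining = list(range(X.shape[1]))   # indices of the surviving features
    ranking = []                          # filled from least to most relevant
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()             # weight vector w = sum_i alpha_i y_i x_i
        criterion = w ** 2                # ranking criterion c_k = (w_k)^2
        worst = int(np.argmin(criterion)) # least influential surviving feature
        ranking.append(remaining.pop(worst))
    return ranking[::-1]                  # most relevant feature first
```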
Generalization to the non-linear case

When dealing with a non-linear SVM it is impossible to directly compute the components of the vector \mathbf{w}, because the sample x_i, included in a simple dot product in equation (3.9), becomes here the input of the kernel function of (3.12). Therefore, the method consists of looking for the smallest change in the squared length of the vector \mathbf{w} when removing feature k. This value, denoted W(\alpha)^2, is not computed directly as the norm of \mathbf{w}, but as

W(\alpha)^2 = \|\mathbf{w}\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) = \alpha^T H \alpha,   (4.1)

where the \alpha’s (forming the column vector \alpha) are the weights for each training point found after the optimization task, K(x_i, x_j) is the kernel output (scalar) reporting the similarity between the training samples x_i and x_j, and H is the matrix with elements y_i y_j K(x_i, x_j), an extension of the Gram matrix defined in subsection 3.2.3. As proposed by Guyon et al., at each iteration the feature to withdraw according to the final ranking criterion is selected as

f = \arg\min_k \left| W(\alpha)^2 − W_{(−k)}(\alpha)^2 \right|,   (4.2)

where the notation (−k) denotes that the candidate feature k has not been included in the computation of (4.1). Since the norm of the weight vector \mathbf{w} defines the margin (see equation (3.3) on page 13), we select the variable whose removal least changes the distance between the strict class boundaries f(x) = −1, +1.

For computational convenience, at every iteration, when the variable to remove is selected from all those still available, the \alpha’s are left unchanged and only the matrix H is recomputed, with every candidate feature ignored in turn. Moreover, this matrix is computed involving only the support vectors, since only for these examples is \alpha \neq 0. The expedients for the extension of the binary SVM-RFE presented here to the multi-class case can be found in [36].
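The sketch below illustrates one elimination step of this non-linear criterion, assuming a scikit-learn RBF-kernel SVM (whose dual_coef_ attribute stores \alpha_i y_i for the support vectors); the \alpha’s are kept fixed and only the kernel matrix is recomputed with each candidate feature removed in turn. The function name is illustrative, not part of the original method.

```python
import numpy as np
from sklearn.svm import SVC                      # assumed available
from sklearn.metrics.pairwise import rbf_kernel  # Gaussian RBF kernel K(x_i, x_j)

def rfe_step_nonlinear(X, y, gamma=1.0, C=1.0):
    """One RFE step for a non-linear (RBF kernel) SVM: return the index of the
    feature whose removal least changes W(alpha)^2 = alpha^T H alpha (eqs. 4.1-4.2)."""
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
    sv = X[clf.support_]                 # support vectors only (alpha != 0)
    d = clf.dual_coef_.ravel()           # d_i = alpha_i * y_i for each support vector

    def w_squared(Z):
        K = rbf_kernel(Z, Z, gamma=gamma)
        return d @ K @ d                 # alpha^T H alpha, eq. (4.1)

    full = w_squared(sv)
    changes = [abs(full - w_squared(np.delete(sv, k, axis=1)))
               for k in range(X.shape[1])]
    return int(np.argmin(changes))       # feature selected for removal, eq. (4.2)
```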
4.2 Probabilistic SVM output interpretation

4.2.1 Interpretations for decision support

A good model should provide a decision support system with values that can be interpreted in a meaningful way by users, so that appropriate measures may be taken. The classifier presented in this chapter is constructed in such a way that the class membership of a new instance is chosen according to the value taken by the final decision function (3.12). These values can be suitably transformed by post-processing to yield an a posteriori probability (moving from a categorical to a probabilistic forecast). Such probabilities are interpreted as the class membership likelihood of a given example. The method that endows an SVM model with a probabilistic output is presented in detail by Platt in [27]. The following subsections only review the points of this theory which have been used in this thesis.

4.2.2 The sigmoid transform

Applied to such a case, Bayes’ rule allows us to write the posterior probability P(y = 1|f(x)) for a sample x to belong to class +1 as

P(y = 1|f) = \frac{p(f|y = 1) P(y = 1)}{p(f)},   (4.3)

where f is the associated decision function value, p(f) = \sum_{l = −1, +1} p(f|y = l) P(y = l) is its a priori probability, p(f|y = l) is the class-conditional probability of observing the value f and P(y = l) is the prior probability of class l. All of these probabilities can be empirically computed from histogram estimates of the class-conditional densities. This methodology is preferred to a parametric fit of the latter because the popular Gaussian assumption is often violated.

If a scatterplot of P(y = 1|f) versus f is drawn, one obtains a graphical visualization (see figure 4.1) of the positive class membership probabilities conditional on each observed SVM output (decision function f). The goal is to fit an analytically described curve to these plotted points so that, when dealing with a new value f associated with a new sample, we will be able to predict its class +1 likelihood. In particular, it turns out that a sigmoid function of the form

P(y = 1|f) = \frac{1}{1 + \exp(Af + B)}   (4.4)

is, in most cases, very well suited for modeling such a relationship. A and B are the free parameters to tune, with A ∈ R^− (to ensure monotonicity) and B ∈ R.

Figure 4.1: In this example the plus signs indicating posterior class +1 probabilities are extremely well fitted by the tuned sigmoid function. Modified after [27].
4.2.3 Parameters tuning

The method proposed in [27] consists of minimizing the negative log-likelihood of the data set. Every decision function value f_i is associated with the transformed class label of its vector, t_i = (y_i + 1)/2 (t_i = 0 or 1). We thus aim at minimizing

− \sum_i t_i \log(p_i) + (1 − t_i) \log(1 − p_i),   (4.5)

where p_i = \frac{1}{1 + \exp(A f_i + B)}.

One can think of this minimization as a procedure by which one looks for the best function (defined by the parameters A and B) to model the posterior probability of a sample belonging to its actual class, so that the value p_i approaches 0 when the sample is a negative class point (t_i = 0) and approaches 1 when it is a positive class one (t_i = 1). The optimization algorithm proposed in the cited paper is derived from the well-known Levenberg-Marquardt algorithm.

Parameter A controls the slope of the sigmoid, whilst B controls its location along the horizontal axis. If B is equal to 0, there is an exact match between the 0.5 posterior probability and the f = 0 decision function threshold (the sign of f providing the labeling decision). Furthermore, it is important to note that, in order to avoid overfitting, the tuning of the parameters should be carried out on a dataset other than the training set used to produce the predictions: the validation set.
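A minimal sketch of this fit is shown below. It minimizes the negative log-likelihood (4.5) with a generic optimizer rather than the Levenberg-Marquardt variant of [27], and it only initializes A to a negative value instead of enforcing monotonicity as a hard constraint; the function name and the toy decision values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize  # generic optimizer, not the variant used in [27]

def fit_sigmoid(f, y):
    """Fit the parameters A and B of P(y=1|f) = 1 / (1 + exp(A*f + B)) by
    minimizing the negative log-likelihood (4.5) on a held-out validation set."""
    f = np.asarray(f, dtype=float)
    t = (np.asarray(y) + 1) / 2.0                  # transformed labels t_i in {0, 1}
    eps = 1e-12                                    # numerical guard against log(0)

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(nll, x0=np.array([-1.0, 0.0]))  # A started negative for monotonicity
    return res.x                                   # fitted (A, B)

# Usage: decision values from the validation set and their true labels (toy numbers).
A, B = fit_sigmoid(f=[-2.1, -0.7, 0.3, 1.5], y=[-1, -1, 1, 1])
prob_new = 1.0 / (1.0 + np.exp(A * 0.8 + B))       # class +1 probability for f = 0.8
```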
4.3 Active Learning with SVMs

4.3.1 Principles

In the domain of supervised learning, the interest in active learning techniques is due to their ability to provide the classifier with a good, informative subset of training examples. The goal is to let the machine learn the input/output relationships leading to a satisfactory classification performance from the smallest possible number of labeled input vectors.

Initially, the classifier disposes of a Labeled set of training samples (x_i, y_i), referred to as D_L. At the same time, an Unlabeled set of examples D_{UL} (candidate samples \tilde{x} whose class membership y is unknown) is available. At each step of the active learning algorithm, the learning machine should then be able to select from D_{UL}, without knowing the associated class labels, the data point \hat{x} whose addition to D_L, after the identification of its true class label \hat{y}, will lead to the most significant improvement of the classification performance (after retraining on D_L ∪ (\hat{x}, \hat{y})).

Examples of applications of such data-driven techniques in the domain of environmental sciences can be found in the optimization of monitoring networks (soil pollution, radioactivity, etc.) [20], in the reduction of the effort needed to collect ground truth data in remote sensing [37], etc.

4.3.2 Overview of the existing techniques

The active learning field has developed rapidly in recent years and new methods are regularly proposed within the scientific community. In this section, the main SVM-based algorithms are presented. They differ essentially in their sample selection criteria.

The first approach, described in [26], is quite intuitive and consists in looking for the unlabeled examples of D_{UL} located in the proximity of the hyperplane separating the classes. The value of the SVM decision function f for every candidate \tilde{x} is computed using the current training set D_L and then the candidate with the lowest value of |f(x)| or f(x)^2, denoted \hat{x}, is added to D_L. This vector is in fact very likely to become a Support Vector, thus affecting the classification procedure.

Entropy-based query by bagging, a method first proposed in [38] and then comprehensively discussed in [37], takes advantage of the notion of entropy to search for the vectors of D_{UL} whose class membership prediction is the most uncertain (i.e. located closest to the decision boundary, where f(x) tends to 0). Such samples are very informative and will contribute significantly to the set-up of the SVM model. Several SVMs are trained on subsets obtained by bootstrapping from D_L and class labels for every candidate \tilde{x} are predicted. The best candidate, \hat{x}, is selected as the one with the highest entropy computed from the resulting class membership probabilities. Other similar methods use entropy-based indicators, such as the one proposed by Rajan et al. in [31] making use of the Kullback-Leibler divergence, to select the most valuable examples to include in the model.

The last technique briefly illustrated here is thoroughly described in [19]. Kanevski et al. suggest the use of the following algorithm. Successively assign to each candidate \tilde{x} the +1 and −1 class labels y and independently add it to D_L. The SVM is trained on the newly created set and the weights \alpha^+_i and \alpha^−_i, received by the candidate for either labeling, are stored. At this point, a sample importance measure is computed as

\rho(x_i) = \begin{cases} 0, & \text{if } (\alpha^+_i = 0,\ \alpha^−_i = C) \text{ or } (\alpha^+_i = C,\ \alpha^−_i = 0) \\ \dfrac{\alpha^+_i + \alpha^−_i}{2C}, & \text{otherwise.} \end{cases}   (4.6)

One can interpret the first case, where one of the 2 weights is null and the other is equal to C, as the situation where, for one of the labelings, the example not only lies far from the margin region −1 ≤ f(x) ≤ +1 but also lies on the wrong side of the decision boundary (a misclassified atypical example). Hence, such vectors are not points of interest. In the remaining cases we assign a relevance which is the mean value of the \alpha’s scaled by C. This indicator reports the actual average influence of a given sample in the weighted sum defining the SVM output f(x).
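As an illustration, the sketch below implements the first, margin-based selection criterion reviewed above (candidate closest to the separating hyperplane), assuming a scikit-learn SVM; the function name and the surrounding loop are illustrative, not part of the cited methods.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available; any SVM exposing decision_function works

def margin_sampling_step(X_labeled, y_labeled, X_unlabeled, **svm_params):
    """Select the unlabeled candidate closest to the separating hyperplane
    (smallest |f(x)|), i.e. the first selection criterion reviewed above."""
    clf = SVC(**svm_params).fit(X_labeled, y_labeled)
    scores = np.abs(clf.decision_function(X_unlabeled))
    return int(np.argmin(scores))   # index of the candidate to be labeled and added to D_L

# Sketch of the active learning loop: query an oracle (e.g. a human expert) for the true
# label of the selected sample, move it from D_UL to D_L and retrain the SVM.
```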
Chapter 5

Avalanche forecasting as a spatio-temporal classification problem

5.1 Avalanche data from Scotland: the Lochaber region case study

The case study at the core of this thesis, which will be illustrated in the next chapters, deals with the integration of spatial information into a temporal avalanche forecasting model based on SVMs (see [29]) for the Lochaber region, Scotland (map in figure 5.1). The goal is therefore to produce spatially varying avalanche forecasts at the level of the single avalanche paths existing in this renowned mountaineering area. The latter is one of the 5 ski venues in Scotland for which forecasting is carried out on a daily basis during the winter season. The region includes Ben Nevis, the highest mountain in the UK, whose summit reaches 1344 m above sea level. Additional detailed information about avalanche forecasting in Scotland can be found on the official website of the sportScotland Avalanche Information Service, www.sais.gov.uk.

In the considered area, a nearest neighbor model called Cornice is used operationally to assist in forecasting the avalanche days [30]. This system has been briefly presented in subsection 1.2.2, where the prior work carried out on the Lochaber dataset has been reviewed.

Current weather and snowpack conditions are described by a set of 9 variables that are measured or estimated by local forecasters on the slopes or that are registered by an automatic weather station (AWS). The list of the available variables is presented in table 5.1, along with the class of factors each variable falls under. As described in [24], the 3 classes of factors influencing an avalanche release, in order of decreasing influence, are: Class I - stability factors, Class II - snowpack factors, and Class III - meteorological factors. None of these variables belongs to the group of factors most directly related to avalanches, Class I. Stability factors are measured in some cases via stability tests, ski-triggering, etc. by the avalanche experts when executing snow profiles at the pit site.
However, it is difficult to include the information from these tests in the model in terms of variables that can be automatically compared when building the model. The computation of Euclidean distances between all the possible pairs of vectors is not possible if the involved variables are non-numerical or if data are missing for some days. The pit site where the different snowpack factors are measured is chosen every day at a different location by the forecasters, based on their experience. Such a testing place, usually located on the critical slopes of one of the gullies, is assumed to be representative of the average conditions that can be found in the entire region.

Figure 5.1: Northern part of the UK: the Lochaber region is shown labeled with a red balloon.

Variable            Class   Description
Snow index          III     Ordinal index of the precipitation as fresh snow on a day, estimated by the forecasters in the field
Rain at 900 m       III     Binary variable indicating whether rain is falling at 900 m, the altitude of the AWS (“1” if so, “0” otherwise)
Snow drift          II      Binary variable taking the value “1” when experts observe snow drifting during the observation period (“0” otherwise)
Air temp            III     Midday air temperature at the AWS, measured in ◦C
Wind speed          III     24-hour vector mean wind speed from the AWS, reported in m/s
Wind direction      III     24-hour vector mean wind direction from the AWS, reported in ◦
Cloud cover         III     Cloud cover as a percentage of the sky
Foot penetration    II      Penetration of the foot into the snow, measured in cm at the pit site by the forecasters
Snow temperature    II      Snow temperature at a depth of 10 cm at the pit site, measured in ◦C

Table 5.1: List of the 9 meteorological and snowpack variables recorded daily in the winter season.

The 9 variables have been recorded every day of the winter season (roughly 4 months per year) since 1991, as have the avalanche events observed in the region. Such occurrences are documented with a description of the release type (natural or triggered by mountaineers, dry or wet snow, cornice triggered, etc.), notes about injuries or specific conditions related to the event, and spatial information about the location (easting and northing), altitude, slope and aspect.
Only the location coordinates were known for every case, so in order to characterize the events uniformly we had to resort to a Digital Elevation Model (DEM) with 10 meters resolution to obtain the complete set of spatial inputs needed (elevation, slope, aspect). This procedure was also necessary because of typing errors and some subjective, imprecise judgments in the records. The hillshade showing the relief is derived from the elevation grid and is presented in figure 5.2, along with the location of the recorded avalanche events falling on the DEM surface.

Data from the 1991–2007 period were available for this study: information about 688 avalanche events that occurred in 47 different avalanche paths was used. The subset of these avalanche paths located in the area covered by the DEM grid (40 gullies) is reported in figure 5.3.

Figure 5.2: Locations of the 593 documented avalanche cases that occurred in the DEM-covered area

Out of the 593 avalanche events falling on the DEM surface, 224 (37.8%) have been observed in the Ben Nevis sector (cluster of points in the south-western part of map 5.2), 347 (58.5%) occurred in the Aonach Mor range (eastern part of the Lochaber region), while 22 of them (3.7%) took place on the slopes of the Carn Mor Dearg range (center of the map). It is, however, essential to remark that these events are mainly documented by the avalanche experts of the region during their daily outdoor activity and by climbers or mountaineers assumed to be reliable witnesses of the release (online recording forms on www.sais.gov.uk). Therefore, when working with these avalanche reports one has to keep in mind that the list of events is not at all comprehensive. On bad visibility days, spotting a release is difficult and, since snowfalls are quite often related to such conditions, it is very likely that many avalanches have taken place without being observed, either by forecasters or by mountaineers.
Figure 5.3: Locations of the 40 gullies (avalanche paths) in the DEM-covered area

Furthermore, the reporting of avalanches is done much more thoroughly in the Aonach Mor range because of the easy accessibility of its slopes. In fact, there are several ski runs with associated lifts belonging to the Nevis Range resort.

5.2 Set up of the spatio-temporal classification problem

As we saw in the introductory section 1.2, the temporal forecasting carried out in the Lochaber region is performed by considering days with observed avalanche activity as positive examples (class +1) and safe days with no observed avalanches as negative samples (class −1). Thus, the instances being classified are the days of the winter season.

The described set-up then has to be extended to the case where one is interested not only in correctly forecasting avalanche days but also in predicting the locations of the events. Initially, we assign to the positive class the vectors characterizing (spatial location, weather and snowpack conditions) the observed avalanche events. Details about how the features were built are presented in the next section 5.3. To complete the binary classification problem, a negative class is needed as well. The chosen, intuitive approach is to let the class with the −1 label be composed of all the 47 gullies (actually the 40 covered by DEM information) capable of giving rise to an avalanche release on a safe day. Therefore, for every day of the winter season when all the variables listed in section 5.1 could be measured and the visibility allowed avalanche observations but no event was actually documented in the region, we computed all the features describing the local conditions at each avalanche path. These spatially variable features were then combined with the global ones related to the current safe day, which concern the whole mountain domain. A sketch of how this negative class is assembled is given below.
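The following sketch only illustrates the assembly logic; the container names (gully_features, day_features, events, safe_days) are hypothetical and stand for the DEM-derived local descriptors, the global daily conditions, the observed releases and the observation days without a documented release, respectively.

```python
import itertools

def build_dataset(gully_features, day_features, events, safe_days):
    """Assemble the unbalanced spatio-temporal classification dataset:
    positive vectors for the observed events, negative vectors for every
    gully on every safe day."""
    X, y = [], []
    for day, gully in events:                    # class +1: observed avalanche events
        X.append(day_features[day] + gully_features[gully])
        y.append(+1)
    # class -1: every avalanche path combined with every safe day
    for day, gully in itertools.product(safe_days, gully_features):
        X.append(day_features[day] + gully_features[gully])
        y.append(-1)
    return X, y
```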
In this way, a broad list of negative instances was produced to be given to the learning machine. The purpose was to let the classifier train on a set of critical situations which were close to the “safe/event” decision boundary and likely to cross it under slightly different weather conditions. Finally, this results in a binary classification problem where the vectors to discriminate are the daily avalanche activities of the dangerous paths (gullies) located in the forecast area. This becomes a very unbalanced classification task because, as shown by table 5.2, the negative inputs our model will be given considerably outnumber the positive ones (by a factor of approximately 68 to 1).

              Class +1             Class −1
              Avalanche events     Safe gullies
              667                  45240 = 40 gullies · 1131 safe days

Table 5.2: Dataset positive and negative classes resulting in an unbalanced classification problem.

5.3 Choice and conception of the input features

In order to obtain the desired spatio-temporal forecast, the series of daily measurements of meteorological conditions related to snowpack stability described in section 5.1 have to be combined in a sensible way with the spatial description of the terrain morphology available via the DEM of the region under study. The latter, with its relatively high resolution of 10 meters, provides detailed information about the elevation, slope and aspect of the paths where the avalanche events could happen. This results in a “spatialized” set of local condition features whose values change according to the location of the avalanche release point. Additionally, for some of the temporal variables, information about the avalanching conditions recorded in the previous days was also included (at most the 2 preceding days, because of the rapidly changing weather conditions).

The features, created also taking into account the advice of the avalanche experts of the Lochaber region, are therefore designed to account for the relevant factors influencing avalanche activity. The final input vector comprises 39 features: 22 spatio-temporal features (describing local conditions at a given gully or at the release zone) and 17 temporal features with global validity (the same for all gullies). The complete list, with a brief description of the meaning of each variable, is presented in table 5.3. For a subset of features (names tagged with *), additional details about how a given feature has been created are provided hereafter.

The first type of variables requiring some further explanation are those involving the sine and cosine transforms. These features take as input either the wind direction or the aspect. Since such variables report a direction measured in degrees ranging from 0 to 360, clockwise starting from north, it is clear that they cannot be compared directly by a subtraction when looking for dissimilarities (e.g. with the Gaussian RBF kernel). For example, two slopes with a very low value and a very large value will both be north-facing slopes. We get around this peculiarity by taking a sine transform which will project the directions on the “horizontal”