Faculty of Geosciences 
and Environment 
Support Vector Machines for Spatio-Temporal 
Avalanche Forecasting 
Giona Matasci 
Master of Science in Environmental Geosciences 
Supervisors: Prof. Mikhail Kanevski, Dr. Alexei Pozdnoukhov
Experts: Dr. Ross Purves, Devis Tuia
January 2009
Title page image: 
Aonach Mor cornices, source: 
saislochaber.blogspot.com
Abstract 
Statistically based methods for avalanche forecasting have been widely developed, in many regions subject to this kind of natural hazard, to detect avalanche days. Such techniques are often based on simple supervised classification methods like Nearest Neighbors and only focus on the temporal component of the avalanche activity. The purpose of this Master thesis is to build a reliable spatio-temporal forecasting model that is able to efficiently integrate spatial information about avalanche events. The application of machine learning algorithms for pattern recognition, namely Support Vector Machines, is demonstrated with a case study on a dataset from Lochaber, Scotland, UK. Encouraging results were obtained in this extension of the usual forecasting procedure.
The meteorological and snowpack factors globally describing avalanche likelihood in the mountain area have been combined with spatial features (derived from a Digital Elevation Model) related to the avalanche paths where the events have been observed. Thanks to a large database consisting of 17 years of daily condition observations matched with release occurrences, we could develop an effective decision-support tool to assess the avalanche danger at a fine spatial resolution (gullies, particular slopes, etc.).
Interesting results, expressed in terms of confusion matrices for the predictions on a test dataset (forecasts of gully avalanche activity) as well as avalanche danger maps, are presented in this research report. In addition, the behavior of the model in discriminating safe from risky situations under critical changes in the conditions affecting the snowpack proved consistent in a perceptive validation based on the analysis of several observed cases (a specified avalanche path on a given day). Moreover, the use of auxiliary SVM techniques made it possible to automatically highlight the most meaningful features to include in statistical models aimed at successfully predicting avalanche releases in time and space. Finally, using the same state-of-the-art learning machine as a starting point, elements of the sensitivity analysis of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided.
Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal approach, machine learning, supervised classification, kernel methods, Support Vector Machines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
Résumé
Statistical methods for avalanche forecasting have been widely developed in many regions subject to this type of natural hazard. These techniques are often based on simple supervised classification methods such as Nearest Neighbors and focus only on the temporal component of the avalanche danger. The aim of this Master thesis is to build a reliable spatio-temporal prediction model capable of efficiently integrating spatial information about avalanche events. The application of machine learning algorithms for pattern recognition, namely Support Vector Machines, demonstrated with a case study concerning the Lochaber region in Scotland, United Kingdom, yielded encouraging results in this extension of the usual forecasting procedures.
The meteorological and snowpack factors globally describing avalanche conditions have been combined with spatial information (derived from a Digital Terrain Model) related to the avalanche paths where the events were observed. Thanks to a large database consisting of 17 years of daily observations of avalanche conditions and of the associated releases, we obtained an effective decision-support tool to assess the avalanche danger with a good spatial resolution (gullies, specific slope types, etc.).
Interesting results in terms of confusion matrices for the predictions on a test dataset (forecasts of the avalanche activity of the different paths), as well as avalanche danger maps, are presented in this report. Furthermore, the behavior of the model in discriminating safe situations from risky ones, under a critical evolution of the conditions affecting the snowpack, proved very satisfactory after a perceptive validation based on the study of actually observed cases (a well-defined avalanche path on a given day). In addition, the use of auxiliary techniques related to SVMs made it possible to automatically highlight which variables are the most important to include in statistical models aimed at successfully predicting avalanches in time and space. Finally, still using the same high-performing supervised learning method as a starting point, elements on the sensitivity of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided.
Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal approach, machine learning, supervised classification, kernel methods, Support Vector Machines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
Acknowledgments 
First and foremost, I am grateful for the advice and support of both my supervisors during 
the whole Master program. 
I thank Prof. Mikhail Kanevski for having introduced me to the field of machine learning as well as its applications to environmental sciences, and for his interest in my research.
I would like to thank Dr. Alexei Pozdnoukhov, first, for his great availability and patience when supervising me and, second, for having guided me throughout this thesis with constructive suggestions about the topics to focus on. His great help when dealing either with the theoretical aspects of the methods used or with their concrete implementation will not be forgotten. Thank you Alexei!
Dr. Ross Purves is acknowledged for the interesting discussions about avalanche forecasting 
in Scotland and for the useful hints provided. 
Moreover, I also greatly appreciated the help and ideas given to me by Devis, Loris, Fréd and the rest of the geomatics group at IGAR during the work for my Master thesis.
A big and deep “grazie” is addressed to my family, in particular to my parents Franca 
and Sandro, for the support they provided me during these years spent at the university in 
Lausanne. 
All my friends scattered across Switzerland, as well as the "spécialisation 2" crew of the Master, deserve gratitude for the fun moments spent together during this period.
Last but not least, I am grateful to the “US” relatives, namely Louis and Caroline, for 
proofreading the English. 
...and all those I forgot, thank you!
Contents 
1 Introduction 1 
1.1 Objectives and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 
1.2 Prior work on data-driven statistical avalanche forecasting . . . . . . . . . . . 2 
1.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 
1.2.2 Prior work on avalanche forecasting in the Lochaber region . . . . . . 3 
2 Machine Learning 6 
2.1 Supervised learning vs. unsupervised learning . . . . . . . . . . . . . . . . . 6 
2.1.1 Nearest Neighbors for classification . . . . . . . . . . . . . . . . . . . . 7 
2.2 Statistical Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 
2.2.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . 9 
2.2.2 Structural Risk Minimization . . . . . . . . . . . . . . . . . . . . . . . 9 
2.3 Model selection and model assessment . . . . . . . . . . . . . . . . . . . . . . 11 
3 Support Vector Machines for classification 12 
3.1 Large margin linear classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 
3.1.1 Optimal separating hyperplanes . . . . . . . . . . . . . . . . . . . . . . 12 
3.1.2 The optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . 14 
3.1.3 Support Vectors and their relevance . . . . . . . . . . . . . . . . . . . 15 
3.1.4 Soft margin adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 15 
3.2 Kernel expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.1 The principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.2 A concrete example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 
3.2.3 Valid kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 
3.2.4 Details on the Gaussian RBF kernel . . . . . . . . . . . . . . . . . . . 19 
3.3 Parameters tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 
3.4 Binary classification quality measures . . . . . . . . . . . . . . . . . . . . . . 21 
4 Extensions of the SVMs-based approach 24 
4.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 
4.1.1 Methods overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 
4.1.2 SVM-Recursive Feature Elimination . . . . . . . . . . . . . . . . . . . 25 
4.2 Probabilistic SVM output interpretation . . . . . . . . . . . . . . . . . . . . . 26 
4.2.1 Interpretations for decision support . . . . . . . . . . . . . . . . . . . . 26 
4.2.2 The sigmoid transform . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 
4.2.3 Parameters tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 
4.3 Active Learning with SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 
4.3.1 Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 
4.3.2 Overview of the existing techniques . . . . . . . . . . . . . . . . . . . . 28 
5 Avalanche forecasting as a spatio-temporal classification problem 30 
5.1 Avalanche data from Scotland: the Lochaber region case study . . . . . . . . 30 
5.2 Set up of the spatio-temporal classification problem . . . . . . . . . . . . . . . 33 
5.3 Choice and conception of the input features . . . . . . . . . . . . . . . . . . . 34 
6 Prediction of avalanche activity at individual paths 38 
6.1 SVM training and parameters tuning . . . . . . . . . . . . . . . . . . . . . . . 38 
6.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 
6.1.2 Model optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 
6.2 Predictions for years 2006-2007 . . . . . . . . . . . . . . . . . . . . . . . . . . 44 
6.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 
6.2.2 Comments and observations . . . . . . . . . . . . . . . . . . . . . . . . 46 
7 Avalanche danger mapping 50 
7.1 Avalanche danger assessment: probabilistic SVM output tuning . . . . . . . . 50 
7.2 Mapping on the prediction grid . . . . . . . . . . . . . . . . . . . . . . . . . . 51 
7.3 Gradient mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 
8 Extended analysis of avalanche data with SVMs-related methods 56 
8.1 Relevant features choice: RFE . . . . . . . . . . . . . . . . . . . . . . . . . . 56 
8.1.1 Set up of the automatic procedure . . . . . . . . . . . . . . . . . . . . 56 
8.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 
8.1.3 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 
8.2 Model behavior under changing conditions . . . . . . . . . . . . . . . . . . . . 60 
8.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 
8.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 
8.2.3 Results and interpretation . . . . . . . . . . . . . . . . . . . . . . . . . 62 
8.3 Active Learning as an exploratory tool in avalanche monitoring . . . . . . . . 66 
8.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
8.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 
9 Conclusions 69 
9.1 Main achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 
9.2 Further work on this topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 
A European Danger Scale 76 
B Avalanche danger maps 77
C MATLAB code: gradient mapping 79
Chapter 1 
Introduction 
1.1 Objectives and motivation 
The machine learning domain, presented in chapter 2, has provided many scientific research fields, especially in the last few years, with a solid framework based on a wide variety of techniques aimed at the analysis of datasets of increasing complexity and size.
In particular, the environmental sciences appear to be one of the subjects well matched to such methods. Among the broad variety of subfields related to geosciences, the latest progress in the automatic extraction of dependencies from data has found great application in the forecasting of natural hazards, a theme frequently discussed during the attended Master program. Predictive models founded on concepts drawn from machine learning are robust and very well suited for operational danger assessment purposes.
From this point of view, the topic of avalanche forecasting shows significant potential for promising developments. The statistical approach frequently used to evaluate the likelihood of snow releases on the slopes of a mountain (see the prior work reviewed in section 1.2) can be improved to obtain an extended and enhanced decision-support system helping avalanche forecasters in their daily job. However, the main purpose of this work is to explore the possible applications of several machine learning techniques in this research field, without focusing particularly on the issues affecting operational aspects of forecasting. The reasons behind such an approach are mainly that studies joining these two domains are in their early stages, and that my knowledge of the specificities of the avalanche forecasting process is not adequate compared to that of forecasters with years of experience.
Nonetheless, the scope of this work is to build a reliable predictive model aimed at giving an efficient spatial extension to the forecasting systems originally designed to produce predictions about global avalanche activity over a whole region. Therefore, the morphological characteristics of the mountain range terrain affecting local-scale weather and snowpack conditions will be taken into account by the presented learning machine.
The core of the analysis is centered on the well-known supervised classification method named Support Vector Machines (SVMs). This product of Statistical Learning Theory will be discussed in chapter 3. The performance of such a classifier when dealing with high-dimensional data will allow the incorporation of a wide range of features describing avalanching conditions at the level of single avalanche paths. The classification problem will be set up by matching these variables with the related actual activity of a given gully, giving rise either to an avalanche event or to a safe situation. This spatio-temporal approach to avalanche forecasting is described in chapter 5, while the results in terms of the classification quality of the predictions for the 2006 and 2007 winter seasons are reported in chapter 6.
While focusing on SVMs as the main root of the methodological part of the work, the objectives of the research also include developing some tools, based on the classical machine learning/SVM data-driven approaches described in chapter 4, used to highlight some properties of the avalanche hazard studied by taking into account the spatial variation of the phenomenon. The feasibility of mapping the avalanche danger over the region under study will be considered in chapter 7. Then, we attempt to identify the most useful features to involve in the classification task by assessing their real influence on the decisions taken by the model and on the evolution of the avalanche danger. Next, we investigate the actual sensitivity of the model to changing meteorological and snowpack conditions. Furthermore, some suggestions are given for the possible optimization of the information gathering procedure through improvements in the avalanche monitoring task. All these topics will be covered in chapter 8.
This thesis extends the previous work on this topic (see [29]) carried out by Dr. Alexei Pozdnoukhov during his post-doctoral fellowship at the Institute of Geomatics and Analysis of Risk (IGAR) of the University of Lausanne (information about the main research achievements is available at www.geokernels.org). The case study that will be treated concerns the Lochaber region, located in the Scottish Highlands, UK, which is subject to numerous avalanche events during the winter season. Avalanche data collected on the slopes of these mountain ranges were available thanks to the previous collaboration between IGAR and the sportScotland Avalanche Information Service (www.sais.gov.uk) and to the contribution of Dr. Ross Purves.
1.2 Prior work on data-driven statistical avalanche forecasting
1.2.1 Overview 
Avalanche forecasting is a crucial task for many winter resorts where numerous skiers, mountaineers and climbers are present every day. The procedure, which results in a report of avalanche conditions with the associated danger, is carried out manually by the forecasters of the region. These experts are in the field every day to understand the evolution of the different factors affecting avalanche releases. Information about snowpack conditions and stability, weather parameters and actual avalanche activity is collected by the observers on a daily basis.
Nevertheless, in some skiing venues, numerical models are available to support the decisions taken based on the experience of the forecasters. Some physical models exist to aid in the assessment of snowpack evolution (see [1] for the case of Switzerland) but, generally, statistical forecasting systems are much more commonly used.
models are devoted to the prediction of current avalanche activity by looking for similari-ties 
with conditions influencing releases recorded in the past (meteorological and snowpack 
factors essentially). 
The statistical models currently operationally used or tested on real avalanche data are 
producing temporal forecasts about global avalanche activity in a given region on a given 
day. Avalanche days and safe days are discriminated using several different statistically 
based techniques belonging to the supervised learning category (pattern recognition). These 
methods include discriminant analysis [13], regression trees and Nearest Neighbors [3]. 
The last technique mentioned is widely applied for operational forecasting in many different countries. For example, in Switzerland the NXD system (NXD2000 and NXD-REG, described in [14] and [2]), developed by the Swiss Federal Institute for Snow and Avalanche Research (SLF), is used at local and regional scales to help experts produce the final avalanche danger reports. These specialists receive as model output the 10 days most similar (among those included in the database of past observed conditions) to the current day's situation. By checking under which conditions and in which locations avalanches were observed on these days, they obtain concrete, helpful information to use in assessing the actual avalanche danger. The next subsection will illustrate the use of these nearest neighbors methods in Scotland.
1.2.2 Prior work on avalanche forecasting in the Lochaber region 
The case study that will be discussed throughout this thesis concerns avalanche forecasting, namely forecasting that includes the spatial component of the avalanche activity, in the Lochaber area, Scotland, United Kingdom. In this introductory part of the work I will present a short survey of the work done in this field using the same avalanche data.
Nearest Neighbors model Cornice 
Purves et al. in [30] describe the Nearest Neighbors model developed for the operational forecasting of avalanche activity in the Scottish mountainous region under study. In conjunction with local avalanche forecasters, the scientists involved in this project implemented a decision-support system called Cornice which provides useful information about past avalanching conditions, helpful in producing a reliable hazard report. The forecasts are made available in the afternoon (around 3 pm) and include a description of the situation experienced during the day as well as the expected development of avalanche activity over the next 24 hours.
The model takes as inputs different meteorological and snowpack variables influencing 
the release of avalanches in the region (a list of the available variables is given in table 5.1 
in section 5.1). A historical database starting in 1991 is then searched. The outputs consist 
of the values taken by the same input variables during the 10 most similar recorded days 
(Euclidean distance of equation (2.1) on page 7 as a dissimilarity measure). Additionally,
the spatial locations of the documented avalanche events occurring during these days are 
also shown on a geo-referenced map. Hence, both the causes, in terms of weather/snowpack 
conditions, and the consequences, in terms of possible avalanche events, are available to the 
forecasters.
The model developers did not use subjective weighting of the inputs based on forecasters' experience, but instead chose to implement an automated procedure to find the optimal weights. The optimization of the variables' relevance was carried out by means of genetic algorithms, using several fitness metrics to evaluate the ability of different sets of weights to correctly forecast avalanche and non-avalanche days. For both the optimization of the parameters and the verification (testing) of the model, on a given day, a forecast of avalanche activity is produced if 3 or more of the 10 nearest neighbors were avalanche days. If this threshold is not reached, the day under examination is forecast as safe.
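To make the decision rule concrete, here is a minimal Python sketch of a Cornice-style forecast. The arrays, weights and distance weighting are hypothetical placeholders, not the actual Cornice implementation.

```python
import numpy as np

def forecast_day(query, past_days, past_labels, weights, k=10, threshold=3):
    """Forecast an avalanche day if at least `threshold` of the k nearest
    past days (weighted Euclidean distance) were avalanche days."""
    diffs = (past_days - query) * weights
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k most similar days
    return int(past_labels[nearest].sum() >= threshold)

# Toy usage: 1000 past days described by 8 weighted variables
rng = np.random.default_rng(0)
past_days = rng.normal(size=(1000, 8))
past_labels = rng.integers(0, 2, size=1000)   # 1 = avalanche day, 0 = safe day
weights = np.ones(8)                          # weights found by the genetic algorithm
print(forecast_day(rng.normal(size=8), past_days, past_labels, weights))
```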
The batch testing of the model (assessing the generalization error by cross-validation) was carried out on 1323 days (in practice 1005, because of no-visibility days), covering the years 1991 to 2002, in order to evaluate the agreement of the model forecasts with the observations. The results can be summarized with binary confusion matrices (contingency tables) from which several categorical statistics can be computed (see section 3.4). The best prediction performances were obtained with an optimization via either the Hanssen and Kuipers discriminant or the unweighted average accuracy, leading to an Overall Accuracy of 0.83 and to a Hanssen and Kuipers discriminant value of 0.61. The models correctly forecasted slightly more than 200 avalanche days, with only approximately 60 misses and 115 false alarms.
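These scores can be reproduced approximately from the counts quoted above, assuming the remaining days are correct non-avalanche forecasts and taking the Hanssen and Kuipers discriminant in its usual form, hit rate minus false alarm rate (a sketch, not the authors' verification code):

```python
# Approximate counts from the Cornice batch test (1005 usable days)
hits, misses, false_alarms = 200, 60, 115
correct_negatives = 1005 - hits - misses - false_alarms   # ~630

overall_accuracy = (hits + correct_negatives) / 1005
# Hanssen and Kuipers discriminant = hit rate - false alarm rate
hk = hits / (hits + misses) - false_alarms / (false_alarms + correct_negatives)
print(round(overall_accuracy, 2), round(hk, 2))   # ~0.83 and ~0.61
```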
The Cornice application produced quantitative results considered very encouraging by the 
authors. However, its main utility is clearly recognized as a support for the forecasters in the 
information gathering and hypothesis testing process allowing avalanche danger assessment. 
Support Vector Machines model 
This temporal avalanche forecasting approach was revisited by Pozdnoukhov et al. in [29], who applied machine learning methods to the Lochaber dataset with the purpose of increasing the accuracy of the predictions. In this work, the high-performing supervised classifier called Support Vector Machine is used first to improve the discrimination ability in the temporal predictive task (avalanche days vs. non-avalanche days), and then applied to make a preliminary extension towards spatial avalanche danger forecasting.
The adopted methodology was centered on a purely data-driven approach, starting with the selection of the relevant features to be employed using the automated procedure called Recursive Feature Elimination (see section 4.1). An initial set of 44 variables, comprising combinations of the variables measured on the slopes (current day features, previous days features, expert features), was filtered by retaining the 20 most valuable non-redundant features for the classification task.
After SVM parameter optimization by cross-validation on the winters from 1991 to 2001,
a test of the model performance was carried out on the 712 days of observations in the 
period 2001-2007. The method showed a satisfactory ability to detect avalanche days: the 
Overall Accuracy reached 0.86 whilst the Hanssen and Kuipers discriminant scored 0.64. A 
comparison with nearest neighbors methods applied on the same dataset demonstrated a 
slight superiority of the SVM technique. 
Furthermore, a transform of the SVM decision function into a probability (see section 4.2 for details about the method) allowed a reliable interpretation of the outputs of the model in terms of the likelihood of an avalanche occurring on a given day (application to the 2003/2004 winter).
Given the well-known ability of this machine learning method to deal with high-dimensional 
data, an additional set of spatially varying features such as altitude, slope or aspect was added 
to the vector describing the avalanching conditions on a specified day. The purpose was to 
characterize the local situation at each avalanche path of the Lochaber region by providing 
the model with examples of about 700 avalanche events whose spatial attributes have been 
documented. The authors have then been able to extrapolate the avalanche activity indicator 
over the whole study area thanks to a digital elevation model (DEM). 
Such a spatio-temporal approach has been presented as an early result of a procedure needing refinements and further work aimed at assessing the validity of the results. This initial work, as well as some improvements (spatial distribution of some meteorological features such as wind fields, etc.) already put into practice by the cited researchers (see [28]), is taken as the starting point for this thesis.
Chapter 2 
Machine Learning 
The broad research field of machine learning, which has developed rapidly in the last decades, is often described as a subtopic of computer science whose underlying concepts and ideas derive from closely related domains such as statistics and artificial intelligence.
The notion of machine learning can be presented, in an overall view, as a collection of techniques that are able to "learn" from examples the dependencies existing in the data affecting a given predictive task (the tasks are described in section 2.1). The different methods are designed so that the learning procedure takes place in an automatic and data-driven way. This means that, in general, no human prior knowledge or assumptions concerning data probability distributions are used during the process. For a good foundation on the topic and for additional information, [6] is suggested.
The fields, and related real-world applications, addressed by these state-of-the-art techniques are countless. Early adopters include bioinformatics/biometry (biosequence analyses), chemistry (cheminformatics/chemometrics), medicine (diagnoses), data mining (financial data), web and text mining (text or webpage categorization), speech and hand-written character recognition, etc.
Nevertheless, the development of research in the area of environmental sciences took place only later, with applications in domains such as spatial interpolation, classification of remotely sensed images, etc. (see [17], [18]). In fact, the modeling of geo-spatial phenomena would benefit greatly from the operational use of the latest breakthroughs that have occurred within the machine learning community. Avalanche forecasting in particular, the topic of this thesis, is one of the geosciences domains for which machine learning methods show much promise [29].
2.1 Supervised learning vs. unsupervised learning 
Machine learning methods may be classified into the categories of supervised and unsupervised learning.
Supervised learning can be thought of as a process by which a learning machine is guided through a training procedure to learn the input/output relationships existing in the data set. These examples are called the training data. Each individual sample/example is described by an input vector x belonging to R^N, usually referred to as the input, and presents a related known output y. This means that each sample can be represented as a vector in an N-dimensional space (N variables). Depending on the type of the y value one can define the task as a regression problem or a classification problem (pattern recognition). In the first case the output associated with a given input is a real value y ∈ R. In the second case, with which this thesis will be dealing, output values are discrete, resulting in a binary classification task if y ∈ {−1, 1} or in an m-class classification task if y ∈ {1, 2, . . . , m}. The learning machine, after having seen all L training examples {(x_1, y_1), . . . , (x_L, y_L)}, then provides an estimate of the original function y = f(x) mapping the inputs to the output domain.
The other learning approach may be termed unsupervised learning. In this case the 
learning machine is not provided with the outputs y and the method goal is to extract 
information about the process which generated the data. The main types of this kind of 
learning are clustering (also known as cluster analysis) and density estimation. The first one 
listed is concerned with the grouping of the data points into clusters whose members have 
similar characteristics, without knowing their true class labels. Density estimation methods 
attempt to model the underlying probability distribution of a certain observed phenomenon. 
Combinations of the supervised and unsupervised domains are also possible resulting in 
semi-supervised learning, an approach where labeled and unlabeled examples are provided at 
the same time to the learning machine. A summary of these hybrid techniques, implemented 
to make use of all the available information in order to improve the predictive model, can be 
found in [5]. 
The present thesis is mainly dealing with the supervised approach for binary classification 
problems. The chosen learning system, and its associated tools, is known as Support Vector 
Machines (SVMs). The technique is part of the subfield of machine learning referred to as 
kernel methods [35]. This supervised classification method based on the so-called Support 
Vectors will be detailed in chapter 3. In [11] the reader will find a comprehensive description 
of other supervised learning techniques. These include Fisher’s Linear discriminant analysis, 
Logistic regression, Decision trees, Multi-Layer Perceptrons, Probabilistic Neural Networks, 
k-Nearest Neighbors, etc. The latter will be discussed in the next subsection (2.1.1) since it 
is a benchmark method widely used in avalanche forecasting. 
2.1.1 Nearest Neighbors for classification 
The technique called k-Nearest Neighbors (k-NN) is probably the most intuitive method to solve a classification problem. One can reasonably expect that similar inputs x, in other words examples described by variables taking analogous values, will in most cases possess the same output class label y. This leads to a decision about the class membership of a new point x based on its Euclidean distance (see equation (2.1)) to the training samples x_i. This dissimilarity measure between samples u and v is computed as

dist(u, v) = \sqrt{\sum_{d=1}^{N} (u_d - v_d)^2},    (2.1)

where d is the variable index.
In order to predict the class label y of the vector currently under consideration, a majority vote is set up between the k nearest examples (the k smallest distances) found in the N-dimensional input space.
With a fixed distance measure, the only parameter to tune to get the optimal accuracy in the class label assignments is the number k of neighbors to include in the decision vote. Essentially, choosing a low value of k corresponds to assuming that the data are not corrupted by noise (a structured dataset), so that a close correspondence can be established between the training vectors at our disposal and the new ones whose label y should be forecast. On the contrary, choosing a large k in most cases means that we believe the configuration of the training examples to be largely unstructured, making the input/output matching difficult. This gives rise to a decision process involving a larger set of neighboring examples, approaching a simple global majority vote as k tends to the number of training samples L.
The approach presented here provides good results, particularly for low-dimensional datasets. Due to this success, as well as its appealing logic, k-Nearest Neighbors is often used as a reference technique. On the other hand, when dealing with many variables, this algorithm suffers from the so-called curse of dimensionality. In a high-dimensional input space, new samples whose labels are to be predicted by looking at their neighborhood are often found to be almost equally far from all the training inputs, precluding any reliable prediction.
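For illustration, a minimal NumPy sketch of the k-NN decision rule described above, on toy data (not the operational implementation used in avalanche forecasting):

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=5):
    """Classify x_new by majority vote among its k nearest training samples
    (Euclidean distance of equation (2.1))."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    # Majority vote between the class labels of the k neighbors
    return 1 if (votes == 1).sum() > (votes == -1).sum() else -1

# Toy usage with labels in {-1, +1}
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)
print(knn_predict(np.array([0.5, 0.2, -0.1]), X_train, y_train, k=7))
```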
2.2 Statistical Learning Theory 
In the domain of machine learning, Statistical Learning Theory [39], also known as Vapnik-Chervonenkis theory and first developed by V. Vapnik in the 1960s and 1970s, provides a sound framework for so-called predictive learning. The main goal of this theory is the optimal assessment of a model according to a trade-off between its ability to honor the available information and its complexity.
As stated in section 2.1, a supervised learning model, at the end of the training period, retains a function performing the mapping y = f(x), typically called the decision function for a classification problem. This function is chosen from a set of functions F = {f(x, α), α ∈ Λ}, where α represents a vector of parameters selected from the set Λ. According to Vapnik's concepts, the criterion used to evaluate the goodness of the choice of a given function f(x, α), in other words its similarity to the unknown target function that depicts the actual input/output dependencies, is the following risk functional, called the expected risk:

R(\alpha) = \int Q(y, f(x, \alpha)) \, dP(x, y),    (2.2)

where Q(y, f(x, α)) is a task-defined loss function and P(x, y) is the unknown joint probability distribution of the examples. As can be intuitively understood, the risk should be as low as possible, so our goal is to minimize the expected average loss (2.2).
Reviewing the two main learning problems already mentioned (omitting clustering and density estimation), let us introduce the loss function most commonly used in pattern recognition:

Q(y, f) = \begin{cases} 0 & \text{if } f(x) = y \\ 1 & \text{otherwise.} \end{cases}    (2.3)
For such a loss function, the resulting expected risk is nothing but the probability of a 
classification error. 
In the domain of regression problems the aim is to minimize the differences between the 
actual output value y and the predicted one f(x) for every example. This is translated into 
mathematical terms, in most cases, by means of the squared loss function 
Q(y, f) = (y - f(x))^2.    (2.4)
2.2.1 Empirical Risk Minimization 
Once the principles allowing us to evaluate the performance of a learning machine have been defined, Statistical Learning Theory reminds us that, in fact, the distribution P(x, y) of equation (2.2) is unknown, so that the only known input/output pairs are those of the given finite set of examples. The first thought is to approximate the theoretical risk functional by an empirical one, simply computed on the training examples as

R_{emp}(\alpha) = \frac{1}{L} \sum_{i=1}^{L} Q(y_i, f(x_i, \alpha)),    (2.5)
where L is the number of training samples. A minimization of this functional, the Empirical Risk Minimization, is then carried out in order to select the best set of parameters α. However, such a choice is strongly dependent on the examples provided to the learning machine for training. As discussed in more detail in section 2.3, it is possible to partially circumvent this drawback by using a cross-validation methodology or by splitting the initial dataset into two parts (use of an independent set of data). Additionally, in the same section it will be explained that, when aiming to evaluate the overall performance of the learning machine, yet another set of examples is required.
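As a small illustration, the empirical risk (2.5) under the 0-1 loss (2.3) reduces to the fraction of misclassified training samples; a sketch with a hypothetical decision function:

```python
import numpy as np

def empirical_risk_01(predict, X, y):
    """Empirical risk (2.5) with the 0-1 loss (2.3): the fraction of
    training samples that the decision function misclassifies."""
    y_pred = np.array([predict(x) for x in X])
    return np.mean(y_pred != y)

# Example with a trivial decision function f(x) = sign(x[0])
X = np.array([[0.3, 1.2], [-0.7, 0.1], [1.5, -0.4], [-0.2, -0.9]])
y = np.array([1, -1, 1, 1])
print(empirical_risk_01(lambda x: 1 if x[0] >= 0 else -1, X, y))  # 0.25
```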
2.2.2 Structural Risk Minimization 
In the theoretical framework of Statistical Learning Theory, with the purpose of considering the ability of a model to extend the learnt relationships to unobserved new data, the notion of Structural Risk Minimization is introduced.
Essentially, the idea is to place an upper bound on the expected risk (2.2) which combines the empirical risk and a confidence interval such that

R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\,(\log(2L/h) + 1) - \log(\eta/4)}{L}},    (2.6)

where L is the number of training samples and h is the so-called Vapnik-Chervonenkis dimension (VC-dimension) of the set of functions used [39]. The resulting inequality, which holds with probability 1 − η, is a particular bound valid only for the classification case.
The quantity h deserves some further explanation because it is one of the main concepts of Vapnik's theory. For a binary classification problem, h can be interpreted as the maximum number of samples for which a class-consistent partitioning can be achieved using the given set of functions. A two-dimensional data set consisting of 3 vectors can always be separated with a linear function, no matter what the labeling of the points is. A difficulty occurs if the samples to shatter become 4: a chessboard-like setting forbids any valid linear separation. Finally, we can state that linear decision functions in R^N, hyperplanes of the form f(x) = w · x + b, possess a VC-dimension of N + 1. In comparison, a polynomial function of degree 2 applied in R^2 has a VC-dimension of 4 and, as a borderline case, for the function f(x) = b sin(wx) this quantity is equal to infinity (high frequency for a large ‖w‖, allowing the separation of every possible configuration of points).
Looking at equation (2.6), it may be seen that the expected risk is minimized when the confidence interval, the second term on the right side of the inequality, is kept small by a low h/L ratio. By the mentioned inequality, a function with a large VC-dimension h which perfectly fits a small number of data points L will result in a large expected risk, since there is overfitting. Such a complex model will likely lead to a considerable generalization error. Figure 2.1 illustrates how the bound on the risk varies depending on model complexity.
Figure 2.1: Bound on the risk varying according to the confidence interval and the empirical risk associated with sets of models of increasing complexity. After [39].
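To make the role of the confidence interval tangible, a short sketch evaluating the second term of (2.6) for a fixed sample size and increasing VC-dimension (the numbers are purely illustrative):

```python
import numpy as np

def vc_confidence(h, L, eta=0.05):
    """Confidence term of the bound (2.6) for VC-dimension h,
    L training samples and confidence level 1 - eta."""
    return np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(eta / 4)) / L)

L = 500
for h in (5, 50, 250):
    print(h, round(vc_confidence(h, L), 3))
# The term grows with h/L: richer function sets pay a larger capacity penalty.
```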
To summarize, the Structural Risk Minimization principle provides a theoretical framework for achieving the optimal trade-off between the classification accuracy on training data and the capacity of the set of functions selected. Later on, in subsection 3.1.4 of chapter 3, we will have a look at the concrete means the SVM algorithm provides to handle this kind of issue.
The next section illustrates the general procedure adopted when using a supervised learning approach.
2.3 Model selection and model assessment 
The preceding sections have discussed how Statistical Learning Theory allows the evaluation of the performance of a model with respect to its complexity. When one concretely applies a supervised classification algorithm, there are several practical considerations that need to be respected in order to use the method properly.
First, the model selection step is crucial. The fact that the empirical error (training error) is computed on the training examples given to the learning machine should be taken into consideration when choosing the optimal parameters. A model that closely or perfectly fits noisy or non-representative training data (see the example of figure 2.2) is said to overfit (as opposed to a too simple model, which gives rise to the situation called underfitting). Overfitting results in a poor generalization ability of the system when dealing with new data. The tuning of the parameters defining the model must therefore be carried out on an independent data set (different from the training one). A set of labeled examples called the validation set is extracted from the original data and held separate from the training subset in order to compute the classification quality measures (validation error, etc.). Predictions of class memberships are performed on the validation set ignoring the actual known class labels, so that the agreement between the true and predicted class assignments can then be checked. An optimization process allows the user to determine the best parameters for the classification task.
Figure 2.2: Example of an overfitting situation for a binary classification problem. The green discriminating boundary perfectly separates the red and blue points by overfitting this training data. The classifier shown in black allows some training errors but will then be able to predict the class labels of a new set of data in a more robust way.
Another split of the data is mandatory if one desires to assess the generalization error of the selected model (model assessment). An independent test set should be used, whenever possible, to assess the true performance of the model. The performance is in this way estimated on independent data, reproducing the future behavior in a new situation. In fact, it is not fair to report the performance obtained on the previously used validation set as a measure of the model's success, because the learning machine is favorably biased towards this data (parameters perfectly tuned for this set) [17].
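A sketch of this model selection/model assessment workflow with scikit-learn, on synthetic data (the classifier, parameter grid and split proportions are illustrative choices, not those adopted later in the thesis):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

# Hold out an independent test set for the final model assessment
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Model selection: tune the parameters by cross-validation on the
# remaining data (an internal validation procedure)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X_trainval, y_trainval)

# Model assessment: report the error on data never used for tuning
print(grid.best_params_, grid.score(X_test, y_test))
```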
Chapter 3 
Support Vector Machines for 
classification 
This chapter focuses on the learning machine that is at the core of almost every step of the analyses performed in this thesis. The system implementing the training of a supervised classifier in an efficient and robust way is Support Vector Machines (SVMs). Moreover, SVMs adhere to the guidelines provided by the Statistical Learning Theory discussed in section 2.2 of the previous chapter. Section 3.1 examines how and why a linear decision function can optimally be used as a foundation for the classification task when applied in the high-dimensional space induced by the kernel expansions delineated in section 3.2.
3.1 Large margin linear classifier 
3.1.1 Optimal separating hyperplanes 
When dealing with a problem where different objects have to be divided into two categories by placing a discriminating boundary, the most intuitive option is to draw a separating line. This is exactly the principle applied by SVMs.
More generally, in an N-dimensional space, the line becomes a hyperplane f(x) = w · x + b. The input vector x ∈ R^N is multiplied by a weight vector w which needs to be optimized along with the scalar b. In 2D (2 variables x_1 and x_2 describing the examples) the resulting function gives the equation of a surface with coordinates (f(x), x_1, x_2). If a horizontal plane is defined at the height of the level curve f(x) = 0, linearly separating the data points, and if these vectors are labeled following the sign of the function f(x), then they are classified in the positive class if lying above the f(x) = 0 surface or, otherwise, in the negative class (below the horizontal plane).
In order to construct an optimal hyperplane for a linearly separable case, let us define some strict conditions for the class-labeling task it carries out. For the training dataset, the values of the decision function f(x) should respect

w \cdot x_i + b \ge +1, \quad \text{if } y_i = +1,
w \cdot x_i + b \le -1, \quad \text{if } y_i = -1.    (3.1)

A positive sample (y_i = +1) should therefore be associated with a decision function value greater than or equal to +1 and, on the other hand, a negative input (y_i = −1) should be given a value less than or equal to −1. These two parts of equation (3.1) can be merged into

y_i (w \cdot x_i + b) \ge 1.    (3.2)
This formulation tells us that there should not be any training vector lying in the region where the hyperplane takes values between +1 and −1, and that only a few points will lie exactly on the level curves of height +1 or −1.
As can be seen in figure 3.1, the samples located on these level curves are called support vectors (SVs) and the region between the positive one (f(x) = +1) and the negative one (f(x) = −1) is referred to as the margin, of width ρ. Obviously, the decision boundary between the two classes becomes the hyperplane f(x) = w · x + b = 0.
Figure 3.1: Geometrical representation (2D) of the location of the SVs and the resulting class margins around the separating hyperplane w · x + b = 0. Following [19].
The goal of a classifier is to generalize the rules learned from the training data to situations where new instances have to be classified. Thus, if one tries to place the separating hyperplane in such a way that most of the new data points will be found on the correct side of the class boundary, the solution consists in looking for the largest possible margin.
The small-margin hyperplane visible on the left side of figure 3.2 correctly splits the training points (solid colored marks) of the two classes (circles vs. crosses), but when testing examples (grey marks) are introduced it reveals a poor generalization ability (many misclassification errors). On the contrary, the large margin obtained on the right side is robust and is more likely to classify the new samples correctly.
The width of the margin ρ can be easily computed as

\rho = \frac{w}{\|w\|} \cdot (x^{+} - x^{-}) = \frac{w \cdot x^{+} - w \cdot x^{-}}{\|w\|} = \frac{(1 - b) - (-1 - b)}{\|w\|} = \frac{2}{\|w\|},    (3.3)
Figure 3.2: The introduction of the testing samples (in grey) leads to many classification errors when the margin is not optimized (left figure). Modified after [19].
where w is the vector defining the hyperplane, x^{+} is one of the positive class SVs (contributing to the margin definition) and x^{-} is a negative class SV.
As shown by equation (3.3), the goal is to minimize ‖w‖. This intuitive minimization problem is theoretically justified by the insights of Statistical Learning Theory [39], which states that the complexity h of the set of functions is bounded by

h \le \min(R^2 \|w\|^2, N) + 1,    (3.4)

where R is the radius of the smallest sphere enclosing all the training vectors belonging to R^N. Consequently, a large margin, implying a small ‖w‖, helps keep the capacity of the model low and thus efficient.
3.1.2 The optimization problem 
In order to accomplish the training of the machine, we are faced with an optimization problem: SVMs provide an efficient algorithm to maximize the margin (3.3) whilst respecting the constraints (3.2).
Taking advantage of the concepts of the constrained optimization paradigm (Lagrangian theory), developed by Lagrange at the end of the 18th century, and of the extensions provided in the 1950s by Kuhn and Tucker, the following results can be derived. After having introduced the Lagrange multipliers α_i ≥ 0 associated with the training inputs x_i, one can express the so-called primal formulation of the optimization problem (primal Lagrangian) as

L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{L} \alpha_i.    (3.5)
Since we are looking for the maximal margin (minimal ‖w‖), the task consists in minimizing (3.5) with respect to w and b. Because of the convexity of the function L_P, this is done by searching for the values at which the associated derivatives (3.6) vanish:

\frac{\partial L_P(w, b, \alpha)}{\partial b} = 0, \qquad \frac{\partial L_P(w, b, \alpha)}{\partial w} = 0.    (3.6)
The resulting conditions,

\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{L} \alpha_i y_i x_i,    (3.7)

can be substituted into the primal form to get the dual formulation of the problem,

L_D = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j.    (3.8)
At this point one finds the parameters α_i by maximizing (3.8) with respect to these same α_i, subject to the constraints \sum_{i=1}^{L} \alpha_i y_i = 0 and α_i ≥ 0, i = 1, . . . , L. The cited task actually consists of a quadratic programming problem (quadratic objective function with linear constraints). The solution of the optimization problem allows the final SVM decision function to be formulated as

f(x) = \sum_{i=1}^{L} y_i \alpha_i \, x \cdot x_i + b.    (3.9)
The predicted class label (+1 or −1) is simply assigned following the sign of (3.9) when dealing with a binary classification task. If the input vectors belong to more than 2 classes, the solution consists of combining several binary classifiers with either a one-vs-all or a one-vs-one approach. This multi-class extension of SVMs is accurately described in [34].
A comprehensive and clear description of the optimization problem and its resolution, summarized in this section, can be found in [34] and [9].
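As an illustration of these quantities, the following scikit-learn sketch trains a linear SVM on synthetic, nearly separable data and inspects the support vectors, the signed weights y_i α_i of equation (3.9) and the bias b (an off-the-shelf solver, not the implementation used in the thesis):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, almost linearly separable binary problem with labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

print(clf.support_vectors_.shape)   # only a few training points are SVs
print(clf.dual_coef_)               # the signed weights y_i * alpha_i of eq. (3.9)
print(clf.intercept_)               # the bias term b
# The sign of the decision function gives the predicted class label
print(np.sign(clf.decision_function(X[:3])))
```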
3.1.3 Support Vectors and their relevance 
The main outputs of the training procedure of the SVM are the values α_i. Looking at equation (3.9), one can see that these coefficients are the weights given to each training vector x_i. However, only a small proportion of them receive a non-zero α_i. Thus, only a subset of the initial training set is truly influential in the evaluation of the decision function.
These informative points are called support vectors and, conforming to the situation depicted in figure 3.1, they lie on the margin (positive or negative side according to their label y_i). For the support vectors, the inequality (3.2) turns into y_i(w · x_i + b) = 1. Given that such a subset is the only fraction of the data that participates in the prediction, the same result would be achieved if all the other points were withdrawn from the training set before training the system.
3.1.4 Soft margin adaptation 
In subsection 3.1.1, figure 3.1 shows a linearly separable situation where the two classes do not overlap: the training examples are described by inputs that can be partitioned by a hyperplane. Clearly this is an ideal situation one will rarely be dealing with. In reality, data are usually noisy, so that it is impossible to avoid training errors when drawing a separating line.
These considerations lead to a slightly different formulation of the large margin classifier, the soft margin classifier. The "hard" margins presented with (3.2) on page 13 are "softened" by means of the slack variables ξ_i. The intuition consists of letting noisy training samples (lying outside the class level curve +1 or −1) fulfill the requirements as

y_i (w \cdot x_i + b) \ge 1 - \xi_i.    (3.10)

In this way, positive (negative) vectors can be associated with a decision function value which does not have to be strictly larger (smaller) than +1 (−1). For example, a sample lying on the wrong side of the decision boundary w · x + b = 0 will be given a ξ_i > 1 so that it will then be treated as a coherent class member. Figure 3.3 shows for which inputs the slack variables have to be introduced.
Figure 3.3: Slack variables ξ_i are assigned to noisy samples lying outside their class margin. Following [19].
In order to keep a low empirical error (2.5) one should, of course, force the algorithm to assign non-zero ξ_i values to as few of the training samples as possible. Therefore, in the optimization process, the first term of the initial functional (3.5) that has to be minimized is substituted by

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i.    (3.11)

The left term in (3.11) is the one the procedure had to minimize for finding the largest possible hard margin. The added right term, which also has to be minimized, accounts for the number and severity of misclassification errors in the training set.
The weighting constant C (cost) allows the user to control this kind of error during the training phase and conveys the confidence the user has in the data. With a large value of C, implying the belief that the dataset is not noisy, every misclassified example is heavily penalized, leading to a very small training error. The drawback is that such a great importance conferred to the training data will give rise to the overfitting phenomenon, due to the complexity of the applied model. From this point of view, the inverse of C can then be interpreted as a regularization constant. Furthermore, the parameter C turns out to be, in the quadratic programming problem, the upper bound for the α_i, so that 0 ≤ α_i ≤ C, ∀i.
So, the minimization of the first term of (3.11) corresponds to lowering the upper bound on the VC-dimension (see equation (3.4)), which controls the confidence interval described in equation (2.6) on page 9. The second term of this functional mainly controls the empirical error which appears in (2.6). Finally, both terms contribute to keeping the expected risk low, since the second term of (3.11) also favors the use of a simple model (small h) if one chooses a low value of C.
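A brief scikit-learn sketch of this trade-off on synthetic noisy data (values purely illustrative): a small C tolerates more margin violations and retains more support vectors, while a large C drives the training error down at the risk of overfitting.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy two-class data with overlapping clusters and labels in {-1, +1}
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.5, size=(150, 2)), rng.normal(1, 1.5, size=(150, 2))])
y = np.array([-1] * 150 + [1] * 150)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    train_err = np.mean(clf.predict(X) != y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} SVs, "
          f"training error = {train_err:.2f}")
```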
3.2 Kernel expansion 
3.2.1 The principle 
Up to this point, we have seen how a linear decision function can be optimally applied to classify examples with binary labels. When dealing with challenging data sets where the input/output relationships are non-linear, we need a cleverer way to discriminate the two classes.
The key idea is to map the dataset into a space of higher dimension and then perform the well-known linear separation on the transformed data, rather than applying complex decision functions directly to the initial data set. This is possible since we have seen in equation (3.9) on page 15 that, for the linear case, the decision about the class membership of a new sample x depends only on a dot product between this input vector and all the training samples x_i. Thus, the intuition, called the kernel trick, is to substitute the dot product with a kernel function K(·, ·) involving the same two vectors, so that the final decision function changes to

f(x) = \sum_{i=1}^{L} y_i \alpha_i K(x, x_i) + b.    (3.12)

This is the final formulation of the decision function for a classification task carried out with SVMs.
The function K(·, ·), for simplicity referred to as the kernel, carries out the mapping to the higher-dimensional space, not directly by generating the longer coordinate vector out of the two samples, but in an implicit way. The result of the dot product involving the mapped vectors, φ(x) and φ(x_i), is equal to the output of the kernel computed with the low-dimensional vectors as inputs:

x \cdot x_i \;\mapsto\; \varphi(x) \cdot \varphi(x_i) = K(x, x_i).    (3.13)

Using the machine learning vocabulary, we refer to the original space as the input space, whilst we name the kernel-induced one the feature space.
3.2.2 A concrete example 
As a demonstration, one can compute the polynomial kernel of degree 2, defined as
K(x, xi) = (x · xi + 1)², for a pair of inputs belonging to R², u = (u1, u2) and v = (v1, v2).
One finds out that, as shown by equalities (3.14), the application of such a kernel on u
and v results in the same sum of terms that one would have obtained from a simple dot product
between two high-dimensional mappings of the original vectors. The mapping that we refer
to is the following:

φ(u) : (u1, u2) ↦ (u1², u2², √2 u1u2, √2 u1, √2 u2, 1),

resulting in a feature space of 6 dimensions, i.e. φ(u) ∈ R⁶.

K(u, v) = (u · v + 1)²    (3.14)
        = u1²v1² + u2²v2² + 2u1v1u2v2 + 2u1v1 + 2u2v2 + 1
        = (u1², u2², √2 u1u2, √2 u1, √2 u2, 1) · (v1², v2², √2 v1v2, √2 v1, √2 v2, 1)
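The identity (3.14) can also be verified numerically. The short Python sketch below (illustrative only) compares the degree-2 polynomial kernel with the explicit 6-dimensional mapping φ for two arbitrary vectors.

import numpy as np

def poly2_kernel(u, v):
    """Degree-2 polynomial kernel (u . v + 1)^2."""
    return (np.dot(u, v) + 1.0) ** 2

def phi(u):
    """Explicit 6-dimensional mapping associated with the degree-2 polynomial kernel."""
    u1, u2 = u
    return np.array([u1**2, u2**2, np.sqrt(2)*u1*u2, np.sqrt(2)*u1, np.sqrt(2)*u2, 1.0])

u, v = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
# Both expressions give the same value (42.25 for this pair of vectors).
assert np.isclose(poly2_kernel(u, v), np.dot(phi(u), phi(v)))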
3.2.3 Valid kernel functions 
The rapid developments in the field of kernel methods have brought a wide range of different 
kernel functions that can be successfully applied. However, it is important to recall that not 
every function involving two vectors constitutes a kernel. In fact, valid kernels have to fulfill 
the so-called Mercer’s conditions (see [9]). 
In a few words, these constraints must be met for a selected function K(x, xi) to act as 
a kernel associated with the desired feature space (output equal to the dot product of the 
mapped vectors). Strictly speaking, this means that the kernel matrix K = (K(xi, xj))_{i,j=1,...,n}
has to be symmetric and positive semidefinite (i.e. possess non-negative eigenvalues). Matrix K,
also known as the Gram matrix, has as elements the outputs of the kernel function for every 
pair of input vectors (xi, xj). 
Additionally, user-defined kernel functions can be created by multiplying or adding valid
kernels, since the resulting functions also respect Mercer’s conditions. If K1(·, ·) and K2(·, ·)
are kernels, then
• aK1(·, ·) + bK2(·, ·), for a, b ≥ 0
• K1(·, ·) K2(·, ·)
are valid kernels as well (proof available in [35]). These properties allow us to construct some 
composite kernels which will then be useful to improve classification performance (see [4]). 
Here we present a list of the most frequently used applicable kernel functions: 
• Linear kernel: 
K(x, xi) = x · xi (3.15) 
• Polynomial kernel: 
K(x, xi) = (x · xi + 1)p , p ∈ N (3.16) 
• Gaussian RBF kernel: 
K(x, xi) = e− 
(x−xi)2 
22 ,  ∈ R+ (3.17) 
The first item, the linear kernel, corresponds to the situation where the kernel trick has 
not been applied, while the second one illustrates the general form (degree p as option) of
the polynomial kernel brought into play in subsection 3.2.2. The last kernel mentioned, the 
Gaussian Radial Basis Function kernel, will be discussed in more detail in the next subsection. 
It is interesting to point out that the choice of the kernel type also allows the user to 
control the complexity of the model (bound on risk (2.6) presented on page 9) since the VC-dimension 
h also varies according to the feature space into which the inputs are mapped. In 
fact, the linear separation performed by the SVM algorithm is executed in the N-dimensional 
feature space resulting in a value of h = N + 1 (see subsection 2.2.2). 
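For reference, the kernels (3.15)–(3.17) can be implemented as follows; the last lines sketch a numerical check of Mercer’s conditions (symmetry and non-negative eigenvalues of the Gram matrix) on a random sample. This is an illustrative Python/NumPy sketch, not the code used in this work.

import numpy as np

def linear_kernel(x, xi):
    return np.dot(x, xi)

def polynomial_kernel(x, xi, p=2):
    return (np.dot(x, xi) + 1.0) ** p

def gaussian_rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

# Mercer check on a random sample: the Gram matrix must be symmetric and
# positive semidefinite (non-negative eigenvalues, up to numerical round-off).
X = np.random.RandomState(1).randn(20, 3)
K = np.array([[gaussian_rbf_kernel(a, b) for b in X] for a in X])
assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) > -1e-10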
3.2.4 Details on the Gaussian RBF kernel 
Among the available kernel functions, a user’s choice often falls on the well-known Gaussian 
RBF kernel due to the simple geometrical interpretation it offers. As one can see from 
formula (3.17), the numerator of the argument of the exponential function is nothing but a 
dissimilarity measure between vector x and vector xi. In fact, (x − xi)² = ‖x − xi‖² is the
squared Euclidean distance between the examples, computed in the input space.
By taking the exponential of its negative value, one assigns a large value only to close
samples. One observes an exponential decrease starting from a maximum of 1, the output
value obtained when evaluating two identical vectors. Since the outputs K(·, ·) appear in (3.12),
they weight the training samples xi in the sum over all the L labeled instances.
The labels yi (values of +1 or −1) associated with the inputs will then have different influences
on the final decision function yielding a class membership for the new data point x.
Moreover, the parameter σ, appearing in the denominator, controls the bandwidth of the
Gaussian surface centered on vector x, the object of the prediction. Figure 3.4 shows how
the weights vary according to the kernel width σ, illustrating the smoothing effect of a large
value. In fact, a small bandwidth lets only the training vectors xi close to x in the input
space contribute significantly to the final decision function.
Figure 3.4: The Gaussian RBF kernel function K(x, xi) with x = (0, 0) and xi = (xi,1, xi,2)
for a varying xi is plotted for bandwidths σ = 0.5 and σ = 1.
A peculiarity of this kernel is that, contrary to the other two presented here, the similarity
between the input vector x and the training inputs xi is measured as a Euclidean distance
and not in terms of an angle in the input space. The latter is the case when one evaluates
the dot product (linear or polynomial kernels), which can geometrically be interpreted through
the cosine of the angle between the two vectors.
3.3 Parameters tuning 
As stated in subsection 2.2.1, the parameters (also referred to as hyper-parameters) defining
a good model have to be chosen by assessing the quality of the predictions on an independent
data set.
Often, when approaching such a task, a cross-validation approach is chosen. This procedure,
named leave-k-out cross-validation, consists in training the model on all the points of the
training set except for a subset formed by k randomly chosen vectors. A prediction of the
output is then carried out for these points, allowing, in the case of classification, a comparison
with the true labels. The procedure is repeated until each training vector has been
provisionally removed from the main set (which is partitioned n/k times).
However, this procedure is computationally intensive when working with SVMs. In fact, the
classifier has to be retrained each time a new subset of points is left out. This makes the
approach poorly suited to large data sets.
Consequently, as pointed out in section 2.3, the initial training set may be split into two parts
so that a validation set can be used to evaluate the predictions based on the learned
input/output relationships. A popular validation set/training set partition is 25%/75% of
the original training data.
The hyper-parameters of an SVM model that have to be tuned are the cost C and the kernel
parameters (σ for the Gaussian RBF kernel, p for the polynomial kernel, etc.). Because no direct
analytic function links the variations of these parameters to the changes in the chosen
performance measure, a grid search approach must be adopted. In the case of the Gaussian RBF
kernel, this means that a measure like the classification error (the range of available performance
measures is described in section 3.4) is computed for a set of different values of C and
σ spanning a user-defined space. We then look for the values optimizing the classification
performance (lowest validation error, highest accuracy, etc.). A minimal validation error
should correspond to a low percentage of SVs. Indeed, too many SVs identified after the
training procedure is a warning sign of overfitting caused by a too complex model
(for example, a too small bandwidth σ).
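The grid search described above can be sketched as follows (Python with scikit-learn and hypothetical arrays X and y; not the actual code of this thesis). The training data are split 75%/25%, an RBF SVM is trained for each (C, σ) pair and the pair yielding the lowest validation error is retained; note that scikit-learn parameterizes the RBF kernel with gamma = 1/(2σ²).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical feature matrix and binary labels (+1/-1).
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = np.where(rng.randn(300) > 0, 1, -1)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

best = None
for C in [0.1, 1, 10, 100]:
    for sigma in [0.1, 0.5, 1, 2, 5]:
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X_tr, y_tr)
        val_error = 1.0 - clf.score(X_val, y_val)     # validation misclassification rate
        sv_fraction = len(clf.support_) / len(X_tr)   # many SVs may indicate overfitting
        if best is None or val_error < best[0]:
            best = (val_error, C, sigma, sv_fraction)

print("lowest validation error %.3f at C=%g, sigma=%g (SV fraction %.2f)" % best)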
In some particular cases where class counts are unbalanced (usually many more negative 
examples than positive ones), it is possible that the SVM decision function threshold f(x) = 0 
(see subsection 3.1.2) is not the optimal one. In such situations, a threshold tuning can be 
carried out as well. This results in a significant improvement of the classifier performance, 
in terms of the selected performance measure score. However, if satisfactory results are 
not obtained in this manner, an additional effort is required in order to deal with such a 
nonstandard situation. A well-suited procedure to apply in these cases is presented in [22]: 
the authors propose a modification of the cost function of the SVM.
                                 Predicted (Forecast)
                                 Class +1 (Yes)   Class −1 (No)       Row totals
Actual       Class +1 (Yes)      hits             misses              observed yes
(Observed)   Class −1 (No)       false alarms     correct negatives   observed no
             Column totals       forecast yes     forecast no         total

Table 3.1: Confusion matrix for binary predictions related to the forecasting of an event.
3.4 Binary classification quality measures 
In assessing the classification performance of a supervised learning model, particularly when 
dealing with a binary classification task, a broad range of categorical statistics is available 
(see [10]). This section will describe the main measures currently used for the case where 
model predictions are related to the forecasting of an event (occurrence vs. non-occurrence). 
Prediction results may be organized into the 2-by-2 confusion matrix illustrated by table 3.1.
Given that an event may either be observed or not, and then either forecast or not by 
the model, 4 possible situations can be encountered: the observed event can be correctly 
forecast (hit or true positive) or not detected (miss or false negative), while a non-event can 
be incorrectly forecast (false alarm or false positive) or correctly not notified (correct negative 
or true negative). 
The following ratios are then often used: 
True Positive rate (hit rate) = hits / (hits + misses)    (3.18)

False Positive rate (false alarm rate) = false alarms / (false alarms + correct negatives)    (3.19)
As overall model success measures we can find: 
Overall Accuracy = (hits + correct negatives) / total    (3.20)

Hanssen and Kuipers discriminant = TP rate − FP rate    (3.21)

Heidke Skill Score = (hits + correct negatives − exp. correct) / (total − exp. correct)    (3.22)

Bias = (hits + false alarms) / (hits + misses) = forecast yes / observed yes,    (3.23)
where “exp. correct” is the expected number of correct forecasts due to random chance.
This value is computed, under the assumption of independence between actual and predicted
classes, from the marginal totals as

exp. correct = (forecast yes · observed yes + forecast no · observed no) / total    (3.24)
The first measure, Overall Accuracy (OA, range: 0 → 1, perfect score: 1), reports the
fraction of correct predictions over the total number of points, suggesting whether the model’s
overall performance is reliable. It becomes a misleading statistic if correct negatives (many
non-events) are predominant, since classifying every instance into class −1 already leads to
good scores.
The Hanssen and Kuipers discriminant (HK, range: −1 → 1, perfect score: 1) subtracts the
false alarm rate from the hit rate, indicating the capacity of the forecasting system to
discriminate between events and non-events. When non-events are the norm, this measure is
very suitable because the number of false alarms has less influence on the assessment; a higher
importance is instead given to missed events (which appear in the denominator of (3.18)).
This is particularly relevant when the two types of errors have different costs (e.g. avalanche
forecasting), where false alarms are usually less damaging than misses.
The Heidke Skill Score (HSS, range: −∞ → 1, perfect score: 1) measures the fraction of
correct predictions after discounting those forecasts which would be correct due purely to
random chance.
The last measure, the bias (range: 0 → ∞, perfect score: 1), does not indicate classification
success as such, but informs about over- or under-forecasting, with values tending to ∞ and
0, respectively.
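The statistics (3.18)–(3.24) are straightforward to compute from the four cells of the confusion matrix, as in the following illustrative Python sketch:

def binary_scores(hits, misses, false_alarms, correct_negatives):
    """Compute the measures (3.18)-(3.24) from the confusion matrix cells."""
    total = hits + misses + false_alarms + correct_negatives
    observed_yes, observed_no = hits + misses, false_alarms + correct_negatives
    forecast_yes, forecast_no = hits + false_alarms, misses + correct_negatives

    tp_rate = hits / observed_yes                       # (3.18) hit rate
    fp_rate = false_alarms / observed_no                # (3.19) false alarm rate
    oa = (hits + correct_negatives) / total             # (3.20) Overall Accuracy
    hk = tp_rate - fp_rate                              # (3.21) Hanssen and Kuipers
    exp_correct = (forecast_yes * observed_yes +
                   forecast_no * observed_no) / total   # (3.24)
    hss = (hits + correct_negatives - exp_correct) / (total - exp_correct)  # (3.22)
    bias = forecast_yes / observed_yes                  # (3.23)
    return dict(TP_rate=tp_rate, FP_rate=fp_rate, OA=oa, HK=hk, HSS=hss, Bias=bias)

# Example: 30 hits, 10 misses, 50 false alarms, 910 correct negatives.
print(binary_scores(30, 10, 50, 910))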
When the selected classifier makes class membership decisions depending on scores that 
can be interpreted as the degree to which an example is reasonably a class member (SVMs, 
neural networks, etc.), some interesting graphs involving the cited performance measures can 
additionally be plotted. In fact, the binary classification is executed according to a defined
threshold t, resulting in a positive class label if the score is above the threshold (f(x) > t),
or in a negative one if the value is below it (f(x) < t).
The first insight is to graphically see how the model success measure changes when the
class boundary varies, usually by plotting the curve constructed from the points (t, measure).
Moreover, as thoroughly detailed in [12], a Receiver Operating Characteristics (ROC) curve
can be built. Such a plot is a 2-dimensional graph with the FP rate on the horizontal axis and
the TP rate on the vertical axis. It efficiently represents the tradeoff between the costs and benefits
of the actual classification. In these terms, if we compute the two mentioned rates for the 
classifications obtained with thresholds varying from their minimal to maximal values, we 
will be able to plot a point (FP rate,TP rate) associated with each selected threshold. 
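This construction can be sketched as follows (illustrative Python; the decision values and labels are hypothetical): the threshold t is swept over the observed scores and one (FP rate, TP rate) point is collected per threshold.

import numpy as np

def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs for thresholds swept over the decision values."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= t
        tp = np.sum(predicted_pos & (labels == 1))
        fp = np.sum(predicted_pos & (labels == -1))
        points.append((fp / np.sum(labels == -1), tp / np.sum(labels == 1)))
    return points

# scores: SVM decision values f(x); labels: true classes (+1/-1), both hypothetical.
rng = np.random.RandomState(0)
labels = np.where(rng.rand(200) < 0.3, 1, -1)
scores = labels * 0.8 + rng.randn(200)   # noisy scores correlated with the labels
print(roc_points(scores, labels)[:5])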
Figure 3.5 shows 2 possible curves in the ROC space. The curve labeled “B” is associated 
with a much better performing model compared to the model that produced curve “A”. The 
reason is that, no matter which threshold is retained at meaningful FP rates, the resulting 
curve is more to the “northwest”, meaning that classifier “B” is producing higher TP rates 
combined with lower FP rates than model “A”. As a matter of comparison, the line joining 
points (0, 0) and (1, 1) corresponds to the strategy of randomly guessing a class label for 
every given instance to classify (if one tries to get more hits by forecasting more positive 
labels it also increases the number of false alarms). 
When looking for the best possible classification, a point in the ROC space, we might take 
Hanssen and Kuipers discriminant as a model success measure since this statistic is nothing
Figure 3.5: ROC curves associated to 2 different classifiers (“A” and “B”). After [12]. 
but the difference between the vertical and horizontal axis coordinates, yielding the highest 
value for point (0, 1). However, when comparing two systems in an overall sense, the “area 
under the curve” measure is a better indicator of the average performance of the classifier 
over all possible threshold choices (see [12]).
Chapter 4

Extensions of the SVMs-based approach
4.1 Feature selection 
Feature selection methods provide the classifier with a smaller subset of variables created 
out of the initial set so that it can work in a lower dimensional input space with only the 
relevant features. This often causes an improvement of the classification accuracy since noisy 
and redundant features are filtered out. Moreover, the application of this kind of algorithm 
provides the analyst with meaningful information about the real influence or utility of each 
input feature used in the classification problem. 
4.1.1 Methods overview 
Many methods have been proposed to select the best features or to reduce the input space
dimensionality. They have been reviewed in [15]. The techniques can be divided into categories,
according to the manner in which they deal with the variables. 
Methods such as Principal Component Analysis linearly combine the original features 
to create new ones. The result is a set of uncorrelated orthogonal variables carrying a 
decreasing amount of information (variance). The user may then select only the largest 
variance components for the classification task, which aids in avoiding overfitting. However, 
no individual feature or features can be ignored since they are all included in the creation of 
the new set. 
The second category contains techniques that consider each initial feature independently, 
without caring about the mutual information between them. Feature ranking with correlation 
coefficients, a simple method described in [16], belongs to this kind of approach. 
Finally one finds the best performing methods, which take into account simultaneously 
all the input variables during the ranking/selection process. This simultaneous consideration 
of input variables results in a selection that is much more appropriate when the chosen 
classifier is a “multivariate” one (SVMs, Fisher’s linear discriminant, etc.). One such method 
is Recursive Feature Elimination, which is explored in more detail in the next subsection. 
4.1.2 SVM-Recursive Feature Elimination 
In [16] Guyon et al. discuss the use of feature ranking coefficients (provided by each discussed
method) as weights in the linear decision function f(x) = w · x + b, where w is the vector of
feature weights, x is the input and b is a bias value. The inverse reasoning can be applied
as well: the weights multiplying the related inputs can be used as coefficients reporting
the relevance of each feature. This consideration is exactly the motivation behind the
Recursive Feature Elimination (RFE) procedure combined with an SVM classifier. The RFE
technique belongs to the broader category of methods named wrappers (which select the best
features according to an assessment criterion related to the classifier), as opposed to those
named filters (which select the best features according to a criterion independent of the
classifier).
The RFE algorithm details differ when using a linear SVM or a non-linear one. In the
following part we first treat the linear case, while the generalization to an SVM classifier
using a kernel expansion is discussed as the last topic of this subsection.
The linearly separable case 
The algorithm for the linear case can be summarized as follows:

Inputs: training samples with known class labels (xi, yi)
repeat until every feature k has been removed
– train the linear SVM and compute the weighting vector w = Σi αi yi xi
  (one component per variable)
– obtain the ranking criterion for each feature k as ck = (wk)²
– find the feature with the lowest c value
– remove that feature’s values from the training data
– update the final ranking list
end repeat
Output: ranked features list (first removed → least relevant)
The interpretation of this procedure is that at every step of the algorithm, after having
trained the SVM, the least influential feature is removed. It is worth remarking that a ranking
list is already obtained after the first iteration by sorting the coefficients ck in decreasing
order. However, the interest of this feature selection method is that, by running the whole RFE
procedure, an optimal subset of complementary features is found, which may not be the most
individually relevant one [16].
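A minimal sketch of the linear SVM-RFE loop is given below (Python with scikit-learn, hypothetical data X and y; an illustration rather than the thesis implementation). For a linear SVM, scikit-learn exposes the weight vector w directly as the attribute coef_.

import numpy as np
from sklearn.svm import SVC

def linear_svm_rfe(X, y, C=1.0):
    """Return feature indices in removal order (first removed -> least relevant)."""
    remaining = list(range(X.shape[1]))
    removal_order = []
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()                 # w = sum_i alpha_i y_i x_i
        weakest = int(np.argmin(w ** 2))      # ranking criterion c_k = (w_k)^2
        removal_order.append(remaining.pop(weakest))
    return removal_order

# Hypothetical data: feature 0 is informative, the others are noise.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.where(X[:, 0] + 0.1 * rng.randn(200) > 0, 1, -1)
print(linear_svm_rfe(X, y))   # feature 0 should be removed last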
Generalization to the non-linear case 
When dealing with a non-linear SVM it is impossible to directly compute the components of 
vector w because the sample xi, included in a simple dot product in equation (3.9), becomes 
here, on the contrary, the input of the kernel function of (3.12). Therefore, the method 
consists of looking for the smallest change in the square of the length of vector w when 
removing feature k. This value, denoted W²(α), is not computed directly as the norm
of w, but as

W²(α) = ‖w‖² = Σ_{i,j} αi αj yi yj K(xi, xj) = αᵀ H α,    (4.1)

where the α’s (forming the column vector α) are the weights of each training point found after
the optimization task, K(xi, xj) is the kernel output (a scalar) reporting the similarity between
the training samples xi and xj, and H is the matrix with elements yi yj K(xi, xj), an
extension of the Gram matrix defined in subsection 3.2.3.
As proposed by Guyon et al., at each iteration, the feature to withdraw according to the
final ranking criterion is selected as

f = argmin_k | W²(α) − W²_(−k)(α) |,    (4.2)

where the notation (−k) denotes that the candidate feature k has not been included in the
computation of (4.1). Since the norm of the weighting vector w defines the margin (see
equation (3.3) on page 13), we select the variable whose removal least changes the distance
between the strict class boundaries f(x) = −1 and f(x) = +1.
For computational convenience, at every iteration, when the variable to remove is selected
from all those available, the α’s are left unchanged and only matrix H is recomputed, with
every candidate feature ignored in turn. Moreover, this matrix is computed using only the
support vectors, since only for these examples is α > 0.
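The ranking criterion (4.1)–(4.2) for the non-linear case can be sketched as follows (illustrative Python/NumPy; alpha_sv, y_sv and X_sv are assumed to be the weights, labels and inputs of the support vectors of a previously trained Gaussian RBF SVM).

import numpy as np

def rbf_kernel_matrix(X, sigma, skip=None):
    """Gram matrix of the Gaussian RBF kernel, optionally ignoring feature `skip`."""
    cols = [j for j in range(X.shape[1]) if j != skip]
    Xs = X[:, cols]
    sq = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def w_squared(alpha_sv, y_sv, X_sv, sigma, skip=None):
    """W^2(alpha) = alpha^T H alpha with H_ij = y_i y_j K(x_i, x_j), eq. (4.1)."""
    H = np.outer(y_sv, y_sv) * rbf_kernel_matrix(X_sv, sigma, skip)
    return alpha_sv @ H @ alpha_sv

def feature_to_remove(alpha_sv, y_sv, X_sv, sigma):
    """Eq. (4.2): the feature whose removal changes W^2(alpha) the least."""
    base = w_squared(alpha_sv, y_sv, X_sv, sigma)
    changes = [abs(base - w_squared(alpha_sv, y_sv, X_sv, sigma, skip=k))
               for k in range(X_sv.shape[1])]
    return int(np.argmin(changes))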
The adjustments needed to extend the binary SVM-RFE presented here to the multi-class
case can be found in [36].
4.2 Probabilistic SVM output interpretation 
4.2.1 Interpretations for decision support 
A good model should provide a decision support system with values that can be interpreted 
in a meaningful way by users, so that appropriate measures may be taken. The classifier 
presented in this chapter is constructed in such a way that the class membership for a new 
instance is chosen according to the values taken by the final decision function (3.12). These 
values can be suitably transformed by post-processing to yield an a posteriori probability
(turning a categorical forecast into a probabilistic one). Such probabilities are interpreted as
the class membership likelihood of a given example.
The method that endows an SVM model with a probabilistic output is presented in detail
by Platt in [27]. The following subsections only review the points of this theory which have
been used in this thesis.
4.2.2 The sigmoid transform 
Applied to such a case, Bayes’ rule allows us to write the posterior probability P(y = 1|f(x)) 
for a sample x to belong to class +1 as 
P(y = 1|f) = p(f|y = 1) P(y = 1) / p(f),    (4.3)

where f is the associated decision function value, p(f) = Σ_{l=−1,+1} p(f|y = l) P(y = l) is its
a priori probability, p(f|y = l) is the class-conditional probability of observing the value f and
P(y = l) is the prior probability of class l. All of these probabilities can be empirically computed
from histogram estimates of the class-conditional densities. This methodology is preferred
to a parametric fit of the latter because the popular Gaussian assumption is often violated.
If a scatterplot of P(y = 1|f) versus f is drawn, one obtains a graphical visualization 
(see figure 4.1) of the positive class membership probabilities conditional to each observed 
SVM output (decision function f). The goal is to fit an analytically described curve to these 
plotted points so that when dealing with a new value f, associated to a new sample, we will 
be able to predict its class +1 likelihood. In particular, it turns out that a sigmoid function 
of the form

P(y = 1|f) = 1 / (1 + exp(Af + B))    (4.4)

is, in most cases, very well suited for modeling such a relationship. A and B are the free
parameters to tune, with A ∈ R− (to ensure monotonicity) and B ∈ R.
Figure 4.1: In this example the plus signs indicating posterior class +1 probabilities are 
extremely well fitted by the tuned sigmoid function. Modified after [27]. 
4.2.3 Parameters tuning 
The method proposed in [27] consists of minimizing the negative log likelihood of the data 
set. For every vector its decision function fi is associated to its actual transformed class label
ti = (yi + 1)/2 (ti = 0 or 1). We thus aim at minimizing

−Σi [ ti log(pi) + (1 − ti) log(1 − pi) ],    (4.5)

where pi = 1 / (1 + exp(Afi + B)).
One can think of this minimization as a procedure that looks for the best function
(defined by parameters A and B) to model the posterior probability of a sample belonging to
its actual class, so that the value pi approaches 0 when it is a negative class point (ti = 0) and
approaches 1 when it is a positive class one (ti = 1). The optimization algorithm proposed
in the cited paper is derived from the well-known Levenberg-Marquardt algorithm.
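A minimal sketch of this fit is given below (illustrative Python using SciPy’s general-purpose minimizer rather than the Levenberg-Marquardt variant of [27]; f_val and y_val stand for hypothetical decision values and labels of a validation set).

import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f_val, y_val):
    """Fit A, B of P(y=1|f) = 1/(1+exp(A f + B)) by minimizing the NLL (4.5)."""
    t = (y_val + 1) / 2.0                      # transformed targets: 0 or 1

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f_val + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)       # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    res = minimize(nll, x0=np.array([-1.0, 0.0]))
    return res.x                               # fitted (A, B)

# Hypothetical validation decision values and labels.
rng = np.random.RandomState(0)
y_val = np.where(rng.rand(300) < 0.5, 1, -1)
f_val = y_val * 1.5 + rng.randn(300)
A, B = fit_platt_sigmoid(f_val, y_val)
print(A, B)   # A should come out negative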
Parameter A controls the slope of the sigmoid whilst B controls its location along the 
horizontal axis. If B is equal to 0, there will be an exact matching between the 0.5 posterior 
probability and the f = 0 decision function threshold (sign of f providing the vector labeling 
decision). 
Furthermore, it is important to note that in order to avoid overfitting, the tuning of the 
parameters should be carried out on a dataset other than the training set used to produce 
the predictions: the validation set. 
4.3 Active Learning with SVMs 
4.3.1 Principles 
In the domain of supervised learning, the interest in active learning techniques is due
to their ability to provide the classifier with a good, informative subset of training examples.
The goal is to let the machine learn the input/output relationships leading to a satisfactory
classification performance from the smallest possible number of labeled input vectors.
Initially, the classifier disposes of a Labeled set of training samples (xi, yi) referred to as DL.
At the same time, an Unlabeled set of examples (candidate samples ˜x whose class membership 
y is ignored) DUL is available. At each step of the active learning algorithm, the learning 
machine should then be able to select from DUL, without knowing the associated class labels, 
the data point ˆx whose addition to DL, after the identification of the true class label ˆy, will 
lead to the most significant improvement of classification performance (after retraining on 
DL ∪ (ˆx, ˆy)). 
Examples of applications of such data-driven techniques in the domain of environmental 
sciences can be found for the optimization of a monitoring network (soil pollution, radioactivity,
etc.) [20], for the reduction of effort in the collection of ground truth data in remote
sensing [37], etc. 
4.3.2 Overview of the existing techniques 
The active learning field has been developing rapidly in recent years, and new methods are
regularly proposed within the scientific community. In this section, the
main SVMs based algorithms are presented. They differ essentially in their sample selection 
criteria. 
The first approach, described in [26], is quite intuitive and consists in looking for the 
unlabeled examples belonging to DUL located in the proximity of the hyperplane separating 
the classes. The value of the SVM decision function f for every candidate x̃ is computed
using the current training set DL and then the one with the lowest value of |f(x)| or f(x)²,
denoted x̂, is added to DL. This vector is in fact very likely to become a Support Vector,
thus affecting the classification procedure. 
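This selection rule can be sketched as follows (illustrative Python with scikit-learn; the labeled set DL and the unlabeled set DUL are hypothetical).

import numpy as np
from sklearn.svm import SVC

def select_closest_to_boundary(clf, X_unlabeled):
    """Pick the unlabeled sample with the smallest |f(x)| (closest to the hyperplane)."""
    f = clf.decision_function(X_unlabeled)
    return int(np.argmin(np.abs(f)))

# Hypothetical sets D_L (labeled) and D_UL (unlabeled).
rng = np.random.RandomState(0)
X_L = rng.randn(100, 3)
y_L = np.where(X_L[:, 0] > 0, 1, -1)
X_UL = rng.randn(500, 3)

clf = SVC(kernel="rbf", C=10, gamma=0.5).fit(X_L, y_L)
idx = select_closest_to_boundary(clf, X_UL)
print("query sample index:", idx)   # this point would be labeled and added to D_L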
Entropy-based query by bagging, a method first proposed in [38] then comprehensively 
discussed in [37], takes advantage of the notion of entropy to search for the vectors of DUL 
whose class membership prediction is the most uncertain (i.e. located closest to the decision
boundary, where f(x) tends to 0). Such samples are very informative and will contribute
significantly to the setup of the SVM model. Several SVMs are trained on subsets obtained
by bootstrapping from DL and class labels for every candidate ˜x are predicted. The best 
candidate, ˆx, is selected as the one with the highest entropy computed for the resulting class 
membership probabilities. Other similar methods utilize entropy based indicators, such as 
the one proposed by Rajan et al. in [31] making use of the Kullback-Leibler divergence, to 
select the most valuable examples to include in the model. 
The last technique briefly illustrated here is thoroughly described in [19]. Kanevski et al. 
suggest the use of the following algorithm. Successively assign to the candidates ˜x the +1 
and −1 class labels y and independently add them to DL. The SVM is trained on the newly 
created set and the weights α+ and α−, received by the candidate for either labeling, are
stored. At this point, a sample importance measure is computed for each candidate xi as

0,                      if (αi+ = 0, αi− = C) or (αi+ = C, αi− = 0)
(αi+ + αi−) / (2C),     otherwise.    (4.6)
One can interpret the first case, where one of the two weights is null and the other is
equal to C, as the situation where not only does the example lie far from the margin region
−1 < f(x) < +1 but it also lies on the wrong side of the decision boundary (a misclassified,
atypical example) for one of the labelings. Hence, such vectors are not points of interest. In the
remaining cases we assign a relevance which is a C-scaled mean value of the α’s. This indicator
reports the actual average influence of a given sample in the weighted sum defining the SVM
output f(x).
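The importance measure (4.6) can be sketched as follows (illustrative Python; alpha_plus and alpha_minus stand for the weights obtained for a candidate after the two provisional labelings described above).

def candidate_importance(alpha_plus, alpha_minus, C, tol=1e-9):
    """Importance of a candidate sample following eq. (4.6)."""
    at_bound = lambda a, b: abs(a) < tol and abs(b - C) < tol
    if at_bound(alpha_plus, alpha_minus) or at_bound(alpha_minus, alpha_plus):
        return 0.0                 # far from the margin and misclassified for one labeling
    return (alpha_plus + alpha_minus) / (2.0 * C)

print(candidate_importance(0.0, 10.0, C=10.0))   # uninteresting candidate -> 0.0
print(candidate_importance(3.0, 7.0, C=10.0))    # informative candidate  -> 0.5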
Chapter 5

Avalanche forecasting as a spatio-temporal classification problem

5.1 Avalanche data from Scotland: the Lochaber region case study
The case study at the core of this thesis, illustrated in the next chapters, deals with
the integration of spatial information into a temporal avalanche forecasting model based on 
SVMs (see [29]) for the Lochaber region, Scotland (map in figure 5.1). Therefore, the goal is 
to produce spatially varying avalanche forecasts at the level of single avalanche paths existing 
in this renowned mountaineering area. The latter is one of the 5 ski venues in Scotland for 
which forecasting is carried out on a daily basis during the winter season. The region includes 
Ben Nevis, the highest mountain in the UK, with a summit at 1344 m above sea level.
Additional detailed information about the avalanche forecasting in Scotland can be found on 
the official website of the sportScotland Avalanche Information Service, www.sais.gov.uk. 
In the considered area, a nearest neighbor model called Cornice is used operationally 
to assist in forecasting the avalanche days [30]. This system has been briefly presented in 
subsection 1.2.2, where the prior work carried out on the Lochaber dataset has been reviewed. 
Current weather and snowpack conditions are described by a set of 9 variables that are 
measured or estimated by local forecasters on the slopes or that are registered by an automatic 
weather station (AWS). The list of the available variables is presented in table 5.1, along with 
the class/category each variable falls under. As described in [24], the 3
classes of factors influencing an avalanche release in order of decreasing influence are: Class 
I - stability factors, Class II - snowpack factors, and Class III - meteorological factors. 
None of these variables belong to the group of factors that is the most related to avalanches, 
Class I. Stability factors are measured in some cases via stability tests, ski-triggering, etc. 
by the avalanche experts when executing snow profiles at the pit site. However, it is
Figure 5.1: Northern part of UK: the Lochaber region is shown labeled with a red balloon. 
Variable            Class   Description
Snow index          III     Ordinal index of the precipitation as fresh snow on a day, assessed by the forecasters in the field
Rain at 900 m       III     Binary variable indicating if rain is falling at 900 m, the altitude of the AWS (“1” value, “0” otherwise)
Snow drift          II      Binary variable taking the value “1” when experts observe snow drifting during the observation period (“0” otherwise)
Air temp            III     Midday air temperature at the AWS measured in ◦C
Wind speed          III     24 hours vector mean speed from the AWS reported in m/s
Wind direction      III     24 hours vector mean wind direction from the AWS reported in ◦
Cloud cover         III     Cloud cover as a percentage of the sky
Foot penetration    II      Penetration of the foot in the snow measured in cm at the pit site by forecasters
Snow temperature    II      Snow temperature at a depth of 10 cm at the pit site measured in ◦C

Table 5.1: List of the 9 meteorological and snowpack variables recorded daily in the winter season.
difficult to include in the model the information from these tests in terms of variables that
can be automatically compared when building the model. The computation of Euclidean 
distances between all the possible pairs of vectors is not possible if the involved variables are 
non-numerical or if for some days data are missing. 
The pit site where the different snowpack factors are measured is chosen by the forecasters
at a different location every day, based on their experience. Such a testing place,
usually located on the critical slopes of one of the gullies, is assumed to be representative of 
the average conditions that can be found in the entire region. 
The 9 variables have been recorded every day of the winter season (roughly 4 months per 
year) since 1991, as well as the avalanche events observed in the region. Such occurrences 
are documented with a description of the release type (natural or triggered by mountaineers, 
dry or wet snow, cornice triggered, etc.), notes about injuries or specific conditions related to 
the event and spatial information about location (easting and northing), altitude, slope and
aspect. Only the location coordinates were known for every case, so in order to uniformly 
characterize the events, we had to resort to a Digital Elevation Model (DEM) with 10 meters 
resolution to obtain the complete set of spatial inputs needed (elevation, slope, aspect). This 
procedure was also adopted because of typing errors and some subjective, imprecise judgments
in the records.
The hillshade showing the relief is derived from the elevation grid and is presented in figure
5.2 along with the locations of the recorded avalanche events falling on the DEM surface. Data
from the 1991–2007 period were available for this study: information about 688 avalanche
events that occurred in 47 different avalanche paths was used. The subset of avalanche paths
located in the area covered by the DEM grid (40 gullies) is reported in figure 5.3.
Figure 5.2: Locations of the 593 documented avalanche cases that occurred in the DEM-covered area.
Out of the 593 avalanche events falling on the DEM surface, 224 (37.8%) have been 
observed in the Ben Nevis sector (cluster of points in the south-western part of map 5.2), 
347 (58.5%) occurred in the Aonach Mor range (eastern part of the Lochaber region) while 
22 of them (3.7%) took place on the slopes of the range of Carn Morg Dearg summit (center 
of the map). 
It is, however, essential to remark that these events are mainly documented by avalanche
experts of the region during their daily outdoor activity and by climbers or mountaineers
assumed to be reliable witnesses of the release (online recording forms on www.sais.gov.uk).
Therefore, when working with these avalanche reports one has to keep in mind that the list
of events is by no means comprehensive. On bad visibility days, spotting
a release is difficult and since snowfalls are quite often related to such conditions, it is very 
likely that many avalanches have taken place without being observed, either by forecasters 
or by mountaineers.
Figure 5.3: Locations of the 40 gullies (avalanche paths) in the DEM-covered area.
Furthermore, the reporting of avalanches is done much more thoroughly in the Aonach 
Mor range because of the easy accessibility of the slopes. In fact there are several ski runs 
with associated lifts belonging to the Nevis Range resort. 
5.2 Set up of the spatio-temporal classification problem 
As we saw in the introductory section 1.2, the temporal forecasting carried out in the 
Lochaber region is executed by considering days with observed avalanche activity as positive
examples (class +1) and safe days with no observed avalanches as negative samples
(class −1). Thus, the instances being classified are the days of the winter season. 
The described set up has then to be extended to the case where one is interested not 
only in correctly forecasting avalanche days but also in predicting the locations of the events. 
Initially, we assign to the positive class the vectors characterizing (spatial location, weather 
and snowpack conditions) the observed avalanche events. Details about how the features 
were built are presented in the next section 5.3. 
To complete the binary classification problem, a negative class is needed as well. The 
chosen intuitive approach is to let the class with the −1 label be composed of all the 47 gullies
(actually the 40 covered by DEM information) that could give rise to an avalanche release
on a safe day. Therefore, for every day of the winter season when all the variables listed in 
section 5.1 could be measured and the visibility allowed avalanche observations but no event 
was actually documented in the region, we computed all the features describing the local 
conditions at each avalanche path. These spatially variable features were then combined 
with the global ones related to the current safe day, concerning the whole mountain domain.
In this way, a broad list of negative instances was produced to be given to the learning 
machine. The purpose of this was to let the classifier train on a set of critical situations 
which were close to the “safe/event” decision boundary and likely to cross it under slightly 
different weather conditions. 
Finally, this results in a binary classification problem where the vectors to discriminate are 
daily avalanche activities of the dangerous paths (gullies) located in the forecast area. This 
becomes a very unbalanced classification task because, as shown by table 5.2, the negative 
inputs our model will be given considerably outnumber the positive ones (by a factor of 
approximately 68 to 1). 
Class +1: Avalanche events     667
Class −1: Safe gullies         45240 (= 40 gullies · 1131 safe days)

Table 5.2: Dataset positive and negative classes resulting in an unbalanced classification
problem.
5.3 Choice and conception of the input features 
In order to get the desired spatio-temporal forecast, the series of daily measurements of
meteorological conditions related to snowpack stability described in section 5.1 has to be
combined in a sensible way with the spatial description of the terrain morphology available
via the DEM of the region under study. The latter, with its relatively high resolution of 10
meters, provides detailed information about the elevation, slope and aspect of the paths where
the avalanche events could happen. This results in a “spatialized” set of local condition
features whose values change according to the location of the avalanche release point.
Additionally, for some of the temporal variables, information about avalanching conditions
recorded in the previous days was also included (2 preceding days at most, because of the
rapidly changing weather conditions). Therefore, the features created, also taking into
account the advice of the avalanche experts of the Lochaber region, are designed to account
for the relevant factors influencing avalanche activity.
The final input vector comprised 39 features: 22 spatio-temporal features (describing local
conditions at a given gully or at the release zone) and 17 temporal features with global
validity (the same for all gullies). The complete list, with a brief description of the meaning
of each variable, is presented in table 5.3. For a subset of features (names tagged with *),
additional details about how a given feature has been created are provided hereafter.
The first type of variables requiring some further explanation are the ones involving the
sine and cosine transforms. These features take as input either the wind direction or the
aspect. Since these kinds of variables report a direction measured in degrees ranging from 0
to 360, clockwise starting from north, it is clear that they cannot be compared directly by
a subtraction when looking for dissimilarities (e.g. with the Gaussian RBF kernel). For example,
two slopes with very low and very large values will both be north-facing slopes. We get around
this peculiarity by taking a sine transform which will project the directions on the “horizontal”
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci
MasterThesisFinal_09_01_2009_GionaMatasci

More Related Content

Similar to MasterThesisFinal_09_01_2009_GionaMatasci

Innovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based PersonInnovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based Person
Austin Jensen
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
Pedro Ernesto Alonso
 
Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...
butest
 
Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]
Joachim Nkendeys
 
CGuerreroReport_IRPI
CGuerreroReport_IRPICGuerreroReport_IRPI
Master In Information And Communication Technology.doc
Master In Information And Communication Technology.docMaster In Information And Communication Technology.doc
Master In Information And Communication Technology.doc
Dịch vụ viết đề tài trọn gói 0934.573.149
 
MSc_thesis_OlegZero
MSc_thesis_OlegZeroMSc_thesis_OlegZero
MSc_thesis_OlegZero
Oleg Żero
 
SeanLawlor_Masters_Thesis
SeanLawlor_Masters_ThesisSeanLawlor_Masters_Thesis
SeanLawlor_Masters_Thesis
snowboardfreak63
 
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping systemEllum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Cameron Ellum
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
Lars Brusletto
 
eur22904en.pdf
eur22904en.pdfeur22904en.pdf
eur22904en.pdf
Carina Lifschitz
 
Marshall-MScThesis-2001
Marshall-MScThesis-2001Marshall-MScThesis-2001
Marshall-MScThesis-2001
Joshua Marshall
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
Sumon Sdb
 
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODSNOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
Canh Le
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
HamdaAnees
 
DISS2013
DISS2013DISS2013
Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015
Shahed Anwar
 
outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
ssusere02009
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
Felipe Diniz
 
PhD dissertation
PhD dissertationPhD dissertation
PhD dissertation
Alexandre Colmant
 

Similar to MasterThesisFinal_09_01_2009_GionaMatasci (20)

Innovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based PersonInnovative Payloads for Small Unmanned Aerial System-Based Person
Innovative Payloads for Small Unmanned Aerial System-Based Person
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...Flexible and efficient Gaussian process models for machine ...
Flexible and efficient Gaussian process models for machine ...
 
Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]Spatial_Data_Analysis_with_open_source_softwares[1]
Spatial_Data_Analysis_with_open_source_softwares[1]
 
CGuerreroReport_IRPI
CGuerreroReport_IRPICGuerreroReport_IRPI
CGuerreroReport_IRPI
 
Master In Information And Communication Technology.doc
Master In Information And Communication Technology.docMaster In Information And Communication Technology.doc
Master In Information And Communication Technology.doc
 
MSc_thesis_OlegZero
MSc_thesis_OlegZeroMSc_thesis_OlegZero
MSc_thesis_OlegZero
 
SeanLawlor_Masters_Thesis
SeanLawlor_Masters_ThesisSeanLawlor_Masters_Thesis
SeanLawlor_Masters_Thesis
 
Ellum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping systemEllum, C.M. (2001). The development of a backpack mobile mapping system
Ellum, C.M. (2001). The development of a backpack mobile mapping system
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
 
eur22904en.pdf
eur22904en.pdfeur22904en.pdf
eur22904en.pdf
 
Marshall-MScThesis-2001
Marshall-MScThesis-2001Marshall-MScThesis-2001
Marshall-MScThesis-2001
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
 
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODSNOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
NOVEL NUMERICAL PROCEDURES FOR LIMIT ANALYSIS OF STRUCTURES: MESH-FREE METHODS
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
DISS2013
DISS2013DISS2013
DISS2013
 
Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015Anwar_Shahed_MSc_2015
Anwar_Shahed_MSc_2015
 
outiar.pdf
outiar.pdfoutiar.pdf
outiar.pdf
 
Composition of Semantic Geo Services
Composition of Semantic Geo ServicesComposition of Semantic Geo Services
Composition of Semantic Geo Services
 
PhD dissertation
PhD dissertationPhD dissertation
PhD dissertation
 

MasterThesisFinal_09_01_2009_GionaMatasci

  • 1. Faculty of Geosciences and Environment Support Vector Machines for Spatio-Temporal Avalanche Forecasting Giona Matasci Master of Science in Environmental Geosciences Supervisors: Experts: Prof. Mikhail Kanevski Dr. Ross Purves Dr. Alexei Pozdnoukhov Devis Tuia January 2009
  • 2.
  • 3. Title page image: Aonach Mor cornices, source: saislochaber.blogspot.com
  • 4.
  • 5. i Abstract Statistically based methods for avalanche forecasting have been widely developed in many regions subject to this kind of natural hazard to detect avalanche days. Such techniques are often based on simple supervised classification methods like Nearest Neighbors and only focus on the temporal component of the avalanche activity. The purpose of this Master thesis is to build a reliable spatio-temporal forecasting model that is able to efficiently integrate spatial information about avalanche events. The application of machine learning algorithms for patter recognition, namely Support Vector Machines, is demonstrated with a case study on a dataset from Lochaber, Scotland, UK. Encouraging results were obtained in this extension of the usual forecasting procedure. The meteorological and snowpack factors globally describing avalanche likelihood in the mountain area have been combined with spatial features (issued from a Digital Elevation Model) related to the avalanche paths where the events have been observed. Hence, thanks to a huge database consisting of 17 years of daily condition observations matched with release occurrences, we could develop an excellent decision-support tool to assess the avalanche danger with a considerable spatial resolution (gullies, particular slopes, etc.). Interesting results, expressed in terms of confusion matrices related to the predictions on a test dataset (forecasts of gullies avalanche activity) as well as avalanche danger maps, are presented in this research report. Besides, the behavior of the model in discriminating safe/risky situations when dealing with critical changing conditions affecting the snowpack is proven to be consistent after a perceptive validation based on the analysis of some observed cases (a specified avalanche path on a given day). Moreover, the use of SVMs auxiliary techniques allowed to automatically highlight the most meaningful features to include in sta-tistical models aimed at successfully predicting avalanche releases in time and space. Finally, always taking the same state-of-the-art learning machine as starting point, elements of the sensitivity of the model and suggestions concerning a possible improvement of the avalanche monitoring procedure are also provided. Keywords: statistical avalanche forecasting, natural hazards, spatial and temporal ap-proach, machine learning, supervised classification, kernel methods, Support Vector Ma-chines, Nearest Neighbors, feature selection, active learning, sensitivity analysis, avalanche path, GIS mapping, Lochaber region
  • 6. ii R´esum´e Les m´ethodes statistiques de pr´evision d’avalanches ont ´et´e largement d´evelopp´ees dans de nombreuses r´egions sujettes `a ce type de danger naturel. Ces techniques sont souvent fond´ees sur de simples m´ethodes de classification supervis´ee comme celles des Plus Proches Voisins (Nearest Neighbors) et se concentrent seulement sur la composante temporelle du danger d’avalanches. Le but de ce travail de Master est de construire un fiable mod`ele de pr´ediction au niveau spatio-temporel capable ainsi d’int´egrer efficacement des informations spatiales sur les ´episodes d’avalanche. L’application d’algorithmes d’apprentissage automatique (machine learning) pour la reconnaissance des formes, `a savoir celui des S´eparateurs `a Vaste Marge (Support Vector Machines), comme il a ´et´e d´emontr´e avec un cas d’´etude concernent la r´egion de Lochaber en ´ Ecosse, Royaume-Uni, a r´ev´el´e des r´esultats encourageants dans cette extension des proc´edures habituelles de pr´evision. Les facteurs m´et´eorologiques et ceux li´es au manteau neigeux d´ecrivant globalement les conditions d’avalanche ont ´et´e combin´es avec des informations spatiales (sorties d’un Mod`ele Num´erique de Terrain) li´es aux couloirs d’avalanche o`u les ´ev´enements ont ´et´e observ´es. Ainsi, grˆace `a une vaste base de donn´ees constitu´ee de 17 ann´ees d’observations quotidiennes des situations d’avalanche et d´eclenchements associ´ees, nous avons pu obtenir un excellent outil d’aide `a la d´ecision pour ´evaluer le danger d’avalanche avec une bonne r´esolution spatiale (ravines, types de pentes sp´ecifiques, etc.). Des int´eressants r´esultats en termes de matrices de confusion li´es aux pr´edictions sur un ensemble de donn´ees de test (pr´evisions de l’activit´e avalancheuse des diff´erents couloirs), ainsi que des cartes de danger d’avalanche sont pr´esent´es dans ce rapport. En outre, le comportement du mod`ele lors de la discrimination des situations sˆures de celles `a risque, dans le cadre d’une ´evolution critique des conditions affectant le manteau neigeux, s’est av´er´e ˆetre tr`es satisfaisant. Cela apr`es une validation perceptive bas´ee sur l’´etude de cas r´eellement observ´es (un couloir d’avalanche bien d´efini en un jour donn´e). En outre, le recours `a des techniques auxiliaires li´ees aux SVMs a permis de mettre en ´evidence automatiquement quelles sont les variables les plus importantes `a inclure dans les mod`eles statistiques visant `a pr´edire avec succ`es les avalanches dans le temps et dans l’espace. Enfin, toujours en utilisant la mˆeme performante m´ethode d’apprentissage supervis´e comme point de d´epart, des ´el´ements sur la sensibilit´e du mod`ele et des suggestions concernant une ´eventuelle am´elioration de la proc´edure de contrˆole des avalanches sont ´egalement fournis. Mots cl´es: pr´evision d’avalanches statistique, dangers naturels, approche spatiale et temporelle, apprentissage automatique, classification supervis´ee, m´ethodes `a noyaux, S´eparateurs `a Vaste Marge, Plus Proches Voisins, s´election des variables, apprentissage actif, analyse de la sensibilit´e, couloir d’avalanche, cartographie SIG, r´egion de Lochaber
  • 7. iii Acknowledgments First and foremost, I am grateful for the advice and support of both my supervisors during the whole Master program. I thank Prof. Mikhail Kanevski for having introduced me to the field of machine learn-ing as well as its applications to environmental sciences and for his interest in my research. I would like to thank Dr. Alexei Pozdnoukhov, first, for his huge availability and patience when supervising me, then, for having guided me with throughout this thesis with construc-tive suggestions about the topics to focus on. His great help when dealing either with the theoretical aspects of the methods used or with their concrete implementation will not be forgotten. Thank you Alexei! Dr. Ross Purves is acknowledged for the interesting discussions about avalanche forecasting in Scotland and for the useful hints provided. Moreover, I also appreciated a lot the aid and ideas given to me by Devis, Loris, Fr´ed and the rest of the team of the geomatics group at IGAR during the work for my Master thesis. A big and deep “grazie” is addressed to my family, in particular to my parents Franca and Sandro, for the support they provided me during these years spent at the university in Lausanne. All my friends scattered in Switzerland as well as the “sp´ecialisation 2” crew of the Master deserve gratitude for the funny moments spent together during this period. Last but not least, I am grateful to the “US” relatives, namely Louis and Caroline, for proofreading the English. ...and all those I forgot, thank you!
Contents

1 Introduction
  1.1 Objectives and motivation
  1.2 Prior work on data-driven statistical avalanche forecasting
    1.2.1 Overview
    1.2.2 Prior work on avalanche forecasting in the Lochaber region

2 Machine Learning
  2.1 Supervised learning vs. unsupervised learning
    2.1.1 Nearest Neighbors for classification
  2.2 Statistical Learning Theory
    2.2.1 Empirical Risk Minimization
    2.2.2 Structural Risk Minimization
  2.3 Model selection and model assessment

3 Support Vector Machines for classification
  3.1 Large margin linear classifier
    3.1.1 Optimal separating hyperplanes
    3.1.2 The optimization problem
    3.1.3 Support Vectors and their relevance
    3.1.4 Soft margin adaptation
  3.2 Kernel expansion
    3.2.1 The principle
    3.2.2 A concrete example
    3.2.3 Valid kernel functions
    3.2.4 Details on the Gaussian RBF kernel
  3.3 Parameters tuning
  3.4 Binary classification quality measures

4 Extensions of the SVMs-based approach
  4.1 Feature selection
    4.1.1 Methods overview
    4.1.2 SVM-Recursive Feature Elimination
  4.2 Probabilistic SVM output interpretation
    4.2.1 Interpretations for decision support
    4.2.2 The sigmoid transform
    4.2.3 Parameters tuning
  4.3 Active Learning with SVMs
    4.3.1 Principles
    4.3.2 Overview of the existing techniques

5 Avalanche forecasting as a spatio-temporal classification problem
  5.1 Avalanche data from Scotland: the Lochaber region case study
  5.2 Set up of the spatio-temporal classification problem
  5.3 Choice and conception of the input features

6 Prediction of avalanche activity at individual paths
  6.1 SVM training and parameters tuning
    6.1.1 Preprocessing
    6.1.2 Model optimization
  6.2 Predictions for years 2006-2007
    6.2.1 Results
    6.2.2 Comments and observations

7 Avalanche danger mapping
  7.1 Avalanche danger assessment: probabilistic SVM output tuning
  7.2 Mapping on the prediction grid
  7.3 Gradient mapping

8 Extended analysis of avalanche data with SVMs-related methods
  8.1 Relevant features choice: RFE
    8.1.1 Set up of the automatic procedure
    8.1.2 Results
    8.1.3 Interpretation
  8.2 Model behavior under changing conditions
    8.2.1 Motivation
    8.2.2 Methodology
    8.2.3 Results and interpretation
  8.3 Active Learning as an exploratory tool in avalanche monitoring
    8.3.1 Methodology
    8.3.2 Results

9 Conclusions
  9.1 Main achievements
  9.2 Further work on this topic

A European Danger Scale
B Avalanche danger maps
C MATLAB code: gradient mapping
Chapter 1
Introduction

1.1 Objectives and motivation

The machine learning domain, presented in chapter 2, provides many scientific research fields, especially in the last few years, with a solid framework based on a wide variety of techniques aimed at the analysis of datasets of increasing complexity and size.

Particularly, the environmental sciences area appears to be one of the well-matched subjects where such methods can be applied. In fact, among the broad variety of subfields related to geosciences, the latest progress in the automatic extraction of dependencies from data has found a valuable application in the forecasting of natural hazards, a theme frequently discussed during the attended Master program. Predictive models founded on concepts issued from machine learning are robust and very well suited for operational danger assessment purposes. From this point of view, the topic of avalanche forecasting shows significant potential for promising developments. The statistical approach frequently used to evaluate the likelihood of snow releases on the slopes of a mountain (see the prior work reported in section 1.2) can be improved to obtain an extended and enhanced decision-support system helping avalanche forecasters in their daily job.

However, the main purpose of this work is to explore the possible applications of several machine learning techniques in this research field, without focusing particularly on the issues affecting operational aspects of forecasting. The reasons behind such an approach are mainly related to the fact that studies joining these two domains are in their early stages, and to the realization that my knowledge of the specificities of the avalanche forecasting process is not adequate compared to that of forecasters with years of experience.

Nonetheless, the scope of this work is to build a reliable predictive model aimed at giving an efficient spatial extension of the forecasting systems originally designed to produce predictions about global avalanche activity over a whole region. Therefore, the morphological characteristics of the mountain range terrain affecting local scale weather and snowpack conditions will be taken into account by the presented learning machine.

The core of the analysis is centered on the well-known supervised classification method named Support Vector Machines (SVMs). This product of Statistical Learning Theory will be discussed in chapter 3. The performance of such a classifier when dealing with high-dimensional
data will allow the incorporation of a wide range of features describing avalanching conditions at the level of single avalanche paths. The classification problem will be set up by matching these variables with the related actual activity of a given gully, giving rise either to an avalanche event or to a safe situation. This spatio-temporal approach to avalanche forecasting is described in chapter 5, while the results in terms of the classification quality of the predictions for the 2006 and 2007 winter seasons are reported in chapter 6.

While focusing on SVMs as the main root of the methodological part of the work, the objectives of the research also consist in developing some tools, based on the classical machine learning/SVMs data-driven approaches described in chapter 4, used to highlight some properties of the studied avalanche hazard by taking into account the spatial variation of the phenomenon. The feasibility of a mapping of the avalanche danger over the region under study will be considered in chapter 7. Then, we attempt to identify the most useful features to involve in the classification task by assessing their real influence on the decisions taken by the model and on the evolution of the avalanche danger. Next, we investigate the actual sensitivity of the model to changing meteorological and snowpack conditions. Furthermore, some suggestions are given for the possible optimization of the information gathering procedure through improvements in the avalanche monitoring task. All these topics will be covered in chapter 8.

This thesis extends the previous work on this topic (see [29]) carried out by Dr. Alexei Pozdnoukhov during his post-doctoral fellowship at the Institute of Geomatics and Analysis of Risk (IGAR) of the University of Lausanne (information about the main research achievements on www.geokernels.org). The case study that will be treated concerns the region called Lochaber, located in the northern part of Scotland, UK, which is subject to numerous avalanche events during the winter season. Avalanche data collected on the slopes of these mountain ranges were available because of the previous collaboration between IGAR and the sportScotland Avalanche Information Service (www.sais.gov.uk), thanks to the contribution of Dr. Ross Purves.

1.2 Prior work on data-driven statistical avalanche forecasting

1.2.1 Overview

Avalanche forecasting is a crucial task for many winter resorts where many skiers, mountaineers and climbers are present every day. The procedure, which results in a report of avalanche conditions with associated danger, is carried out manually by the forecasters of the region. These experts are in the field every day to understand the evolution of the different factors affecting avalanche releases. Information about snowpack conditions and stability, weather parameters and actual avalanche activity is collected by the observers on a daily basis.

Nevertheless, in some skiing venues, numerical models are available to support the decisions taken based on the experience of the forecasters. Some physical models exist to aid in the assessment of snowpack evolution (see [1] for the case of Switzerland) but, generally, statistically based forecasting systems are much more commonly used. These
models are devoted to the prediction of current avalanche activity by looking for similarities with conditions influencing releases recorded in the past (meteorological and snowpack factors, essentially).

The statistical models currently used operationally or tested on real avalanche data produce temporal forecasts about global avalanche activity in a given region on a given day. Avalanche days and safe days are discriminated using several different statistically based techniques belonging to the supervised learning category (pattern recognition). These methods include discriminant analysis [13], regression trees and Nearest Neighbors [3].

The last technique mentioned is widely applied for operational forecasting in many different countries. For example, in Switzerland the NXD system (NXD2000 and NXD-REG, described in [14] and [2]), developed by the Swiss Federal Institute for Snow and Avalanche Research (SLF), is used at a local and regional scale to help experts produce final avalanche danger reports. These specialists receive as model output the 10 most similar days (included in the database of past observed conditions) to the current day's situation. By checking under which conditions and in which locations avalanches have been observed on these days, they are given concrete, helpful information to use in assessing the actual avalanche danger. The next subsection will illustrate the use of these nearest neighbors methods in Scotland.

1.2.2 Prior work on avalanche forecasting in the Lochaber region

The case study that will be discussed throughout this thesis concerns avalanche forecasting, namely forecasting that includes the spatial component of the avalanche activity, in the Lochaber area, Scotland, United Kingdom. In this introductory part of the work I will present a short survey of work done in this field using the same avalanche data.

Nearest Neighbors model Cornice

Purves et al. in [30] describe the Nearest Neighbors model developed for the operational forecasting of avalanche activity in the Scottish mountainous region under study. In conjunction with local avalanche forecasters, the scientists involved in this project implemented a decision-support system called Cornice, which provides useful information about past avalanching conditions, helpful in producing a reliable hazard report. The forecasts are made available in the afternoon (around 3 pm) and include a description of the situation experienced during the day as well as the expected development of avalanche activity over the next 24 hours.

The model takes as inputs different meteorological and snowpack variables influencing the release of avalanches in the region (a list of the available variables is given in table 5.1 in section 5.1). A historical database starting in 1991 is then searched. The outputs consist of the values taken by the same input variables during the 10 most similar recorded days (using the Euclidean distance of equation (2.1) as a dissimilarity measure). Additionally, the spatial locations of the documented avalanche events occurring during these days are also shown on a geo-referenced map. Hence, both the causes, in terms of weather/snowpack conditions, and the consequences, in terms of possible avalanche events, are available to the forecasters.
The model developers did not use subjective weighting of the inputs based on forecasters' experience, but instead chose to implement an automated procedure to find the optimal weights. The optimization of the variables' relevance has been carried out by means of genetic algorithms using several fitness metrics to evaluate the ability of different sets of weights to correctly forecast avalanche and non-avalanche days. For both the optimization of the parameters and the verification (testing) of the model, on a given day, a forecast of avalanche activity is produced if 3 or more of the 10 nearest neighbors were avalanche days. If this threshold is not reached, the day under examination is forecast as safe.

The batch testing of the model (assessing the generalization error by cross-validation) has been carried out on 1323 days (actually 1005, because of no-visibility days), covering the years from 1991 to 2002, in order to evaluate the agreement of the model forecasts with the observations. The results can be summarized with binary confusion matrices (contingency tables) for which several categorical statistics can be computed (see section 3.4). The best prediction performances were obtained with an optimization via either the Hanssen and Kuipers discriminant or the Unweighted average accuracy, leading to an Overall Accuracy of 0.83 and to a Hanssen and Kuipers discriminant value of 0.61. The models correctly forecast slightly more than 200 avalanche days, with only approximately 60 misses and 115 false alarms.

The Cornice application produced quantitative results considered very encouraging by the authors. However, its main utility is clearly recognized as a support for the forecasters in the information gathering and hypothesis testing process allowing avalanche danger assessment.

Support Vector Machines model

This temporal avalanche forecasting approach has been revisited by Pozdnoukhov et al. in [29], who applied machine learning methods to the Lochaber dataset with the purpose of increasing the accuracy of the predictions. In this work, the high-performing supervised classifier called Support Vector Machine is used firstly to improve the discrimination ability in the temporal predictive task (avalanche days vs. non-avalanche days), and then applied as a preliminary extension to spatial avalanche danger forecasting.

The adopted methodology was centered on a purely data-driven approach starting with the selection of the relevant features to be employed, using the automated procedure called Recursive Feature Elimination (see section 4.1). An initial set of 44 variables, comprised of combinations of the variables measured on the slopes (current day features, previous days features, expert features), was filtered by retaining the 20 most valuable non-redundant features for the classification task. After SVM parameters optimization by cross-validation on the winters from 1991 to 2001, a test of the model performance was carried out on the 712 days of observations in the period 2001-2007. The method showed a satisfactory ability to detect avalanche days: the Overall Accuracy reached 0.86 whilst the Hanssen and Kuipers discriminant scored 0.64. A comparison with nearest neighbors methods applied on the same dataset demonstrated a slight superiority of the SVM technique.
Furthermore, a transform of the SVM decision function into a probability (see section 4.2 for details about the method) allowed a reliable interpretation of the outputs of the model in terms of the likelihood of an avalanche occurring on a given day (application to 2003/2004
winter). Given the well-known ability of this machine learning method to deal with high-dimensional data, an additional set of spatially varying features such as altitude, slope or aspect was added to the vector describing the avalanching conditions on a specified day. The purpose was to characterize the local situation at each avalanche path of the Lochaber region by providing the model with examples of about 700 avalanche events whose spatial attributes have been documented. The authors have then been able to extrapolate the avalanche activity indicator over the whole study area thanks to a digital elevation model (DEM). Such a spatio-temporal approach has been presented as an early result of a procedure needing refinements and further work aimed at the assessment of the validity of the results.

This initial work, as well as some improvements (spatial distribution of some meteorological features such as wind fields, etc.) already put into practice by the cited researchers (see [28]), is taken as a starting point for this thesis.
Chapter 2
Machine Learning

The broad research field of machine learning, rapidly developing in the last decades, is often described as a subtopic of computer science whose core concepts and ideas derive from closely related domains such as statistics and artificial intelligence. Broadly speaking, machine learning can be presented as a collection of techniques that are able to “learn” from examples the dependencies existing in the data of a given predictive task (the tasks are described in section 2.1). The different methods are designed so that the learning procedure takes place in an automatic and data-driven way. This means that, in general, no human prior knowledge or assumptions concerning data probability distributions are used during the process. For a good foundation on the topic and for additional information, [6] is suggested.

The fields, with related real-world applications, concerned by these state-of-the-art techniques are countless. Those involved earliest include bioinformatics/biometry (biosequence analyses), chemistry (cheminformatics/chemometrics), medicine (diagnoses), data mining (financial data), web and text mining (text or webpage categorization), speech and hand-written character recognition, etc. Nevertheless, the development of research in the area of environmental sciences took place only later on, with applications in domains such as spatial interpolation, remotely sensed image classification, etc. (see [17], [18]). In fact, geo-spatial phenomena modeling would benefit greatly from the operational use of the latest breakthroughs that have occurred within the machine learning community. Avalanche forecasting in particular, the topic of this thesis, is one of the geosciences domains for which machine learning methods show much promise [29].

2.1 Supervised learning vs. unsupervised learning

Machine learning methods may be classified into the categories of supervised and unsupervised learning.

Supervised learning can be thought of as a process by which a learning machine is guided through a training procedure to learn the input/output relationships existing in the data set. These examples are called the training data. Each individual sample/example is described by an input vector x belonging to ℝ^N, usually referred to as the input, and presents
a related known output y. This means that each sample can be represented as a vector in an N-dimensional space (N variables). Depending on the type of the y value, one can define the task as a regression problem or a classification problem (pattern recognition). In the first case the output associated with a given input is a real value y ∈ ℝ. In the second case, with which this thesis will be dealing, the output values are discrete, resulting in a binary classification task if y ∈ {−1, 1} or in a multi-class classification task with m classes if y ∈ {1, 2, . . . , m}. The learning machine, after having seen all L training examples {(x_1, y_1), . . . , (x_L, y_L)}, then provides an estimate of the original function y = f(x) mapping the inputs to the output domain.

The other learning approach may be termed unsupervised learning. In this case the learning machine is not provided with the outputs y, and the goal of the method is to extract information about the process which generated the data. The main types of this kind of learning are clustering (also known as cluster analysis) and density estimation. The first one is concerned with the grouping of the data points into clusters whose members have similar characteristics, without knowing their true class labels. Density estimation methods attempt to model the underlying probability distribution of a certain observed phenomenon. Combinations of the supervised and unsupervised domains are also possible, resulting in semi-supervised learning, an approach where labeled and unlabeled examples are provided at the same time to the learning machine. A summary of these hybrid techniques, implemented to make use of all the available information in order to improve the predictive model, can be found in [5].

The present thesis mainly deals with the supervised approach for binary classification problems. The chosen learning system, and its associated tools, is known as Support Vector Machines (SVMs). The technique is part of the subfield of machine learning referred to as kernel methods [35]. This supervised classification method based on the so-called Support Vectors will be detailed in chapter 3. In [11] the reader will find a comprehensive description of other supervised learning techniques. These include Fisher's Linear discriminant analysis, Logistic regression, Decision trees, Multi-Layer Perceptrons, Probabilistic Neural Networks, k-Nearest Neighbors, etc. The latter will be discussed in the next subsection (2.1.1) since it is a benchmark method widely used in avalanche forecasting.

2.1.1 Nearest Neighbors for classification

The technique called k-Nearest Neighbors (k-NN) is probably the most intuitive method to solve a classification problem. One can reasonably think that similar inputs x, in other words examples described by variables taking analogous values, will possess, in most cases, the same output class label y. This leads to a decision about the class membership of a new point x based on its Euclidean distance (see equation (2.1)) to the training samples x_i. The cited dissimilarity measure between samples u and v is computed as

dist(u, v) = \sqrt{ \sum_{d=1}^{N} (u_d - v_d)^2 },   (2.1)

where d is the variable index.
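To make this rule concrete, the following is a minimal MATLAB sketch of a k-NN classifier built on the Euclidean distance (2.1) and on the majority vote described in the next paragraph; the function and variable names are illustrative and do not come from the code used in this thesis.

function ynew = knn_classify(Xtrain, ytrain, xnew, k)
% Minimal k-NN classifier for binary labels y in {-1,+1}.
% Xtrain : L x N matrix of training inputs
% ytrain : L x 1 vector of class labels
% xnew   : 1 x N input vector to classify
% k      : number of neighbors taking part in the vote

% Euclidean distances of equation (2.1) to all training samples
d = sqrt(sum(bsxfun(@minus, Xtrain, xnew).^2, 2));

% labels of the k closest training samples
[~, idx] = sort(d, 'ascend');
nearest  = ytrain(idx(1:k));

% majority vote: sign of the sum of the neighboring labels
ynew = sign(sum(nearest));
if ynew == 0          % tie: fall back on the single nearest neighbor
    ynew = nearest(1);
end
end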
In order to predict the class label y of the vector currently under consideration, a majority vote is set up among the k nearest examples (k smallest distances) found in the N-dimensional input space. With a fixed distance measure, the only parameter to tune to get the optimal accuracy in the class label assignments is the number k of neighbors to include in the decision vote.

Essentially, choosing a low value of k corresponds to assuming that the data are not corrupted by noise (a structured dataset), so that a close correspondence can be established between the training vectors at our disposal and the new ones whose label y should be forecast. On the contrary, choosing a large k in most cases means that we believe that the configuration of the training examples is largely unstructured, leading to a tricky input/output matching. This gives rise to a decision process involving a larger set of neighboring examples, approaching a simple global majority vote when k tends to the number of training samples L.

The approach presented here provides good results particularly for low-dimensional datasets. Due to this success, as well as its appealing logic, k-Nearest Neighbors is often used as a reference technique. On the other hand, when dealing with many variables, this algorithm suffers from the so-called curse of dimensionality. In a high-dimensional input space, new samples whose labels are to be predicted by looking at their neighborhood are often found to be equally far from all the training inputs, precluding any reliable prediction.

2.2 Statistical Learning Theory

In the domain of machine learning, Statistical Learning Theory [39], also known as Vapnik-Chervonenkis theory, first developed by V. Vapnik in the 1960s and 1970s, provides a good framework for so-called predictive learning. The main goal of this theory is the optimal assessment of a model according to a trade-off between its ability to honor the available information and its complexity.

As stated in section 2.1, a supervised learning model, at the end of the training period, retains a function executing the mapping y = f(x), typically called the decision function for a classification problem. This function should be chosen from a set of functions F = {f(x, α), α ∈ Λ}, where α represents a vector of parameters selected from the set Λ. According to Vapnik's concepts, the criterion used to evaluate the goodness of the choice of a given function f(x, α), in other words its similarity to the unknown target function that depicts the actual input/output dependencies, is the following risk functional, called the expected risk:

R(α) = \int Q(y, f(x, α)) \, dP(x, y),   (2.2)

where Q(y, f(x, α)) is a task-defined loss function and P(x, y) is the unknown joint probability distribution of the examples. As can be intuitively understood, the risk should be as low as possible, so our goal is to minimize the expected average loss (2.2).

Reviewing the two main learning problems already mentioned (omitting clustering and density estimation), let us introduce the loss function most commonly used in pattern
recognition:

Q(y, f) = 0 if f(x) = y, and Q(y, f) = 1 otherwise.   (2.3)

For such a loss function, the resulting expected risk is nothing but the probability of a classification error.

In the domain of regression problems the aim is to minimize the differences between the actual output value y and the predicted one f(x) for every example. This is translated into mathematical terms, in most cases, by means of the squared loss function

Q(y, f) = (y − f(x))^2.   (2.4)

2.2.1 Empirical Risk Minimization

Once the principles allowing us to evaluate the performance of a learning machine have been defined, Statistical Learning Theory reminds us that, in fact, the distribution P(x, y) of equation (2.2) is unknown, so that the only known input/output pairs are those of the given finite set of examples. The first thought is to approximate the theoretical risk functional by an empirical one, simply computed on the training examples as

R_{emp}(α) = \frac{1}{L} \sum_{i=1}^{L} Q(y_i, f(x_i, α)),   (2.5)

where L is the number of training samples. A minimization of this function, the Empirical Risk Minimization, is then carried out in order to select the best set of parameters α. However, such a choice is strongly dependent on the examples provided to the learning machine for training. As discussed in more detail in section 2.3, it is possible to partially circumvent this drawback by using a cross-validation methodology or by splitting the initial dataset into 2 parts (use of an independent set of data). Additionally, the same section will explain that, when aiming at evaluating the overall performance of the learning machine, yet another set of examples is required.

2.2.2 Structural Risk Minimization

In the theoretical framework of Statistical Learning Theory, with the purpose of considering the ability of a model to extend the learnt relationships to unobserved new data, the notion of Structural Risk Minimization is introduced. Essentially, the idea is to place an upper bound on the expected risk (2.2) which varies according to the empirical risk and a defined confidence interval, such that

R(α) ≤ R_{emp}(α) + \sqrt{ \frac{ h \left( \log(2L/h) + 1 \right) − \log(η/4) }{ L } },   (2.6)

where L is the number of training samples and h is the so-called Vapnik-Chervonenkis dimension (VC-dimension) of the set of functions used [39]. The resulting inequality, which holds with probability 1 − η, reports a particular bound valid only for the classification case.
The quantity h deserves some further explanation because it is one of the main concepts of Vapnik's theory. For a binary classification problem, h can be interpreted as the maximum number of samples for which a class-consistent partitioning can always be achieved using the given set of functions. A two-dimensional data set consisting of 3 vectors can always be separated with a linear function, no matter what the labeling of the points is. A difficulty occurs if there are 4 samples to shatter: a chessboard-like setting will forbid any valid linear separation. Finally, we can state that linear decision functions in ℝ^N, hyperplanes of the form f(x) = w · x + b, possess a VC-dimension of N + 1. In comparison, a polynomial function of degree 2 applied in ℝ^2 has a VC-dimension of 4 and, as a borderline case, for the function f(x) = b sin(w x) this quantity is equal to infinity (the high frequency obtained for a large ‖w‖ allows the separation of every possible configuration of points).

Looking at equation (2.6), it may be seen that the expected risk is minimized when the confidence interval, the second term on the right-hand side of the inequality, is kept small by a low h/L ratio. By the mentioned inequality, a function with a large VC-dimension h which perfectly fits a small number of data points L will result in a large expected risk, since there is overfitting. Such a complex model will likely lead to an important generalization error. Figure 2.1 illustrates how the bound on the risk varies depending on model complexity.

Figure 2.1: Bound on the risk varying according to the confidence interval and the empirical risk associated with sets of models of increasing complexity. After [39].

To summarize, the Structural Risk Minimization principle provides a theoretical framework for achieving the optimal trade-off between the classification accuracy on training data and the capacity of the set of functions selected. Later on, in subsection 3.1.4 of chapter 3, we will look at the concrete means the SVM algorithm supplies to handle this kind of issue. The next section illustrates the general procedure adopted when using a supervised learning approach.
2.3 Model selection and model assessment

The preceding sections have discussed how Statistical Learning Theory allows the evaluation of the performance of a model with respect to its complexity. When one is concretely applying a supervised learning classification algorithm there are several practical considerations that need to be respected in order to properly use the method.

First, the model selection step is crucial. The fact that the empirical error (training error) is computed on the training examples given to the learning machine should be taken into consideration when choosing the optimal parameters. A model that closely or perfectly fits noisy or non-representative training data (see the example of figure 2.2) is said to overfit (as opposed to a too simple model, which gives rise to the situation called underfitting). Overfitting results in a poor generalization ability of the system when dealing with new data. It is required that the tuning of the parameters defining the model is carried out on an independent data set (different from the training one). A set of labeled examples called the validation set is extracted from the original data and held separate from the training subset in order to compute the classification quality measures (validation error, etc.). Predictions of class memberships are performed on the validation set ignoring the actual known class labels, so that the agreement between the true and predicted class assignments can then be checked. An optimization process allows the user to determine the best parameters for the classification task.

Figure 2.2: Example of an overfitting situation for a binary classification problem. The green discriminating boundary perfectly separates red and blue points by overfitting this training data. The classifier shown in black allows some training errors but will then be able to predict the class labels of a new set of data in a more robust way.

Another split of the data is mandatory if one desires to assess the generalization error of the selected model (model assessment). An independent test set should be used, whenever possible, to assess the true performance of the model. In this way, the performance is estimated on independent data, reproducing the future behavior in a new situation. In fact, it is not fair to report the performance obtained on the previously used validation set as a model success measure, because the learning machine is favorably biased towards this data (parameters perfectly tuned for this set) [17].
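As a minimal illustration of this splitting strategy, the following MATLAB sketch first reserves an independent test set and then carves a validation set out of the remaining data. The 20% test fraction is an arbitrary illustrative choice, as are the variable names; the 25%/75% validation/training partition is the one mentioned again in section 3.3.

% X : L x N matrix of inputs, y : L x 1 vector of labels (illustrative names)
L   = size(X, 1);
idx = randperm(L);                      % random shuffling of the samples

nTest  = round(0.20 * L);               % e.g. 20% kept aside for the final model assessment
iTest  = idx(1:nTest);
iRest  = idx(nTest+1:end);

nVal   = round(0.25 * numel(iRest));    % 25%/75% validation/training split of the rest
iVal   = iRest(1:nVal);
iTrain = iRest(nVal+1:end);

Xtrain = X(iTrain, :);  ytrain = y(iTrain);   % used to train the classifier
Xval   = X(iVal, :);    yval   = y(iVal);     % used to tune the hyper-parameters
Xtest  = X(iTest, :);   ytest  = y(iTest);    % used only once, to estimate the generalization error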
Chapter 3
Support Vector Machines for classification

This chapter will focus on the learning machine that is at the core of almost every step of the analyses performed in this thesis. The system implementing in an efficient and robust way the training of a supervised classifier is Support Vector Machines (SVMs). Moreover, SVMs adhere to the guidelines provided by the Statistical Learning Theory discussed in section 2.2 of the previous chapter. Section 3.1 will examine how and why a linear decision function can optimally be used as a foundation for the classification task when applied in the high-dimensional space induced by the kernel expansions delineated in section 3.2.

3.1 Large margin linear classifier

3.1.1 Optimal separating hyperplanes

When dealing with a problem where different objects have to be divided into two categories by placing a discriminating boundary, the most intuitive option is to draw a separating line. This is exactly the principle applied by SVMs. More generally, in an N-dimensional space, the line becomes a hyperplane f(x) = w · x + b. The input vector x ∈ ℝ^N is multiplied by a weight vector w which needs to be optimized along with the scalar b. In 2D (2 variables x_1 and x_2 describing the examples) the resulting function gives the equation of a plane with coordinates (f(x), x_1, x_2). If a horizontal plane is defined at the height of the level curve f(x) = 0, linearly separating the data points, and if these vectors are labeled following the sign of the function f(x), they are classified either in the positive class, if lying above the f(x) = 0 surface, or, otherwise, in the negative class (below the horizontal plane).

In order to construct an optimal hyperplane for a linearly separable case, let us define some strict conditions for the class-labeling task it carries out. For the training dataset, the
values of the decision function f(x) should respect

w · x_i + b ≥ +1, if y_i = +1
w · x_i + b ≤ −1, if y_i = −1.   (3.1)

A positive sample (y_i = +1) should therefore be associated with a decision function value greater than or equal to 1 and, on the other hand, a negative input (y_i = −1) should be given a value less than or equal to −1. These two parts of equation (3.1) can be merged into

y_i (w · x_i + b) ≥ 1.   (3.2)

This formulation tells us that there should not be any training vector lying in the region where the hyperplane takes values between +1 and −1, and that only a few points will lie exactly on the level curves of height +1 or −1. As can be seen in figure 3.1, the samples located on the level curves are called support vectors (SVs) and the region between the positive one (f(x) = +1) and the negative one (f(x) = −1) is referred to as the margin, of width ρ. Obviously, the decision boundary between the two classes becomes the hyperplane f(x) = w · x + b = 0.

Figure 3.1: Geometrical representation (2D) of the location of the SVs and the consequent class margin placements. Following [19].

The goal of a classifier is to generalize the rules learned from the training data to situations where new instances have to be classified. Thus, if one tries to place the separating hyperplane in such a way that most of the new data points will be found on the correct side of the class boundary, the solution consists in looking for the largest possible margin. The small margin hyperplane visible on the left side of figure 3.2 correctly splits the training points (solid colored marks) of the two classes (circles vs. crosses), but when testing examples (grey marks) are introduced it reveals a poor generalization ability (many misclassification errors). On the contrary, the large margin obtained on the right side is robust and is more likely to classify the new samples correctly. The width of the margin can easily be computed as

ρ = \frac{w}{‖w‖} · (x_+ − x_−) = \frac{w · x_+ − w · x_−}{‖w‖} = \frac{(1 − b) − (−1 − b)}{‖w‖} = \frac{2}{‖w‖},   (3.3)
where w is the vector defining the hyperplane, x_+ is one of the positive class SVs (contributing to the margin definition) and x_− is a negative class SV.

Figure 3.2: The introduction of the testing samples (in grey) leads to many classification errors when the margin is not optimized (left figure). Modified after [19].

As shown by equation (3.3), the goal is to minimize ‖w‖. This intuitive minimization problem is theoretically justified by the insights of Statistical Learning Theory [39]. It is stated that the complexity h of the set of functions is bounded by

h ≤ min(R^2 ‖w‖^2, N) + 1,   (3.4)

where R is the radius of the smallest sphere enclosing all the training vectors belonging to ℝ^N. Consequently, a large margin, implying a small ‖w‖, helps keep the capacity of the model low, pertinent and thus efficient.

3.1.2 The optimization problem

In order to accomplish the training of the machine, we are faced with an optimization problem. SVMs provide an efficient algorithm to maximize the margin (3.3) whilst respecting the constraints (3.2). Taking advantage of the concepts of the constrained optimization paradigm (Lagrangian theory), developed by Lagrange at the end of the 18th century, and the extensions provided in the 1950s by Kuhn and Tucker, the following results can be derived. After having introduced the Lagrange multipliers α_i ≥ 0 associated with the training inputs x_i, one can express the so-called primal formulation of the optimization problem (primal Lagrangian) as

L_P = \frac{1}{2} ‖w‖^2 − \sum_{i=1}^{L} α_i y_i (w · x_i + b) + \sum_{i=1}^{L} α_i.   (3.5)

Since we are looking for the maximal margin (minimal ‖w‖), the task consists in minimizing (3.5) with respect to w and b. Because of the convexity of the function L_P, this is done by searching for the values at which the associated derivatives (3.6) vanish:

\frac{∂L_P(w, b, α)}{∂b} = 0,   \frac{∂L_P(w, b, α)}{∂w} = 0.   (3.6)
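Carrying out this differentiation explicitly on (3.5) gives the two conditions that are substituted into the primal form just below:

\frac{∂L_P}{∂b} = −\sum_{i=1}^{L} α_i y_i = 0  ⟹  \sum_{i=1}^{L} α_i y_i = 0,

\frac{∂L_P}{∂w} = w − \sum_{i=1}^{L} α_i y_i x_i = 0  ⟹  w = \sum_{i=1}^{L} α_i y_i x_i.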
The resulting conditions

\sum_{i=1}^{L} α_i y_i = 0,   w = \sum_{i=1}^{L} α_i y_i x_i,   (3.7)

can be substituted into the primal form to get the dual formulation of the problem

L_D = \sum_{i=1}^{L} α_i − \frac{1}{2} \sum_{i,j=1}^{L} α_i α_j y_i y_j \, x_i · x_j.   (3.8)

At this point one finds the parameters α_i by maximizing (3.8) with respect to these same α_i, subject to the constraints \sum_{i=1}^{L} α_i y_i = 0 and α_i ≥ 0, i = 1, . . . , L. The cited task actually consists of a quadratic programming problem (quadratic objective function with linear constraints). The solution of the optimization problem allows the final SVM decision function to be formulated as

f(x) = \sum_{i=1}^{L} y_i α_i \, x · x_i + b.   (3.9)

The predicted class label (+1 or −1) is simply assigned following the sign of (3.9) when dealing with a binary classification task. If the input vectors belong to more than 2 classes, the solution consists of combining several binary classifiers with either a one-vs-all approach or a one-vs-one approach. This multi-class extension of SVMs is accurately described in [34]. A comprehensive and clear description of the optimization problem and its resolution, summarized in this section, can be found in [34] and [9].

3.1.3 Support Vectors and their relevance

The main outputs of the training procedure of the SVM are the values α_i. Looking at equation (3.9), one can see that these coefficients are the weights given to each training vector x_i. However, only a small proportion of them receive a non-zero α_i. Thus, only a subset of the initial training set is truly influential in the evaluation of the decision function. These informative points are called support vectors and, conforming to the situation depicted by figure 3.1, they lie on the margin (on the positive or negative side according to their label y_i). For the support vectors, the inequality (3.2) turns into the equality y_i (w · x_i + b) = 1. Given that such a subset is the only fraction of the data that participates in the prediction, the same result would be achieved if all the rest of the points were withdrawn from the training set before training the system.

3.1.4 Soft margin adaptation

In subsection 3.1.1, figure 3.1 shows a linearly separable situation where the two classes are not overlapping: the training examples are described by inputs that can be partitioned by a hyperplane. Clearly this is an ideal situation one will rarely be dealing with. In reality, data are usually noisy, so that it is impossible to avoid training errors when drawing a separating line. These considerations lead to a slightly different formulation of the large margin classifier, the soft margin classifier. The “hard” margins presented with (3.2) are “softened”
by means of the slack variables ξ_i. The intuition consists of letting noisy training samples (lying outside their class level curve +1 or −1) fulfill the requirements as

y_i (w · x_i + b) ≥ 1 − ξ_i.   (3.10)

In this way, positive (negative) vectors can be associated with a decision function which does not have to be strictly larger (smaller) than 1 (−1). For example, a sample lying on the wrong side of the decision boundary w · x + b = 0 will be given a ξ_i > 1 so that it will then be treated as a coherent class member. Figure 3.3 shows for which inputs the slack variables have to be introduced.

Figure 3.3: Slack variables ξ_i are assigned to noisy samples lying outside their class margin. Following [19].

In order to keep a low empirical error (2.5) one should, of course, force the algorithm to assign non-zero ξ_i values to as few of the training samples as possible. Therefore, in the optimization process, the first term of the initial functional (3.5) that has to be minimized is substituted by

\frac{1}{2} ‖w‖^2 + C \sum_{i=1}^{L} ξ_i.   (3.11)

The left term in (3.11) is the one the procedure had to minimize to find the largest possible hard margin. The added right term, which also has to be minimized, accounts for the number and relevance of the misclassification errors in the training set.

The weighting constant C (cost) allows the user to control this kind of error during the training phase and conveys the confidence the user has in the data. With a large value of C, implying the belief that the dataset is not noisy, every misclassified example is heavily penalized, leading to a very small training error. The drawback is that such a great importance conferred to the training data will give rise to the overfitting phenomenon, due to the complexity of the applied model. From this point of view, the inverse of C can then be interpreted as a regularization constant. Furthermore, concerning the quadratic programming problem, the parameter C turns out to be the upper bound for the α_i, so that 0 ≤ α_i ≤ C, ∀i.

So, the minimization of the first term of (3.11) corresponds to lowering the upper bound on the VC-dimension (see equation (3.4)) controlling the confidence interval described in
equation (2.6). The second term of this functional mainly controls the empirical error which appears in (2.6). Finally, both terms contribute to keeping the expected risk low, since the second term of (3.11) also suggests the use of a simple model (small h) if one chooses a low value of C.

3.2 Kernel expansion

3.2.1 The principle

Up to this point, we have seen how a linear decision function can be optimally applied to classify our examples with binary labels. When dealing with challenging data sets where the input/output relationships are non-linear, we need a cleverer way to discriminate the two classes. The key idea is to map the dataset into a space of higher dimension and then perform the well-known linear separation on the transformed data, rather than applying complex decision functions directly to the initial data set. This is possible since we have seen in equation (3.9) that, for the linear case, the decision about the class membership of a new sample x depends only on a dot product between this input vector and the training samples x_i. Thus, the intuition, called the kernel trick, is to substitute the dot product with a kernel function K(·, ·) involving the same two vectors, so that the final decision function changes to

f(x) = \sum_{i=1}^{L} y_i α_i K(x, x_i) + b.   (3.12)

This is the final formulation of the decision function for a classification task carried out with SVMs. The function K(·, ·), for simplicity referred to as the kernel, carries out the mapping to the higher dimensional space, not directly by generating the longer coordinate vector out of the two samples, but in an implicit way. The result of the dot product involving the mapped vectors, φ(x) and φ(x_i), is equal to the output of the kernel computed with the low-dimensional vectors as inputs:

x · x_i ⟼ φ(x) · φ(x_i) = K(x, x_i).   (3.13)

Using the machine learning vocabulary, we refer to the original space as the input space, whilst we name the kernel-induced one the feature space.

3.2.2 A concrete example

As a demonstration, one can try to compute the polynomial kernel of degree 2, defined as K(x, x_i) = (x · x_i + 1)^2, for a pair of inputs belonging to ℝ^2, u = (u_1, u_2) and v = (v_1, v_2). One finds out that, as shown by the equalities in (3.14), applying such a kernel to u and v results in the same sum of terms that one would have obtained with a simple dot product between two high-dimensional mappings of the original vectors. The mapping that we refer
to is the following: φ(u) : (u_1, u_2) ⟼ (u_1^2, u_2^2, \sqrt{2} u_1 u_2, \sqrt{2} u_1, \sqrt{2} u_2, 1), resulting in a feature space of 6 dimensions, i.e. φ(u) ∈ ℝ^6.

K(u, v) = (u · v + 1)^2   (3.14)
        = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 v_1 u_2 v_2 + 2 u_1 v_1 + 2 u_2 v_2 + 1
        = (u_1^2, u_2^2, \sqrt{2} u_1 u_2, \sqrt{2} u_1, \sqrt{2} u_2, 1) · (v_1^2, v_2^2, \sqrt{2} v_1 v_2, \sqrt{2} v_1, \sqrt{2} v_2, 1)

3.2.3 Valid kernel functions

The rapid developments in the field of kernel methods have brought a wide range of different kernel functions that can be successfully applied. However, it is important to recall that not every function involving two vectors constitutes a kernel. In fact, valid kernels have to fulfill the so-called Mercer's conditions (see [9]). In a few words, these constraints must be met for a selected function K(x, x_i) to act as a kernel associated with the desired feature space (output equal to the dot product of the mapped vectors). Strictly speaking, this means that the kernel matrix K = (K(x_i, x_j))_{i,j=1}^{n} has to be symmetric and positive semidefinite (possess non-negative eigenvalues). The matrix K, also known as the Gram matrix, has as elements the outputs of the kernel function for every pair of input vectors (x_i, x_j).

Additionally, user-defined kernel functions can be created by multiplying or adding valid kernels, since the resulting functions also respect Mercer's conditions. If K_1(·, ·) and K_2(·, ·) are kernels, then

• a K_1(·, ·) + b K_2(·, ·) for a, b ≥ 0
• K_1(·, ·) K_2(·, ·)

are valid kernels as well (proof available in [35]). These properties allow us to construct composite kernels which can then be useful to improve classification performance (see [4]).

Here we present a list of the most frequently used kernel functions:

• Linear kernel:
  K(x, x_i) = x · x_i   (3.15)

• Polynomial kernel:
  K(x, x_i) = (x · x_i + 1)^p, p ∈ ℕ   (3.16)

• Gaussian RBF kernel:
  K(x, x_i) = \exp\left( −\frac{(x − x_i)^2}{2σ^2} \right), σ ∈ ℝ^+   (3.17)

The first item, the linear kernel, corresponds to the situation where the kernel trick has not been applied, while the second one illustrates the general form (with the degree p as an option) of the polynomial kernel brought into play in subsection 3.2.2. The last kernel mentioned, the Gaussian Radial Basis Function kernel, will be discussed in more detail in the next subsection.

It is interesting to point out that the choice of the kernel type also allows the user to control the complexity of the model (bound on the risk (2.6)), since the VC-dimension h also varies according to the feature space into which the inputs are mapped. In fact, the linear separation performed by the SVM algorithm is executed in the N-dimensional feature space, resulting in a value of h = N + 1 (see subsection 2.2.2).
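To make the pieces above concrete, the following is a minimal MATLAB sketch of how the kernelized soft-margin dual — the objective (3.8) with the dot product replaced by the Gaussian RBF kernel (3.17), under the box constraint 0 ≤ α_i ≤ C of subsection 3.1.4 — can be solved numerically. It assumes the Optimization Toolbox function quadprog is available; all variable names (Xtrain, ytrain, Xnew, sigma, C) are illustrative, and this is a sketch rather than the implementation used in this thesis.

% Minimal sketch: train a soft-margin SVM with the Gaussian RBF kernel by
% solving the dual as a quadratic program (requires quadprog).
L = size(Xtrain, 1);

% Gram matrix of the Gaussian RBF kernel (3.17)
sqd = sum(Xtrain.^2, 2);
D2  = bsxfun(@plus, sqd, sqd') - 2 * (Xtrain * Xtrain');   % squared Euclidean distances
K   = exp(-D2 / (2 * sigma^2));

% numerical check of Mercer's condition: the Gram matrix must be positive semidefinite
assert(min(eig((K + K') / 2)) > -1e-8);

% Dual problem: maximize sum(alpha) - 1/2 * alpha' * H * alpha, with
% H_ij = y_i * y_j * K(x_i, x_j), subject to sum(alpha_i * y_i) = 0 and 0 <= alpha_i <= C,
% i.e. minimize 1/2 * alpha' * H * alpha - sum(alpha) for quadprog.
H     = (ytrain * ytrain') .* K;
f     = -ones(L, 1);
alpha = quadprog(H, f, [], [], ytrain', 0, zeros(L, 1), C * ones(L, 1));

% support vectors carry a non-zero alpha; b is obtained from an unbounded SV
sv = alpha > 1e-6;
on = find(sv & (alpha < C - 1e-6), 1);     % assumes at least one SV strictly inside the box
b  = ytrain(on) - sum(alpha(sv) .* ytrain(sv) .* K(sv, on));

% decision function (3.12) evaluated on new inputs Xnew (M x N)
Dn2   = bsxfun(@plus, sum(Xnew.^2, 2), sqd') - 2 * (Xnew * Xtrain');
Knew  = exp(-Dn2 / (2 * sigma^2));
fnew  = Knew(:, sv) * (alpha(sv) .* ytrain(sv)) + b;
ypred = sign(fnew);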
3.2.4 Details on the Gaussian RBF kernel

Among the available kernel functions, a user's choice often falls on the well-known Gaussian RBF kernel due to the simple geometrical interpretation it offers.

As one can see from formula (3.17), the numerator of the argument of the exponential function is nothing but a dissimilarity measure between the vector x and the vector x_i. In fact, (x − x_i)^2 = ‖x − x_i‖^2 is the squared Euclidean distance between the examples computed in the input space. By taking the exponential of its negative value, one assigns a large value only to close samples. One will notice an exponential decrease starting from a summit of 1, the output value when evaluating two identical vectors. Since the outputs K(·, ·) are included in (3.12), they weight the training samples x_i in the sum over all the L labeled instances. The labels y_i (values of +1 or −1) associated with the inputs will then have different influences in the final decision function yielding a class membership for the new data point x.

Moreover, the parameter σ, appearing in the denominator, controls the bandwidth of the Gaussian surface centered on the vector x, the object of the prediction. Figure 3.4 shows how the weights vary according to the kernel width σ, illustrating the smoothing effect of a large value. In fact, a small bandwidth lets only training vectors x_i close to x in the input space contribute significantly to the final decision function.

Figure 3.4: The Gaussian RBF kernel function K(x, x_i) with x = (0, 0) and x_i = (x_{i,1}, x_{i,2}) for a varying x_i, plotted for bandwidths σ = 0.5 and σ = 1.

A peculiarity of this kernel is that, contrary to the other two presented here, the similarity between the input vector x and the training inputs x_i is measured as a Euclidean distance
and not in terms of an angle in the input space. The latter is the case when one evaluates the dot product (linear or polynomial kernels), which can geometrically be interpreted as the cosine of the angle between the 2 vectors.

3.3 Parameters tuning

As stated in subsection 2.2.1, the parameters (usually also referred to as hyper-parameters) defining a good model have to be chosen through the assessment of the quality of the predictions on an independent data set.

Often, when approaching such a task, a cross-validation approach is chosen. This procedure, precisely named leave-k-out cross-validation, consists in training the model on all the points of the training set except for a subset formed by k randomly chosen vectors. A prediction of the output is then carried out for these points, allowing, in the case of classification, a comparison of the labels. The procedure is repeated until each training vector has been provisionally removed from the main set (partitioned L/k times).

However, this procedure requires a computationally intensive effort when working with SVMs. In fact, such an algorithm usually requires the classifier to be retrained each time a new subset of points is left out. This makes the approach poorly suited to large data sets. Consequently, as pointed out in section 2.3, the initial training set may be split in two parts so that a validation set can then be used to evaluate the predictions based on the learned input/output relationships. A popular validation set/training set partition is 25%/75% of the original training data.

The hyper-parameters of an SVM model that have to be tuned are the cost C and the kernel parameters (σ for the Gaussian RBF kernel, p for the polynomial kernel, etc.). Because no direct analytic function links the variations of these parameters to the changes in the chosen performance measure, a grid search approach must be chosen. In the case of the Gaussian RBF kernel, this means that a measure like the classification error (the wide range of performance measures is described in section 3.4) is computed for a set of different values of C and σ spanning a user-defined space. We then look for the values optimizing the classification performance (lowest validation error, highest accuracy, etc.). A minimal validation error should correspond to a low percentage of SVs. Indeed, too many SVs being identified after the training procedure is a warning sign of overfitting caused by an overly complex model (a too small bandwidth σ is one example).

In some particular cases where class counts are unbalanced (usually many more negative examples than positive ones), it is possible that the SVM decision function threshold f(x) = 0 (see subsection 3.1.2) is not the optimal one. In such situations, a threshold tuning can be carried out as well. This can result in a significant improvement of the classifier performance in terms of the selected performance measure score. However, if satisfactory results are not obtained in this manner, an additional effort is required in order to deal with such a nonstandard situation. A well-suited procedure to apply in these cases is presented in [22]: the authors propose a modification of the cost function of the SVM.
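A minimal sketch of such a grid search over (C, σ) on a held-out validation set is given below; train_rbf_svm and predict_rbf_svm are hypothetical helper functions (for instance wrapping the quadprog-based training sketched in section 3.2), and the grid ranges are purely illustrative.

% Grid search over (C, sigma) using a validation set (see section 2.3 for the split).
Cgrid     = 10.^(-1:3);          % illustrative logarithmic grids
sigmaGrid = 2.^(-2:4);

bestErr = Inf;
for C = Cgrid
    for sigma = sigmaGrid
        model = train_rbf_svm(Xtrain, ytrain, C, sigma);    % hypothetical helper
        ypred = predict_rbf_svm(model, Xval);                % hypothetical helper
        err   = mean(ypred ~= yval);                         % validation error
        if err < bestErr
            bestErr = err;  bestC = C;  bestSigma = sigma;
        end
    end
end
fprintf('best C = %g, best sigma = %g, validation error = %.3f\n', bestC, bestSigma, bestErr);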
3.4 Binary classification quality measures

In assessing the classification performance of a supervised learning model, particularly when dealing with a binary classification task, a broad range of categorical statistics is available (see [10]). This section will describe the main measures currently used for the case where the model predictions are related to the forecasting of an event (occurrence vs. non-occurrence). Prediction results may be organized into the 2-by-2 confusion matrix illustrated by table 3.1.

                              Predicted (Forecast)
                              Class +1 (Yes)   Class −1 (No)       row totals
Actual      Class +1 (Yes)    hits             misses              observed yes
(Observed)  Class −1 (No)     false alarms     correct negatives   observed no
            column totals     forecast yes     forecast no         total

Table 3.1: Confusion matrix for binary predictions related to the forecasting of an event.

Given that an event may either be observed or not, and then either forecast or not by the model, 4 possible situations can be encountered: the observed event can be correctly forecast (hit or true positive) or not detected (miss or false negative), while a non-event can be incorrectly forecast (false alarm or false positive) or correctly not notified (correct negative or true negative).

The following ratios are then often used:

True Positive rate (hit rate) = \frac{hits}{hits + misses}   (3.18)

False Positive rate (false alarm rate) = \frac{false alarms}{false alarms + correct negatives}   (3.19)

As overall model success measures we can find:

Overall Accuracy = \frac{hits + correct negatives}{total}   (3.20)

Hanssen and Kuipers discriminant = TP rate − FP rate   (3.21)

Heidke Skill Score = \frac{hits + correct negatives − exp. correct}{total − exp. correct}   (3.22)

Bias = \frac{hits + false alarms}{hits + misses} = \frac{forecast yes}{observed yes},   (3.23)

where “exp. correct” is the expected number of correct forecasts due to random chance. This value is computed, under the assumption of independence between the actual and predicted classes, from the marginal sums as

exp. correct = \frac{forecast yes · observed yes + forecast no · observed no}{total}   (3.24)
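These statistics follow directly from the four cells of table 3.1; the short MATLAB sketch below computes them (the function and field names are illustrative).

function s = forecast_scores(hits, misses, falseAlarms, correctNegatives)
% Categorical statistics of section 3.4, computed from the cells of table 3.1.
total       = hits + misses + falseAlarms + correctNegatives;
observedYes = hits + misses;
observedNo  = falseAlarms + correctNegatives;
forecastYes = hits + falseAlarms;
forecastNo  = misses + correctNegatives;

s.TPrate = hits / observedYes;                      % hit rate (3.18)
s.FPrate = falseAlarms / observedNo;                % false alarm rate (3.19)
s.OA     = (hits + correctNegatives) / total;       % Overall Accuracy (3.20)
s.HK     = s.TPrate - s.FPrate;                     % Hanssen and Kuipers discriminant (3.21)

expCorrect = (forecastYes * observedYes + forecastNo * observedNo) / total;   % (3.24)
s.HSS    = (hits + correctNegatives - expCorrect) / (total - expCorrect);     % Heidke Skill Score (3.22)
s.Bias   = forecastYes / observedYes;               % Bias (3.23)
end

For instance, with round numbers close to those reported for the Cornice batch test in subsection 1.2.2 (about 200 hits, 60 misses and 115 false alarms, hence roughly 630 correct negatives out of 1005 days), forecast_scores(200, 60, 115, 630) returns OA ≈ 0.83 and HK ≈ 0.61, consistent with the scores quoted there.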
The first measure, the Overall Accuracy (OA, range: 0 → 1, perfect score: 1), reports the number of correct predictions over the total number of points, suggesting whether the model's overall performance is reliable. It becomes a poor statistic if correct negatives (many non-events) are predominant, since classifying every instance in the class −1 will then lead to good scores.

The Hanssen and Kuipers discriminant (HK, range: −1 → 1, perfect score: 1) subtracts the false alarm rate from the hit rate, indicating the capacity of the current forecasting system for discriminating between events and non-events. Therefore, when non-events are the norm, this measure is very suitable because the number of false alarms has less of an influence on the model performance assessment. Instead, a higher importance is given to missed events (appearing in the denominator of (3.18)). This takes on additional importance in cases where the 2 types of errors have different costs (e.g. avalanche forecasting): false alarms are usually less damaging than misses.

The Heidke Skill Score (HSS, range: −∞ → 1, perfect score: 1) measures the fraction of correct predictions after removing those forecasts which would be correct purely due to random chance. The last measure, the Bias (range: 0 → ∞, perfect score: 1), does not really indicate the classification success but informs about over- or under-forecasting, with values tending to ∞ for over-forecasting and to 0 for under-forecasting.

When the selected classifier makes class membership decisions depending on scores that can be interpreted as the degree to which an example is reasonably a class member (SVMs, neural networks, etc.), some interesting graphs involving the cited performance measures can additionally be plotted. In fact, the binary classification is executed according to a defined threshold, resulting in a positive class label if the score is above the threshold t (f(x) > t), or in a negative one if the value is lower than t (f(x) < t). The first insight is to graphically see how the model success measure changes when the class boundary varies, usually by plotting the curve constructed with the points (t, measure).

Moreover, as thoroughly detailed in [12], a Receiver Operating Characteristics (ROC) curve can be built. Such a plot is a 2-dimensional graph with the FP rate on the horizontal axis and the TP rate on the vertical axis. It efficiently represents the trade-off between the costs and benefits of the actual classification. In these terms, if we compute the two mentioned rates for the classifications obtained with thresholds varying from their minimal to maximal values, we will be able to plot a point (FP rate, TP rate) associated with each selected threshold. Figure 3.5 shows 2 possible curves in the ROC space. The curve labeled “B” is associated with a much better performing model compared to the model that produced curve “A”. The reason is that, no matter which threshold is retained at meaningful FP rates, the resulting curve lies more to the “northwest”, meaning that classifier “B” produces higher TP rates combined with lower FP rates than model “A”. As a matter of comparison, the line joining the points (0, 0) and (1, 1) corresponds to the strategy of randomly guessing a class label for every given instance to classify (if one tries to get more hits by forecasting more positive labels, the number of false alarms also increases).
Figure 3.5: ROC curves associated with 2 different classifiers (“A” and “B”). After [12].

When looking for the best possible classification, i.e. a point in the ROC space, we might take the Hanssen and Kuipers discriminant as a model success measure, since this statistic is nothing but the difference between the vertical and horizontal axis coordinates, yielding the highest value for the point (0, 1). However, when comparing two systems in an overall sense, the “area under the curve” measure is a better indicator of the average performance of the classifier over all possible threshold choices (see [12]).
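A minimal sketch of this construction is given below: the decision scores f(x) are swept with a decreasing threshold, each threshold yields one (FP rate, TP rate) point, and the area under the resulting curve is approximated with the trapezoidal rule. The helper name and the toy scores are illustrative only.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold t over the observed scores f(x) and
    return the (FP rate, TP rate) pairs that trace the ROC curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)                   # +1 / -1 class labels
    thresholds = np.sort(np.unique(scores))[::-1]
    points = []
    for t in thresholds:
        predicted_pos = scores >= t
        hits = np.sum(predicted_pos & (labels == 1))
        misses = np.sum(~predicted_pos & (labels == 1))
        false_alarms = np.sum(predicted_pos & (labels == -1))
        correct_neg = np.sum(~predicted_pos & (labels == -1))
        points.append((false_alarms / (false_alarms + correct_neg),   # FP rate
                       hits / (hits + misses)))                       # TP rate
    return points

# Toy example: approximate the area under the curve with the trapezoidal rule.
pts = sorted(roc_points([0.9, 0.4, -0.2, -0.8, 0.1], [1, 1, -1, -1, 1]))
fpr, tpr = zip(*pts)
auc = np.trapz(tpr, fpr)
```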
Chapter 4

Extensions of the SVMs-based approach

4.1 Feature selection

Feature selection methods provide the classifier with a smaller subset of variables created out of the initial set, so that it can work in a lower-dimensional input space with only the relevant features. This often improves the classification accuracy, since noisy and redundant features are filtered out. Moreover, the application of this kind of algorithm provides the analyst with meaningful information about the real influence or utility of each input feature used in the classification problem.

4.1.1 Methods overview

Many methods have been proposed to select the best features or to reduce the input space dimensionality. They have been reviewed in [15]. The techniques can be divided into categories according to the manner in which they deal with the variables.

Methods such as Principal Component Analysis linearly combine the original features to create new ones. The result is a set of uncorrelated orthogonal variables carrying a decreasing amount of information (variance). The user may then select only the largest-variance components for the classification task, which helps to avoid overfitting. However, no individual feature can be discarded, since they are all included in the creation of the new set.

The second category contains techniques that consider each initial feature independently, without taking into account the mutual information between them. Feature ranking with correlation coefficients, a simple method described in [16], belongs to this kind of approach; a minimal sketch of such a univariate ranking is given at the end of this overview.

Finally, one finds the best performing methods, which take all the input variables into account simultaneously during the ranking/selection process. This simultaneous consideration of the input variables results in a selection that is much more appropriate when the chosen classifier is a “multivariate” one (SVMs, Fisher’s linear discriminant, etc.). One such method is Recursive Feature Elimination, which is explored in more detail in the next subsection.
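The following sketch illustrates the univariate idea mentioned above: each feature is scored by its absolute Pearson correlation with the class labels, independently of all the others. The exact coefficient used in [16] may differ; the function name is an assumption.

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank features by the absolute Pearson correlation between each single
    feature and the class labels, treating every variable independently."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, k], y)[0, 1]) for k in range(X.shape[1])]
    return np.argsort(scores)[::-1]      # most correlated feature first
```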
4.1.2 SVM-Recursive Feature Elimination

In [16], Guyon et al. discuss the use of the feature ranking coefficients (provided by each discussed method) as weights in the linear decision function f(x) = \mathbf{w} \cdot \mathbf{x} + b, where \mathbf{w} is the vector of feature weights, \mathbf{x} is the input and b is a bias value. The inverse reasoning can be applied as well: the variable weights multiplying the related inputs can be used as coefficients reporting the relevance of each feature. This latter consideration is exactly the motivation that justifies the Recursive Feature Elimination (RFE) procedure combined with an SVM classifier. The RFE technique belongs to the broader category of methods named wrappers (which select the best features according to an assessment criterion related to the classifier), as opposed to those named filters (which select the best features according to a criterion independent of the classifier).

The details of the RFE algorithm differ when using a linear SVM or a non-linear one. In the following we will first treat the linear case, while the generalization to an SVM classifier using a kernel expansion will be discussed as the last topic of this subsection.

The linearly separable case

The algorithm for the linear case can be summarized as follows:

Inputs: training samples with known class labels (x_i, y_i)
repeat until every feature k has been removed
  – train the linear SVM and compute the weight vector \mathbf{w} = \sum_i \alpha_i y_i x_i (one component per variable)
  – obtain the ranking criterion for each feature k as c_k = (w_k)^2
  – find the feature with the lowest value of c
  – remove the corresponding feature values from the training data
  – update the final ranking list
end repeat
Output: ranked feature list (first removed → least relevant)

The interpretation of this procedure is that at every step of the algorithm, after having trained the SVM, the least influential feature is removed. It is worth remarking that a ranking list is already obtained after the first iteration by sorting the coefficients c_k in decreasing order. Nevertheless, the interest of this feature selection method is that, by running the whole RFE procedure, an optimal subset of complementary features is found, which may not contain the most individually relevant ones [16].
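A minimal sketch of this linear loop is given below, assuming a scikit-learn linear SVM is available (any linear SVM exposing its weight vector would do); the function name is illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available; any linear SVM with a coef_ attribute works

def svm_rfe_linear(X, y, C=1.0):
    """Rank features by recursively removing the one with the smallest (w_k)^2,
    following the linear SVM-RFE loop described above."""
    remaining = list(range(X.shape[1]))   # indices of the surviving features
    ranking = []                          # filled from least to most relevant
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()             # weight vector w = sum_i alpha_i y_i x_i
        criterion = w ** 2                # ranking criterion c_k = (w_k)^2
        worst = int(np.argmin(criterion)) # least influential surviving feature
        ranking.append(remaining.pop(worst))
    return ranking[::-1]                  # most relevant feature first
```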
Generalization to the non-linear case

When dealing with a non-linear SVM it is impossible to directly compute the components of the vector \mathbf{w}, because the sample x_i, included in a simple dot product in equation (3.9), becomes here the input of the kernel function of (3.12). Therefore, the method consists of looking for the smallest change in the squared length of the vector \mathbf{w} when removing feature k. This value, denoted W(\alpha)^2, is not computed directly as the norm of \mathbf{w}, but as

W(\alpha)^2 = \|\mathbf{w}\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) = \alpha^T H \alpha,   (4.1)

where the \alpha’s (forming the column vector \alpha) are the weights for each training point found after the optimization task, K(x_i, x_j) is the kernel output (scalar) reporting the similarity between the training samples x_i and x_j, and H is the matrix with elements y_i y_j K(x_i, x_j), an extension of the Gram matrix defined in subsection 3.2.3. As proposed by Guyon et al., at each iteration the feature to withdraw according to the final ranking criterion is selected as

f = \arg\min_k \left| W(\alpha)^2 − W_{(−k)}(\alpha)^2 \right|,   (4.2)

where the notation (−k) denotes that the candidate feature k has not been included in the computation of (4.1). Since the norm of the weight vector \mathbf{w} defines the margin (see equation (3.3) on page 13), we select the variable whose removal least changes the distance between the strict class boundaries f(x) = −1, +1.

For computational convenience, at every iteration, when the variable to remove is selected from all those still available, the \alpha’s are left unchanged and only the matrix H is recomputed, with every candidate feature ignored in turn. Moreover, this matrix is computed involving only the support vectors, since only for these examples is \alpha \neq 0. The expedients for the extension of the binary SVM-RFE presented here to the multi-class case can be found in [36].
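The sketch below illustrates one elimination step of this non-linear criterion, assuming a scikit-learn RBF-kernel SVM (whose dual_coef_ attribute stores \alpha_i y_i for the support vectors); the \alpha’s are kept fixed and only the kernel matrix is recomputed with each candidate feature removed in turn. The function name is illustrative, not part of the original method.

```python
import numpy as np
from sklearn.svm import SVC                      # assumed available
from sklearn.metrics.pairwise import rbf_kernel  # Gaussian RBF kernel K(x_i, x_j)

def rfe_step_nonlinear(X, y, gamma=1.0, C=1.0):
    """One RFE step for a non-linear (RBF kernel) SVM: return the index of the
    feature whose removal least changes W(alpha)^2 = alpha^T H alpha (eqs. 4.1-4.2)."""
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
    sv = X[clf.support_]                 # support vectors only (alpha != 0)
    d = clf.dual_coef_.ravel()           # d_i = alpha_i * y_i for each support vector

    def w_squared(Z):
        K = rbf_kernel(Z, Z, gamma=gamma)
        return d @ K @ d                 # alpha^T H alpha, eq. (4.1)

    full = w_squared(sv)
    changes = [abs(full - w_squared(np.delete(sv, k, axis=1)))
               for k in range(X.shape[1])]
    return int(np.argmin(changes))       # feature selected for removal, eq. (4.2)
```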
4.2 Probabilistic SVM output interpretation

4.2.1 Interpretations for decision support

A good model should provide a decision support system with values that can be interpreted in a meaningful way by users, so that appropriate measures may be taken. The classifier presented in this chapter is constructed in such a way that the class membership of a new instance is chosen according to the value taken by the final decision function (3.12). These values can be suitably transformed by post-processing to yield an a posteriori probability (moving from a categorical to a probabilistic forecast). Such probabilities are interpreted as the class membership likelihood of a given example. The method that endows an SVM model with a probabilistic output is presented in detail by Platt in [27]. The following subsections only review the points of this theory which have been used in this thesis.

4.2.2 The sigmoid transform

Applied to such a case, Bayes’ rule allows us to write the posterior probability P(y = 1|f(x)) for a sample x to belong to class +1 as

P(y = 1|f) = \frac{p(f|y = 1) P(y = 1)}{p(f)},   (4.3)

where f is the associated decision function value, p(f) = \sum_{l = −1, +1} p(f|y = l) P(y = l) is its a priori probability, p(f|y = l) is the class-conditional probability of observing the value f and P(y = l) is the prior probability of class l. All of these probabilities can be empirically computed from histogram estimates of the class-conditional densities. This methodology is preferred to a parametric fit of the latter because the popular Gaussian assumption is often violated.

If a scatterplot of P(y = 1|f) versus f is drawn, one obtains a graphical visualization (see figure 4.1) of the positive class membership probabilities conditional on each observed SVM output (decision function f). The goal is to fit an analytically described curve to these plotted points so that, when dealing with a new value f associated with a new sample, we will be able to predict its class +1 likelihood. In particular, it turns out that a sigmoid function of the form

P(y = 1|f) = \frac{1}{1 + \exp(Af + B)}   (4.4)

is, in most cases, very well suited for modeling such a relationship. A and B are the free parameters to tune, with A ∈ R^− (to ensure monotonicity) and B ∈ R.

Figure 4.1: In this example the plus signs indicating posterior class +1 probabilities are extremely well fitted by the tuned sigmoid function. Modified after [27].
4.2.3 Parameters tuning

The method proposed in [27] consists of minimizing the negative log-likelihood of the data set. Every decision function value f_i is associated with the transformed class label of its vector, t_i = (y_i + 1)/2 (t_i = 0 or 1). We thus aim at minimizing

− \sum_i t_i \log(p_i) + (1 − t_i) \log(1 − p_i),   (4.5)

where p_i = \frac{1}{1 + \exp(A f_i + B)}.

One can think of this minimization as a procedure by which one looks for the best function (defined by the parameters A and B) to model the posterior probability of a sample belonging to its actual class, so that the value p_i approaches 0 when the sample is a negative class point (t_i = 0) and approaches 1 when it is a positive class one (t_i = 1). The optimization algorithm proposed in the cited paper is derived from the well-known Levenberg-Marquardt algorithm.

Parameter A controls the slope of the sigmoid, whilst B controls its location along the horizontal axis. If B is equal to 0, there is an exact match between the 0.5 posterior probability and the f = 0 decision function threshold (the sign of f providing the labeling decision). Furthermore, it is important to note that, in order to avoid overfitting, the tuning of the parameters should be carried out on a dataset other than the training set used to produce the predictions: the validation set.
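A minimal sketch of this fit is shown below. It minimizes the negative log-likelihood (4.5) with a generic optimizer rather than the Levenberg-Marquardt variant of [27], and it only initializes A to a negative value instead of enforcing monotonicity as a hard constraint; the function name and the toy decision values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize  # generic optimizer, not the variant used in [27]

def fit_sigmoid(f, y):
    """Fit the parameters A and B of P(y=1|f) = 1 / (1 + exp(A*f + B)) by
    minimizing the negative log-likelihood (4.5) on a held-out validation set."""
    f = np.asarray(f, dtype=float)
    t = (np.asarray(y) + 1) / 2.0                  # transformed labels t_i in {0, 1}
    eps = 1e-12                                    # numerical guard against log(0)

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(nll, x0=np.array([-1.0, 0.0]))  # A started negative for monotonicity
    return res.x                                   # fitted (A, B)

# Usage: decision values from the validation set and their true labels (toy numbers).
A, B = fit_sigmoid(f=[-2.1, -0.7, 0.3, 1.5], y=[-1, -1, 1, 1])
prob_new = 1.0 / (1.0 + np.exp(A * 0.8 + B))       # class +1 probability for f = 0.8
```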
4.3 Active Learning with SVMs

4.3.1 Principles

In the domain of supervised learning, the interest in active learning techniques is due to their ability to provide the classifier with a good, informative subset of training examples. The goal is to let the machine learn the input/output relationships leading to a satisfactory classification performance from the smallest possible number of labeled input vectors.

Initially, the classifier disposes of a Labeled set of training samples (x_i, y_i), referred to as D_L. At the same time, an Unlabeled set of examples D_{UL} (candidate samples \tilde{x} whose class membership y is unknown) is available. At each step of the active learning algorithm, the learning machine should then be able to select from D_{UL}, without knowing the associated class labels, the data point \hat{x} whose addition to D_L, after the identification of its true class label \hat{y}, will lead to the most significant improvement of the classification performance (after retraining on D_L ∪ (\hat{x}, \hat{y})).

Examples of applications of such data-driven techniques in the domain of environmental sciences can be found in the optimization of monitoring networks (soil pollution, radioactivity, etc.) [20], in the reduction of the effort needed to collect ground truth data in remote sensing [37], etc.

4.3.2 Overview of the existing techniques

The active learning field has developed rapidly in recent years and new methods are regularly proposed within the scientific community. In this section, the main SVM-based algorithms are presented. They differ essentially in their sample selection criteria.

The first approach, described in [26], is quite intuitive and consists in looking for the unlabeled examples of D_{UL} located in the proximity of the hyperplane separating the classes. The value of the SVM decision function f for every candidate \tilde{x} is computed using the current training set D_L and then the candidate with the lowest value of |f(x)| or f(x)^2, denoted \hat{x}, is added to D_L. This vector is in fact very likely to become a Support Vector, thus affecting the classification procedure.

Entropy-based query by bagging, a method first proposed in [38] and then comprehensively discussed in [37], takes advantage of the notion of entropy to search for the vectors of D_{UL} whose class membership prediction is the most uncertain (i.e. located closest to the decision boundary, where f(x) tends to 0). Such samples are very informative and will contribute significantly to the set-up of the SVM model. Several SVMs are trained on subsets obtained by bootstrapping from D_L and class labels for every candidate \tilde{x} are predicted. The best candidate, \hat{x}, is selected as the one with the highest entropy computed from the resulting class membership probabilities. Other similar methods use entropy-based indicators, such as the one proposed by Rajan et al. in [31] making use of the Kullback-Leibler divergence, to select the most valuable examples to include in the model.

The last technique briefly illustrated here is thoroughly described in [19]. Kanevski et al. suggest the use of the following algorithm. Successively assign to each candidate \tilde{x} the +1 and −1 class labels y and independently add it to D_L. The SVM is trained on the newly created set and the weights \alpha^+_i and \alpha^−_i, received by the candidate for either labeling, are stored. At this point, a sample importance measure is computed as

\rho(x_i) = \begin{cases} 0, & \text{if } (\alpha^+_i = 0,\ \alpha^−_i = C) \text{ or } (\alpha^+_i = C,\ \alpha^−_i = 0) \\ \dfrac{\alpha^+_i + \alpha^−_i}{2C}, & \text{otherwise.} \end{cases}   (4.6)

One can interpret the first case, where one of the 2 weights is null and the other is equal to C, as the situation where, for one of the labelings, the example not only lies far from the margin region −1 ≤ f(x) ≤ +1 but also lies on the wrong side of the decision boundary (a misclassified atypical example). Hence, such vectors are not points of interest. In the remaining cases we assign a relevance which is the mean value of the \alpha’s scaled by C. This indicator reports the actual average influence of a given sample in the weighted sum defining the SVM output f(x).
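As an illustration, the sketch below implements the first, margin-based selection criterion reviewed above (candidate closest to the separating hyperplane), assuming a scikit-learn SVM; the function name and the surrounding loop are illustrative, not part of the cited methods.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available; any SVM exposing decision_function works

def margin_sampling_step(X_labeled, y_labeled, X_unlabeled, **svm_params):
    """Select the unlabeled candidate closest to the separating hyperplane
    (smallest |f(x)|), i.e. the first selection criterion reviewed above."""
    clf = SVC(**svm_params).fit(X_labeled, y_labeled)
    scores = np.abs(clf.decision_function(X_unlabeled))
    return int(np.argmin(scores))   # index of the candidate to be labeled and added to D_L

# Sketch of the active learning loop: query an oracle (e.g. a human expert) for the true
# label of the selected sample, move it from D_UL to D_L and retrain the SVM.
```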
Chapter 5

Avalanche forecasting as a spatio-temporal classification problem

5.1 Avalanche data from Scotland: the Lochaber region case study

The case study at the core of this thesis, which will be illustrated in the next chapters, deals with the integration of spatial information into a temporal avalanche forecasting model based on SVMs (see [29]) for the Lochaber region, Scotland (map in figure 5.1). The goal is therefore to produce spatially varying avalanche forecasts at the level of the single avalanche paths existing in this renowned mountaineering area. The latter is one of the 5 ski venues in Scotland for which forecasting is carried out on a daily basis during the winter season. The region includes Ben Nevis, the highest mountain in the UK, whose summit reaches 1344 m above sea level. Additional detailed information about avalanche forecasting in Scotland can be found on the official website of the sportScotland Avalanche Information Service, www.sais.gov.uk.

In the considered area, a nearest neighbor model called Cornice is used operationally to assist in forecasting the avalanche days [30]. This system has been briefly presented in subsection 1.2.2, where the prior work carried out on the Lochaber dataset has been reviewed.

Current weather and snowpack conditions are described by a set of 9 variables that are measured or estimated by local forecasters on the slopes or that are registered by an automatic weather station (AWS). The list of the available variables is presented in table 5.1, along with the class of factors each variable falls under. As described in [24], the 3 classes of factors influencing an avalanche release, in order of decreasing influence, are: Class I - stability factors, Class II - snowpack factors, and Class III - meteorological factors. None of these variables belongs to the group of factors most directly related to avalanches, Class I. Stability factors are measured in some cases via stability tests, ski-triggering, etc. by the avalanche experts when executing snow profiles at the pit site.
However, it is difficult to include the information from these tests in the model in terms of variables that can be automatically compared when building the model. The computation of Euclidean distances between all the possible pairs of vectors is not possible if the involved variables are non-numerical or if data are missing for some days. The pit site where the different snowpack factors are measured is chosen every day at a different location by the forecasters, based on their experience. Such a testing place, usually located on the critical slopes of one of the gullies, is assumed to be representative of the average conditions that can be found in the entire region.

Figure 5.1: Northern part of the UK: the Lochaber region is shown labeled with a red balloon.

Variable            Class   Description
Snow index          III     Ordinal index of the precipitation as fresh snow on a day, estimated by the forecasters in the field
Rain at 900 m       III     Binary variable indicating whether rain is falling at 900 m, the altitude of the AWS (“1” if so, “0” otherwise)
Snow drift          II      Binary variable taking the value “1” when experts observe snow drifting during the observation period (“0” otherwise)
Air temp            III     Midday air temperature at the AWS, measured in ◦C
Wind speed          III     24-hour vector mean wind speed from the AWS, reported in m/s
Wind direction      III     24-hour vector mean wind direction from the AWS, reported in ◦
Cloud cover         III     Cloud cover as a percentage of the sky
Foot penetration    II      Penetration of the foot into the snow, measured in cm at the pit site by the forecasters
Snow temperature    II      Snow temperature at a depth of 10 cm at the pit site, measured in ◦C

Table 5.1: List of the 9 meteorological and snowpack variables recorded daily in the winter season.

The 9 variables have been recorded every day of the winter season (roughly 4 months per year) since 1991, as have the avalanche events observed in the region. Such occurrences are documented with a description of the release type (natural or triggered by mountaineers, dry or wet snow, cornice triggered, etc.), notes about injuries or specific conditions related to the event, and spatial information about the location (easting and northing), altitude, slope and aspect.
Only the location coordinates were known for every case, so in order to characterize the events uniformly we had to resort to a Digital Elevation Model (DEM) with 10 meters resolution to obtain the complete set of spatial inputs needed (elevation, slope, aspect). This procedure was also necessary because of typing errors and some subjective, imprecise judgments in the records. The hillshade showing the relief is derived from the elevation grid and is presented in figure 5.2, along with the location of the recorded avalanche events falling on the DEM surface.

Data from the 1991–2007 period were available for this study: information about 688 avalanche events that occurred in 47 different avalanche paths was used. The subset of these avalanche paths located in the area covered by the DEM grid (40 gullies) is reported in figure 5.3.

Figure 5.2: Locations of the 593 documented avalanche cases that occurred in the DEM-covered area

Out of the 593 avalanche events falling on the DEM surface, 224 (37.8%) have been observed in the Ben Nevis sector (cluster of points in the south-western part of map 5.2), 347 (58.5%) occurred in the Aonach Mor range (eastern part of the Lochaber region), while 22 of them (3.7%) took place on the slopes of the Carn Mor Dearg range (center of the map). It is, however, essential to remark that these events are mainly documented by the avalanche experts of the region during their daily outdoor activity and by climbers or mountaineers assumed to be reliable witnesses of the release (online recording forms on www.sais.gov.uk). Therefore, when working with these avalanche reports one has to keep in mind that the list of events is not at all comprehensive. On bad visibility days, spotting a release is difficult and, since snowfalls are quite often related to such conditions, it is very likely that many avalanches have taken place without being observed, either by forecasters or by mountaineers.
Figure 5.3: Locations of the 40 gullies (avalanche paths) in the DEM-covered area

Furthermore, the reporting of avalanches is done much more thoroughly in the Aonach Mor range because of the easy accessibility of its slopes. In fact, there are several ski runs with associated lifts belonging to the Nevis Range resort.

5.2 Set up of the spatio-temporal classification problem

As we saw in the introductory section 1.2, the temporal forecasting carried out in the Lochaber region is performed by considering days with observed avalanche activity as positive examples (class +1) and safe days with no observed avalanches as negative samples (class −1). Thus, the instances being classified are the days of the winter season.

The described set-up then has to be extended to the case where one is interested not only in correctly forecasting avalanche days but also in predicting the locations of the events. Initially, we assign to the positive class the vectors characterizing (spatial location, weather and snowpack conditions) the observed avalanche events. Details about how the features were built are presented in the next section 5.3. To complete the binary classification problem, a negative class is needed as well. The chosen, intuitive approach is to let the class with the −1 label be composed of all the 47 gullies (actually the 40 covered by DEM information) capable of giving rise to an avalanche release on a safe day. Therefore, for every day of the winter season when all the variables listed in section 5.1 could be measured and the visibility allowed avalanche observations but no event was actually documented in the region, we computed all the features describing the local conditions at each avalanche path. These spatially variable features were then combined with the global ones related to the current safe day, which concern the whole mountain domain. A sketch of how this negative class is assembled is given below.
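The following sketch only illustrates the assembly logic; the container names (gully_features, day_features, events, safe_days) are hypothetical and stand for the DEM-derived local descriptors, the global daily conditions, the observed releases and the observation days without a documented release, respectively.

```python
import itertools

def build_dataset(gully_features, day_features, events, safe_days):
    """Assemble the unbalanced spatio-temporal classification dataset:
    positive vectors for the observed events, negative vectors for every
    gully on every safe day."""
    X, y = [], []
    for day, gully in events:                    # class +1: observed avalanche events
        X.append(day_features[day] + gully_features[gully])
        y.append(+1)
    # class -1: every avalanche path combined with every safe day
    for day, gully in itertools.product(safe_days, gully_features):
        X.append(day_features[day] + gully_features[gully])
        y.append(-1)
    return X, y
```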
In this way, a broad list of negative instances was produced to be given to the learning machine. The purpose was to let the classifier train on a set of critical situations which were close to the “safe/event” decision boundary and likely to cross it under slightly different weather conditions. Finally, this results in a binary classification problem where the vectors to discriminate are the daily avalanche activities of the dangerous paths (gullies) located in the forecast area. This becomes a very unbalanced classification task because, as shown by table 5.2, the negative inputs our model will be given considerably outnumber the positive ones (by a factor of approximately 68 to 1).

              Class +1             Class −1
              Avalanche events     Safe gullies
              667                  45240 = 40 gullies · 1131 safe days

Table 5.2: Dataset positive and negative classes resulting in an unbalanced classification problem.

5.3 Choice and conception of the input features

In order to obtain the desired spatio-temporal forecast, the series of daily measurements of meteorological conditions related to snowpack stability described in section 5.1 have to be combined in a sensible way with the spatial description of the terrain morphology available via the DEM of the region under study. The latter, with its relatively high resolution of 10 meters, provides detailed information about the elevation, slope and aspect of the paths where the avalanche events could happen. This results in a “spatialized” set of local condition features whose values change according to the location of the avalanche release point. Additionally, for some of the temporal variables, information about the avalanching conditions recorded in the previous days was also included (at most the 2 preceding days, because of the rapidly changing weather conditions).

The features, created also taking into account the advice of the avalanche experts of the Lochaber region, are therefore designed to account for the relevant factors influencing avalanche activity. The final input vector comprises 39 features: 22 spatio-temporal features (describing local conditions at a given gully or at the release zone) and 17 temporal features with global validity (the same for all gullies). The complete list, with a brief description of the meaning of each variable, is presented in table 5.3. For a subset of features (names tagged with *), additional details about how a given feature has been created are provided hereafter.

The first type of variables requiring some further explanation are those involving the sine and cosine transforms. These features take as input either the wind direction or the aspect. Since such variables report a direction measured in degrees ranging from 0 to 360, clockwise starting from north, it is clear that they cannot be compared directly by a subtraction when looking for dissimilarities (e.g. with the Gaussian RBF kernel). For example, two slopes with a very low value and a very large value will both be north-facing slopes. We get around this peculiarity by taking a sine transform which will project the directions on the “horizontal”