This document discusses using kernel methods, specifically support vector machines (SVMs), for environmental and geoscience applications. It provides an overview of SVMs, including how they find the optimal separating hyperplane with the maximum margin to perform classification. It discusses how SVMs can handle nonlinear decision boundaries using the kernel trick. The document gives examples of applying SVMs to problems such as porosity mapping, temperature inversion mapping, and avalanche danger forecasting. It demonstrates how SVMs can extract patterns from high-dimensional environmental data and produce predictive spatial models.
1. Kernel Methods (Support Vector Machines) for Environmental and Geo-Sciences
Alexei Pozdnoukhov
Lecturer
National Centre for Geocomputation
National University of Ireland, Maynooth
+353 (0)1 7086146
Alexei.Pozdnoukhov@nuim.ie
3. Learning From Data
• Environmental monitoring
Current rate of data acquisition is about
0.5Tb/day (increasing at 82% per year)
• Remote Sensing Data
NASA holds more than 10Pb of data,
increasing by 10x every 5 years.
ESA data stream is about 0.5Tb/year,
likely to increase by 20x in the next 5 years.
• GIS, DEM
• Sensor Networks
• Field Measurements
8. Curse of Dimensionality
[Figure: high-dimensional inputs combining wireless sensor networks (batteries recharged at the WSN), remote sensing, human activity and geographical information. Need more data?]
9. Detecting Events
Observed environment: a high-dimensional input space. Events: very rare and extreme.
• High-dimensional spaces: risk of overfitting
• Robust to noise in both inputs/outputs
• Non-linear and non-parametric
• Computationally effective for real-time processing and dissemination via location-based services (LBS)
11. Statistical Learning Theory
• Models that can generalise from data
• Good predictive abilities
• Complexity can be controlled
12. Statistical Learning Theory
• Occam’s Razor Principle (14th century)
One should not increase, beyond what is necessary,
the number of entities required to explain anything
• When many solutions are available for a given problem, we
should select the simplest one.
• But what do we mean by simple?
• We will use prior knowledge about the problem to be solved to define
what a simple solution is (an example of a prior: smoothness).
13. Occam’s Razor and Classification
                 Model 1   Model 2   Model 3
Complexity         √√         √         ××
Training error     ××         √         √√
Overall             -         √          -
14. Structural Risk Minimization
• Define a set of learning functions, {S}
• Order it in terms of complexity, {S1, …, SN}
• Select the optimal S*
F = {f(x,α), α∈Λ}
16. Separating Hyperplane
x - input patterns
w - weight vector
b - threshold

fw,b(x) = sign( w ⋅ x + b )
How powerful are linear decision functions?
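A minimal sketch of this linear decision function in NumPy; the weight vector w and threshold b below are illustrative values, not taken from the slides:

```python
import numpy as np

def linear_decision(x, w, b):
    """Linear decision function f_{w,b}(x) = sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Illustrative 2D example (w and b are assumptions, chosen for demonstration)
w = np.array([1.0, -2.0])
b = 0.5
print(linear_decision(np.array([3.0, 1.0]), w, b))   #  1.0
print(linear_decision(np.array([0.0, 1.0]), w, b))   # -1.0
```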
17. VC-dimension in classification
Shattering
• A set of samples is shattered by a class of functions if, for every possible assignment of class labels, some function in the class separates the classes.
[Figure: in the plane, 3 points in general position can always be separated by a line, 4 points cannot.]
The VC-dimension h of the linear decision functions in R^N equals N+1.
That is, the power of linear decision functions is beyond our control…?
18. Support Vector Machine
Decision function is a margin hyperplane (*):

f(x, {w, b}) =  +1,  (w ⋅ x) − b ≥ 1
                −1,  (w ⋅ x) − b ≤ −1

Intuition: a large margin is good.

Lemma: Given that the N-dimensional data {x1, x2, …, xL} lie inside a finite enclosing sphere of radius R, the VC-dimension h of the margin-based decision functions (*) satisfies the inequality

h ≤ min( R²‖w‖², N ) + 1

The complexity (VC-dimension) can be controlled with ‖w‖²!
19. Separating Hyperplane: Max Margin
To maximize the margin ρ, one would like to minimize ‖w‖, or ‖w‖².

fw,b(x) = sign( (w ⋅ x) − b ):   +1 when (w ⋅ x) − b ≥ 1,   −1 when (w ⋅ x) − b ≤ −1
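For reference, the standard relation between the margin and ‖w‖ (not spelled out on the slide): the two margin hyperplanes (w ⋅ x) − b = +1 and (w ⋅ x) − b = −1 lie at a distance ρ = 2/‖w‖ from each other, so maximizing the margin ρ is equivalent to minimizing ‖w‖ (or ‖w‖²).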
20. Optimization Problem, Lagrangian
The primal problem:

min(w,b)  ½ ‖w‖²
subject to  yi (w ⋅ xi + b) ≥ 1,   i = 1, …, L.

The Lagrangian:

Lp = ½ ‖w‖² − ∑i=1..L αi [ yi (w ⋅ xi + b) − 1 ]

Setting the derivatives of Lp with respect to b and w to zero gives

∑i=1..L αi yi = 0,      w = ∑i=1..L αi yi xi

KKT conditions:  αi [ yi (w ⋅ xi + b) − 1 ] = 0  for all i
• αi > 0 - Support Vectors
• αi = 0 - all other samples
21. Optimization Problem: Dual Variables.
LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj (xi ⋅ xj)

subject to  ∑i=1..L αi yi = 0,   αi ≥ 0,   i = 1, …, L.

f(x) = sign( w ⋅ x + b ) = sign( ∑i=1..L αi yi (x ⋅ xi) + b )

• inputs enter only as dot products
• Quadratic Programming
• convex problem, a well-developed theoretical field
• unique solution, good solvers
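As an illustration of the dual solution in practice, here is a minimal sketch with scikit-learn (not the software used in the lecture); the toy data are made up, and `dual_coef_` stores the products αi·yi for the Support Vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.2, 1.8]])
y = np.array([-1, -1, 1, 1])

# Linear kernel; a very large C approximates the hard-margin case
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:", clf.support_vectors_)  # the xi with alpha_i > 0
print("alpha_i * y_i:  ", clf.dual_coef_)        # dual coefficients
print("w:              ", clf.coef_)             # w = sum_i alpha_i yi xi
print("b:              ", clf.intercept_)
print("f([1.5, 1.5]):  ", clf.predict([[1.5, 1.5]]))
```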
22. Soft Margin Hyperplane: allowing for training error

Primal problem (the slack variables ξi measure the training error):

min(w,b,ξ)  ½ ‖w‖² + C ∑i=1..L ξi
subject to  yi (w ⋅ xi + b) ≥ 1 − ξi,   ξi ≥ 0,   i = 1, …, L.

Dual problem:

LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj (xi ⋅ xj)

subject to  ∑i=1..L αi yi = 0,   0 ≤ αi ≤ C,   i = 1, …, L.

C is the regularization parameter: the trade-off between margin maximization and training error.
23. Support Vector Terminology
αi = 0          normal samples
0 < αi < C      Support Vectors (lying on the margin)
αi = C          Support Vectors (untypical or noisy samples)

f(x) = sign( ∑i=1..L αi yi (x ⋅ xi) + b )

C is the regularization parameter: the trade-off between margin maximization and training error.
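A small sketch of this terminology with scikit-learn (illustrative data; in scikit-learn the αi of the Support Vectors can be read off as the absolute values of `dual_coef_`):

```python
import numpy as np
from sklearn.svm import SVC

# Toy overlapping two-class data (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()   # alpha_i of the Support Vectors
on_margin = np.sum(alpha < C - 1e-8)     # 0 < alpha_i < C : SVs on the margin
at_bound = np.sum(alpha >= C - 1e-8)     # alpha_i = C     : untypical or noisy samples
print(len(X) - len(alpha), "normal samples (alpha_i = 0)")
print(on_margin, "Support Vectors on the margin,", at_bound, "at the bound C")
```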
24. Support Vector Algorithm
Kernel Trick
If the data are not linearly separable, they can be projected into a (sufficiently) high-dimensional space, where they are much easier to separate.

Example:  K(x, x′) = (x ⋅ x′)²  corresponds to the explicit mapping

Φ(x) = ( x1²,  √2 x1 x2,  x2² )

x → Φ(x)?  The algorithm was formulated in terms of dot products, so the mapping never needs to be computed explicitly:

x ⋅ x′ → Φ(x) ⋅ Φ(x′)   ⇔   x ⋅ x′ → K(x, x′)

• K is symmetric
• K is positive-definite
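A quick numerical check of this example (with illustrative input values): the kernel evaluated in the input space equals the dot product of the explicitly mapped points.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, xp):
    """The same quantity computed directly in the input space."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # 1.0
print(k(x, xp))                  # 1.0 -- identical, no explicit mapping needed
```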
25. Nonlinear SVM. Kernel trick.
f(x) = w ⋅ x + b    →    f(x) = ∑i=1..L yi αi K(x, xi) + b

Any linear algorithm formulated in terms of dot products of the input data can be turned into a non-linear one using the kernel trick:
• Support Vector Machine
• Kernel Ridge Regression
• Kernel Principal Component Analysis
• Kernel Fisher Discriminant Analysis
• etc.
26. Nonlinear SVM. Kernel types.
• Polynomial kernel:             K(x, y) = (x ⋅ y + 1)^p
• Radial Basis Function kernel:  K(x, y) = exp( −‖x − y‖² / (2σ²) )

f(x) = sign( ∑i∈SV yi αi K(x, xi) + b )
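A minimal sketch of a non-linear SVM with the RBF kernel in scikit-learn (illustrative data; note that scikit-learn parameterizes the kernel as exp(−γ‖x − y‖²), so γ = 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative non-linearly separable data: an inner cluster surrounded by a ring
rng = np.random.RandomState(1)
angles = rng.uniform(0, 2 * np.pi, 100)
ring = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + 0.2 * rng.randn(100, 2)
X = np.vstack([0.3 * rng.randn(100, 2), ring])
y = np.hstack([np.ones(100), -np.ones(100)])

sigma = 0.5
gamma = 1.0 / (2 * sigma**2)        # scikit-learn's gamma for bandwidth sigma
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

print("training accuracy:", clf.score(X, y))
print("K(x0, x1) =", rbf_kernel(X[:1], X[1:2], gamma=gamma)[0, 0])
```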
27. Nonlinear SVM. Optimization problem.
LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj K(xi, xj)

subject to  ∑i=1..L αi yi = 0,   0 ≤ αi ≤ C,   i = 1, …, L.

b = yj − ∑i=1..L yi αi K(xi, xj)   (computed from any Support Vector xj with 0 < αj < C)

f(x) = sign( ∑i∈SV yi αi K(x, xi) + b )

K is positive-definite, so this is still a QP problem, hence a unique solution!
31. SV Porosity Mapping
Data description
• 200 training samples
• 94 validation samples (shown as "+")
• minimum = 0.0, median = 0.515, max = 1.000, mean = 0.53, variance = 0.048

The original continuous data were transformed into 2-class data according to the 0.5 threshold:
If fpor ≥ 0.5, then y = +1
If fpor < 0.5, then y = −1
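The transformation itself is a one-liner; the array name fpor and the values below are placeholders for illustration:

```python
import numpy as np

fpor = np.array([0.12, 0.48, 0.50, 0.73, 0.95])   # continuous porosity values (placeholder)
y = np.where(fpor >= 0.5, 1, -1)                  # 2-class transformation at the 0.5 threshold
print(y)                                          # [-1 -1  1  1  1]
```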
32. SV Porosity Mapping
Data: 2-class transformation
[Map: "•" class "+1" (fpor ≥ 0.5), "o" class "−1" (fpor < 0.5), "+" validation data]
33. SV Porosity Mapping
Data loading
150 training samples
50 testing samples
Prediction Grid
34. SV Porosity Mapping
Hyper-parameters tuning
• The Gaussian RBF kernel is selected:  K(x, x′) = exp( −‖x − x′‖² / (2σ²) )
• Two hyper-parameters: C and σ.
• Grid search: testing-error analysis for every pair of parameters.

The range of σ:
  min(σ) - minimum distance between data samples
  max(σ) - maximum distance between data samples
The range of log(C):
  min(C) - some small value, 1 or less
  max(C) - depends on the data, 1e3-1e6

Start the calculation using the testing data and save the results to a file.
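A minimal sketch of such a grid search with scikit-learn, assuming a training/testing split as described above (the data here are placeholders, and γ = 1/(2σ²) maps the bandwidth σ onto scikit-learn's kernel parameter):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

# Placeholders for the 150 training / 50 testing samples described above
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(150, 2), rng.choice([-1, 1], 150)
X_test, y_test = rng.rand(50, 2), rng.choice([-1, 1], 50)

# Range of sigma: from the minimum to the maximum pairwise distance between samples
d = euclidean_distances(X_train)
sigmas = np.geomspace(d[d > 0].min(), d.max(), 15)
Cs = np.geomspace(1.0, 1e4, 15)                    # range of C: from ~1 up to 1e3-1e6

test_error = np.zeros((len(Cs), len(sigmas)))      # the "testing error surface"
for i, C in enumerate(Cs):
    for j, s in enumerate(sigmas):
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * s**2)).fit(X_train, y_train)
        test_error[i, j] = 1.0 - clf.score(X_test, y_test)

i, j = np.unravel_index(test_error.argmin(), test_error.shape)
print(f"optimal region around C = {Cs[i]:.3g}, sigma = {sigmas[j]:.3g}, "
      f"testing error = {test_error[i, j]:.3f}")
```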
35. SV Porosity Mapping
Hyper-parameters tuning
[Plot: training error surface over log(C) and the Gaussian RBF kernel bandwidth]
The training error increases with the kernel bandwidth and decreases with C.
36. SV Porosity Mapping
Hyper-parameters tuning
[Plot: testing error surface over log(C) and the Gaussian RBF kernel bandwidth]
The surface has a complex structure but, generally, if the range is selected reasonably and the data splitting is correct, there exists a region of minima: the optimal values.
37. SV Porosity Mapping
Hyper-parameters tuning
[Plot: normalized number of Support Vectors over log(C) and the Gaussian RBF kernel bandwidth]
The number of Support Vectors represents the complexity of the model: a more complex model has more SVs.
38. What are the parameters for the final model?
Hyper-parameters selection
[Plots: testing error, training error and normalized number of SVs]
One candidate: C = 3, σ = 0.09
39. What are the parameters for the final model?
Hyper-parameters selection
[Plots: testing error, training error and normalized number of SVs]
Another candidate: C = 18, σ = 0.13
40. SV Porosity Mapping
Dependence on Parameters
[Maps: predictions for fixed C = 10 with kernel bandwidth σ = 0.02, 0.06, 0.1, 0.2, 0.3, 0.4, 0.5]
46. Physical Models at local scales
• Terrain roughness is too high for physical models: computational speed, precision and uncertainty estimation all suffer.
• PDE on a smoothed terrain + empirical corrections:

vModel(x, y) = vPhysical + cRidges + cCanyons + cValleys + cFlatAreas + cSea + …
Can this information be extracted directly from data?
47. Modelling Scheme
[Diagram: modelling scheme. Data (DEM and other features) → Feature Selection/Extraction → Predictive Modelling with Machine Learning (non-linear dependencies; noise, outliers) → Spatio-Temporal Mapping, Analysis, Decision Support]
48. Temperature vs. Elevation
[Plots: temperature vs. elevation at different time aggregations: mean monthly (linear); mean daily (locally linear, regionalized); mean hourly (non-linear, regionalized, explained by temperature inversion)]
49. DEM Features
DEM-derived features: large-scale difference of Gaussians, short-scale difference of Gaussians, slope, local variance.
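A sketch of how such features could be derived from a DEM raster with SciPy; the smoothing scales and window size are assumptions, only the 10 m cell size comes from the DEM description later in the slides:

```python
import numpy as np
from scipy import ndimage

# Placeholder elevation raster; a real DEM would be loaded from file
rng = np.random.RandomState(0)
dem = ndimage.gaussian_filter(100 * rng.rand(200, 200), sigma=5)

def difference_of_gaussians(z, s1, s2):
    """Band-pass terrain feature: difference of two Gaussian smoothings of the DEM."""
    return ndimage.gaussian_filter(z, s1) - ndimage.gaussian_filter(z, s2)

dog_large = difference_of_gaussians(dem, 10, 30)   # large-scale DoG (scales assumed)
dog_short = difference_of_gaussians(dem, 2, 5)     # short-scale DoG (scales assumed)

# Slope from finite differences, using the 10 m cell size of the DEM
dz_dy, dz_dx = np.gradient(dem, 10.0)
slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# Local variance in a moving window (window size assumed)
local_var = ndimage.generic_filter(dem, np.var, size=5)
```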
62. Lochaber, Scotland
• 1842 days of weather-condition records (11 features), 1991-2007
• 1135 days with documented avalanche events
• 797 safe days, 245 with avalanches
• 260 days unknown (mainly bad weather)
63. Spatial Data
Training data: 722 events,
winters 1991-2005
Validation data: 72 events,
winters 2006-2007
• 47 avalanche paths, x, y, z, slope, aspect, date
• DEM, 10m resolution, 5km x 5km
64. Lochaber weather
observations
• Snow index 0-10
• No-settle cumulative Snow over a season
• Rain at 900m binary [0, 1]
• Snow drift binary [0, 1]
• Air temperature -10,… +10
• Wind speed 0, … 25 m/s
• Wind Direction 0o-360o
• Cloudiness [25, 50, 75, 100]
• Foot penetration 0, … 50
• Snow temperature 0, … -10
• Insolation cumulative over season
65. Classification Problem
Each sample is a vector [ Z, Slope, Aspect (S-N and W-E components), spatialized weather features ]:
• label +1: ~720 samples, over all the documented avalanche events
• label −1: ~44000 samples, over all 47 gullies for the documented days without avalanches
Input dimensionality: 4 terrain features + 22 spatialized weather features = 26.
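A rough sketch of how these labelled samples might be assembled; the array names and shapes are assumptions, and the spatialized weather features come from the terrain corrections on the next slides:

```python
import numpy as np

def make_samples(terrain, weather, label):
    """Stack terrain and spatialized weather features into labelled input vectors.

    terrain: (n_paths, 4)  array of [z, slope, aspect_SN, aspect_WE]
    weather: (n_paths, 22) array of spatialized weather features for one day
    label:   +1 for avalanche events, -1 for safe path/day combinations
    """
    X = np.hstack([terrain, weather])     # (n_paths, 26) inputs: 4 + 22 = 26
    y = np.full(len(X), label)
    return X, y

# Illustrative shapes: 47 avalanche paths, 26-dimensional inputs
X_neg, y_neg = make_samples(np.zeros((47, 4)), np.zeros((47, 22)), -1)
print(X_neg.shape, y_neg.shape)           # (47, 26) (47,)
```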
66. Wind Speed and Direction
• Wind speed weighting
• Correction for slope
• Correction for curvature
• Terrain-corrected wind direction
67. Snow accumulation
A simple heuristic based on wind-speed gradients:

If Snow index > 0 and Snow drift = 1:
    Snow accumulation = F(Wind Speed, Wind Direction)
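A sketch of this heuristic in code; the slide does not give the form of F(Wind Speed, Wind Direction), so the expression below is only a placeholder:

```python
import math

def snow_accumulation(snow_index, snow_drift, wind_speed, wind_direction):
    """Heuristic snow-accumulation feature based on wind.

    Placeholder F: scale wind speed by the alignment of the wind with an
    arbitrary reference direction -- for illustration only.
    """
    if snow_index > 0 and snow_drift == 1:
        return wind_speed * abs(math.cos(math.radians(wind_direction)))
    return 0.0

print(snow_accumulation(snow_index=3, snow_drift=1, wind_speed=12.0, wind_direction=315.0))
```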
86. Summary and Conclusions
• Statistical Learning Theory
• Classification Problem
• Support Vector Machines and Kernel Methods
• GeoSpatial Data Classification with SVM
87. Open PhD positions at NCG
Thank you!
Alexei Pozdnoukhov
Alexei.Pozdnoukhov@nuim.ie
SFI (SRC-ID 07/SRC/I1168)