This document discusses using kernel methods, specifically support vector machines (SVMs), for environmental and geoscience applications. It provides an overview of SVMs, including how they find the optimal separating hyperplane with the maximum margin to perform classification. It discusses how SVMs can handle nonlinear decision boundaries using the kernel trick. The document gives examples of applying SVMs to problems such as porosity mapping, temperature inversion mapping, and avalanche danger forecasting. It demonstrates how SVMs can extract patterns from high-dimensional environmental data and produce predictive spatial models.
1. Kernel Methods (Support Vector Machines) for Environmental and Geo-Sciences
Alexei Pozdnoukhov
Lecturer
National Centre for Geocomputation
National University of Ireland, Maynooth
+353 (0)1 7086146
Alexei.Pozdnoukhov@nuim.ie
3. Learning From Data
• Environmental monitoring
Current rate of data acquisition is about
0.5Tb/day (increasing at 82% per year)
• Remote Sensing Data
NASA holds more than 10Pb of data,
increasing by 10x every 5 years.
ESA data stream is about 0.5Tb/year,
likely to increase by 20x in the next 5 years.
• GIS, DEM
• Sensor Networks
• Field Measurements
8. Curse of Dimensionality
[Figure: high-dimensional inputs combining wireless sensor networks (batteries recharged at the WSN), remote sensing, human activity and geographical information. Need more data?]
9. Detecting Events
Observed environment: a high-dimensional input space. Events: very rare and extreme.
• High-dimensional spaces: risk of overfitting
• Robust to noise in both inputs/outputs
• Non-linear and non-parametric
• Computationally effective for real-time processing and dissemination via location-based services (LBS)
11. Statistical Learning Theory
• Models that can generalise from data
• Good predictive abilities
• Complexity can be controlled
12. Statistical Learning Theory
• Occam’s Razor Principle (14th century)
One should not increase, beyond what is necessary,
the number of entities required to explain anything
• When many solutions are available for a given problem, we
should select the simplest one.
• But what do we mean by simple?
• We will use prior knowledge about the problem to be solved to define
what a simple solution is (an example of a prior: smoothness).
13. Occam’s Razor and Classification
                 Model 1   Model 2   Model 3
Complexity         √√         √         ××
Training error     ××         √         √√
Overall             -         √          -
14. Structural Risk Minimization
• Define a set of learning functions, {S}
• Order it in terms of complexity, {S1, …, SN}
• Select the optimal S*
F = {f(x,α), α∈Λ}
16. Separating Hyperplane
x - input patterns
w - weight vector
b - threshold

fw,b(x) = sign( w ⋅ x + b )
How powerful are linear decision functions?
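A minimal sketch of this linear decision function in NumPy; the weight vector w and threshold b below are illustrative values, not taken from the slides:

```python
import numpy as np

def linear_decision(x, w, b):
    """Linear decision function f_{w,b}(x) = sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Illustrative 2D example (w and b are assumptions, chosen for demonstration)
w = np.array([1.0, -2.0])
b = 0.5
print(linear_decision(np.array([3.0, 1.0]), w, b))   #  1.0
print(linear_decision(np.array([0.0, 1.0]), w, b))   # -1.0
```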
17. VC-dimension in classification
Shattering
• A set of samples is shattered by a class of functions if, for every possible assignment of class labels, some function in the class separates the classes.
[Figure: in the plane, 3 points in general position can always be separated by a line, 4 points cannot.]
The VC-dimension h of the linear decision functions in R^N equals N+1.
That is, the power of linear decision functions is beyond our control…?
18. Support Vector Machine
Decision function is a margin hyperplane (*):

f(x, {w, b}) =  +1,  (w ⋅ x) − b ≥ 1
                −1,  (w ⋅ x) − b ≤ −1

Intuition: a large margin is good.

Lemma: Given that the N-dimensional data {x1, x2, …, xL} lie inside a finite enclosing sphere of radius R, the VC-dimension h of the margin-based decision functions (*) satisfies the inequality

h ≤ min( R²‖w‖², N ) + 1

The complexity (VC-dimension) can be controlled with ‖w‖²!
19. Separating Hyperplane: Max Margin
To maximize the margin ρ, one would like to minimize ‖w‖, or ‖w‖².

fw,b(x) = sign( (w ⋅ x) − b ):   +1 when (w ⋅ x) − b ≥ 1,   −1 when (w ⋅ x) − b ≤ −1
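For reference, the standard relation between the margin and ‖w‖ (not spelled out on the slide): the two margin hyperplanes (w ⋅ x) − b = +1 and (w ⋅ x) − b = −1 lie at a distance ρ = 2/‖w‖ from each other, so maximizing the margin ρ is equivalent to minimizing ‖w‖ (or ‖w‖²).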
20. Optimization Problem, Lagrangian
The primal problem:

min(w,b)  ½ ‖w‖²
subject to  yi (w ⋅ xi + b) ≥ 1,   i = 1, …, L.

The Lagrangian:

Lp = ½ ‖w‖² − ∑i=1..L αi [ yi (w ⋅ xi + b) − 1 ]

Setting the derivatives of Lp with respect to b and w to zero gives

∑i=1..L αi yi = 0,      w = ∑i=1..L αi yi xi

KKT conditions:  αi [ yi (w ⋅ xi + b) − 1 ] = 0  for all i
• αi > 0 - Support Vectors
• αi = 0 - all other samples
21. Optimization Problem: Dual Variables.
LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj (xi ⋅ xj)

subject to  ∑i=1..L αi yi = 0,   αi ≥ 0,   i = 1, …, L.

f(x) = sign( w ⋅ x + b ) = sign( ∑i=1..L αi yi (x ⋅ xi) + b )

• inputs enter only as dot products
• Quadratic Programming
• convex problem, a well-developed theoretical field
• unique solution, good solvers
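As an illustration of the dual solution in practice, here is a minimal sketch with scikit-learn (not the software used in the lecture); the toy data are made up, and `dual_coef_` stores the products αi·yi for the Support Vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.2, 1.8]])
y = np.array([-1, -1, 1, 1])

# Linear kernel; a very large C approximates the hard-margin case
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:", clf.support_vectors_)  # the xi with alpha_i > 0
print("alpha_i * y_i:  ", clf.dual_coef_)        # dual coefficients
print("w:              ", clf.coef_)             # w = sum_i alpha_i yi xi
print("b:              ", clf.intercept_)
print("f([1.5, 1.5]):  ", clf.predict([[1.5, 1.5]]))
```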
22. Soft Margin Hyperplane: allowing for training error

Primal problem (the slack variables ξi measure the training error):

min(w,b,ξ)  ½ ‖w‖² + C ∑i=1..L ξi
subject to  yi (w ⋅ xi + b) ≥ 1 − ξi,   ξi ≥ 0,   i = 1, …, L.

Dual problem:

LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj (xi ⋅ xj)

subject to  ∑i=1..L αi yi = 0,   0 ≤ αi ≤ C,   i = 1, …, L.

C is the regularization parameter: the trade-off between margin maximization and training error.
23. Support Vector Terminology
αi = 0          normal samples
0 < αi < C      Support Vectors (lying on the margin)
αi = C          Support Vectors (untypical or noisy samples)

f(x) = sign( ∑i=1..L αi yi (x ⋅ xi) + b )

C is the regularization parameter: the trade-off between margin maximization and training error.
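A small sketch of this terminology with scikit-learn (illustrative data; in scikit-learn the αi of the Support Vectors can be read off as the absolute values of `dual_coef_`):

```python
import numpy as np
from sklearn.svm import SVC

# Toy overlapping two-class data (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()   # alpha_i of the Support Vectors
on_margin = np.sum(alpha < C - 1e-8)     # 0 < alpha_i < C : SVs on the margin
at_bound = np.sum(alpha >= C - 1e-8)     # alpha_i = C     : untypical or noisy samples
print(len(X) - len(alpha), "normal samples (alpha_i = 0)")
print(on_margin, "Support Vectors on the margin,", at_bound, "at the bound C")
```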
24. Support Vector Algorithm
Kernel Trick
If the data are not linearly separable, they can be projected into a (sufficiently) high-dimensional space, where they are much easier to separate.

Example:  K(x, x′) = (x ⋅ x′)²  corresponds to the explicit mapping

Φ(x) = ( x1²,  √2 x1 x2,  x2² )

x → Φ(x)?  The algorithm was formulated in terms of dot products, so the mapping never needs to be computed explicitly:

x ⋅ x′ → Φ(x) ⋅ Φ(x′)   ⇔   x ⋅ x′ → K(x, x′)

• K is symmetric
• K is positive-definite
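A quick numerical check of this example (with illustrative input values): the kernel evaluated in the input space equals the dot product of the explicitly mapped points.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, xp):
    """The same quantity computed directly in the input space."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # 1.0
print(k(x, xp))                  # 1.0 -- identical, no explicit mapping needed
```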
25. Nonlinear SVM. Kernel trick.
f(x) = w ⋅ x + b    →    f(x) = ∑i=1..L yi αi K(x, xi) + b

Any linear algorithm formulated in terms of dot products of the input data can be turned into a non-linear one using the kernel trick:
• Support Vector Machine
• Kernel Ridge Regression
• Kernel Principal Component Analysis
• Kernel Fisher Discriminant Analysis
• etc.
26. Nonlinear SVM. Kernel types.
• Polynomial kernel:             K(x, y) = (x ⋅ y + 1)^p
• Radial Basis Function kernel:  K(x, y) = exp( −‖x − y‖² / (2σ²) )

f(x) = sign( ∑i∈SV yi αi K(x, xi) + b )
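A minimal sketch of a non-linear SVM with the RBF kernel in scikit-learn (illustrative data; note that scikit-learn parameterizes the kernel as exp(−γ‖x − y‖²), so γ = 1/(2σ²)):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Illustrative non-linearly separable data: an inner cluster surrounded by a ring
rng = np.random.RandomState(1)
angles = rng.uniform(0, 2 * np.pi, 100)
ring = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + 0.2 * rng.randn(100, 2)
X = np.vstack([0.3 * rng.randn(100, 2), ring])
y = np.hstack([np.ones(100), -np.ones(100)])

sigma = 0.5
gamma = 1.0 / (2 * sigma**2)        # scikit-learn's gamma for bandwidth sigma
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

print("training accuracy:", clf.score(X, y))
print("K(x0, x1) =", rbf_kernel(X[:1], X[1:2], gamma=gamma)[0, 0])
```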
27. Nonlinear SVM. Optimization problem.
LD = ∑i=1..L αi − ½ ∑i,j=1..L αi αj yi yj K(xi, xj)

subject to  ∑i=1..L αi yi = 0,   0 ≤ αi ≤ C,   i = 1, …, L.

b = yj − ∑i=1..L yi αi K(xi, xj)   (computed from any Support Vector xj with 0 < αj < C)

f(x) = sign( ∑i∈SV yi αi K(x, xi) + b )

K is positive-definite, so this is still a QP problem, hence a unique solution!
31. SV Porosity Mapping
Data description
• 200 training samples
• 94 validation samples (shown as "+")
• minimum = 0.0, median = 0.515, max = 1.000, mean = 0.53, variance = 0.048

The original continuous data were transformed into 2-class data according to the 0.5 threshold:
If fpor ≥ 0.5, then y = +1
If fpor < 0.5, then y = −1
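The transformation itself is a one-liner; the array name fpor and the values below are placeholders for illustration:

```python
import numpy as np

fpor = np.array([0.12, 0.48, 0.50, 0.73, 0.95])   # continuous porosity values (placeholder)
y = np.where(fpor >= 0.5, 1, -1)                  # 2-class transformation at the 0.5 threshold
print(y)                                          # [-1 -1  1  1  1]
```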
32. SV Porosity Mapping
Data: 2-class transformation
[Map: "•" class "+1" (fpor ≥ 0.5), "o" class "−1" (fpor < 0.5), "+" validation data]
33. SV Porosity Mapping
Data loading
150 training samples
50 testing samples
Prediction Grid
34. SV Porosity Mapping
Hyper-parameters tuning
• The Gaussian RBF kernel is selected:  K(x, x′) = exp( −‖x − x′‖² / (2σ²) )
• Two hyper-parameters: C and σ.
• Grid search: testing-error analysis for every pair of parameters.

The range of σ:
  min(σ) - minimum distance between data samples
  max(σ) - maximum distance between data samples
The range of log(C):
  min(C) - some small value, 1 or less
  max(C) - depends on the data, 1e3-1e6

Start the calculation using the testing data and save the results to a file.
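A minimal sketch of such a grid search with scikit-learn, assuming a training/testing split as described above (the data here are placeholders, and γ = 1/(2σ²) maps the bandwidth σ onto scikit-learn's kernel parameter):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

# Placeholders for the 150 training / 50 testing samples described above
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(150, 2), rng.choice([-1, 1], 150)
X_test, y_test = rng.rand(50, 2), rng.choice([-1, 1], 50)

# Range of sigma: from the minimum to the maximum pairwise distance between samples
d = euclidean_distances(X_train)
sigmas = np.geomspace(d[d > 0].min(), d.max(), 15)
Cs = np.geomspace(1.0, 1e4, 15)                    # range of C: from ~1 up to 1e3-1e6

test_error = np.zeros((len(Cs), len(sigmas)))      # the "testing error surface"
for i, C in enumerate(Cs):
    for j, s in enumerate(sigmas):
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * s**2)).fit(X_train, y_train)
        test_error[i, j] = 1.0 - clf.score(X_test, y_test)

i, j = np.unravel_index(test_error.argmin(), test_error.shape)
print(f"optimal region around C = {Cs[i]:.3g}, sigma = {sigmas[j]:.3g}, "
      f"testing error = {test_error[i, j]:.3f}")
```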
35. SV Porosity Mapping
Hyper-parameters tuning
[Plot: training error surface over log(C) and the Gaussian RBF kernel bandwidth]
The training error increases with the kernel bandwidth and decreases with C.
36. SV Porosity Mapping
Hyper-parameters tuning
[Plot: testing error surface over log(C) and the Gaussian RBF kernel bandwidth]
The surface has a complex structure but, generally, if the range is selected reasonably and the data splitting is correct, there exists a region of minima: the optimal values.
37. SV Porosity Mapping
Hyper-parameters tuning
[Plot: normalized number of Support Vectors over log(C) and the Gaussian RBF kernel bandwidth]
The number of Support Vectors represents the complexity of the model: a more complex model has more SVs.
38. What are the parameters for the final model?
Hyper-parameters selection
[Plots: testing error, training error and normalized number of SVs]
One candidate: C = 3, σ = 0.09
39. What are the parameters for the final model?
Hyper-parameters selection
[Plots: testing error, training error and normalized number of SVs]
Another candidate: C = 18, σ = 0.13
40. SV Porosity Mapping
Dependence on Parameters
[Maps: predictions for fixed C = 10 with kernel bandwidth σ = 0.02, 0.06, 0.1, 0.2, 0.3, 0.4, 0.5]
46. Physical Models at local scales
• Terrain roughness is too high for physical models: computational speed, precision and uncertainty estimation all suffer.
• PDE on a smoothed terrain + empirical corrections:

vModel(x, y) = vPhysical + cRidges + cCanyons + cValleys + cFlatAreas + cSea + …
Can this information be extracted directly from data?
47. Modelling Scheme
[Diagram: modelling scheme. Data (DEM and other features) → Feature Selection/Extraction → Predictive Modelling with Machine Learning (non-linear dependencies; noise, outliers) → Spatio-Temporal Mapping, Analysis, Decision Support]
48. Temperature vs. Elevation
[Plots: temperature vs. elevation at different time aggregations: mean monthly (linear); mean daily (locally linear, regionalized); mean hourly (non-linear, regionalized, explained by temperature inversion)]
49. DEM Features
DEM-derived features: large-scale difference of Gaussians, short-scale difference of Gaussians, slope, local variance.
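A sketch of how such features could be derived from a DEM raster with SciPy; the smoothing scales and window size are assumptions, only the 10 m cell size comes from the DEM description later in the slides:

```python
import numpy as np
from scipy import ndimage

# Placeholder elevation raster; a real DEM would be loaded from file
rng = np.random.RandomState(0)
dem = ndimage.gaussian_filter(100 * rng.rand(200, 200), sigma=5)

def difference_of_gaussians(z, s1, s2):
    """Band-pass terrain feature: difference of two Gaussian smoothings of the DEM."""
    return ndimage.gaussian_filter(z, s1) - ndimage.gaussian_filter(z, s2)

dog_large = difference_of_gaussians(dem, 10, 30)   # large-scale DoG (scales assumed)
dog_short = difference_of_gaussians(dem, 2, 5)     # short-scale DoG (scales assumed)

# Slope from finite differences, using the 10 m cell size of the DEM
dz_dy, dz_dx = np.gradient(dem, 10.0)
slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

# Local variance in a moving window (window size assumed)
local_var = ndimage.generic_filter(dem, np.var, size=5)
```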
62. Lochaber, Scotland
• 1842 days of weather-condition records (11 features), 1991-2007
• 1135 days with documented avalanche events
• 797 safe days, 245 with avalanches
• 260 days unknown (mainly bad weather)
63. Spatial Data
Training data: 722 events,
winters 1991-2005
Validation data: 72 events,
winters 2006-2007
• 47 avalanche paths, x, y, z, slope, aspect, date
• DEM, 10m resolution, 5km x 5km
64. Lochaber weather
observations
• Snow index 0-10
• No-settle cumulative Snow over a season
• Rain at 900m binary [0, 1]
• Snow drift binary [0, 1]
• Air temperature -10,… +10
• Wind speed 0, … 25 m/s
• Wind Direction 0o-360o
• Cloudiness [25, 50, 75, 100]
• Foot penetration 0, … 50
• Snow temperature 0, … -10
• Insolation cumulative over season
65. Classification Problem
Each sample is a vector [ Z, Slope, Aspect (S-N and W-E components), spatialized weather features ]:
• label +1: ~720 samples, over all the documented avalanche events
• label −1: ~44000 samples, over all 47 gullies for the documented days without avalanches
Input dimensionality: 4 terrain features + 22 spatialized weather features = 26.
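A rough sketch of how these labelled samples might be assembled; the array names and shapes are assumptions, and the spatialized weather features come from the terrain corrections on the next slides:

```python
import numpy as np

def make_samples(terrain, weather, label):
    """Stack terrain and spatialized weather features into labelled input vectors.

    terrain: (n_paths, 4)  array of [z, slope, aspect_SN, aspect_WE]
    weather: (n_paths, 22) array of spatialized weather features for one day
    label:   +1 for avalanche events, -1 for safe path/day combinations
    """
    X = np.hstack([terrain, weather])     # (n_paths, 26) inputs: 4 + 22 = 26
    y = np.full(len(X), label)
    return X, y

# Illustrative shapes: 47 avalanche paths, 26-dimensional inputs
X_neg, y_neg = make_samples(np.zeros((47, 4)), np.zeros((47, 22)), -1)
print(X_neg.shape, y_neg.shape)           # (47, 26) (47,)
```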
66. Wind Speed and Direction
• Wind speed weighting
• Correction for slope
• Correction for curvature
• Terrain-corrected wind direction
67. Snow accumulation
A simple heuristic based on wind-speed gradients:

If Snow index > 0 and Snow drift = 1:
    Snow accumulation = F(Wind Speed, Wind Direction)
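A sketch of this heuristic in code; the slide does not give the form of F(Wind Speed, Wind Direction), so the expression below is only a placeholder:

```python
import math

def snow_accumulation(snow_index, snow_drift, wind_speed, wind_direction):
    """Heuristic snow-accumulation feature based on wind.

    Placeholder F: scale wind speed by the alignment of the wind with an
    arbitrary reference direction -- for illustration only.
    """
    if snow_index > 0 and snow_drift == 1:
        return wind_speed * abs(math.cos(math.radians(wind_direction)))
    return 0.0

print(snow_accumulation(snow_index=3, snow_drift=1, wind_speed=12.0, wind_direction=315.0))
```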
86. Summary and Conclusions
• Statistical Learning Theory
• Classification Problem
• Support Vector Machines and Kernel Methods
• GeoSpatial Data Classification with SVM
87. Open PhD positions at NCG
Thank you!
Alexei Pozdnoukhov
Alexei.Pozdnoukhov@nuim.ie
SFI (SRC-ID 07/SRC/I1168)