A comparison of learning methods to predict N2O fluxes and N leaching

Nathalie Villa-Vialaneix
nathalie.villa@toulouse.inra.fr
http://www.nathalievilla.org
Workshop on N2O meta-modelling, March 9th, 2015
INRA, Toulouse

Outline
1 DNDC-Europe model description
2 Methodology
  Underfitting / Overfitting
  Consistency
  Problem at stake
3 Presentation of the different methods
4 Results


General overview
Modern issues in agriculture:
fight against the food crisis;
while preserving the environment.
The EC needs simulation tools to:
link direct aid to compliance with standards that ensure proper management;
quantify the environmental impact of European policies ("Cross Compliance").

Cross Compliance Assessment Tool
DNDC is a biogeochemical model.

Zoom on DNDC-EUROPE

Moving from DNDC-Europe to metamodeling
Needs for metamodeling:
easier integration into CCAT;
faster execution and more responsive scenario analysis.

Data [Villa-Vialaneix et al., 2012]
Data extracted from the biogeochemical simulator DNDC-EUROPE: about 19 000 HSMU (Homogeneous Soil Mapping Units, about 1 km² each, although the area varies considerably) used for corn cultivation:
corn corresponds to 4.6% of UAA (utilised agricultural area);
HSMU for which at least 10% of the agricultural land was used for corn were selected.

Data extracted from the biogeochemical simulator DNDC-EUROPE:
11 input (explanatory) variables (selected by experts and previous simulations)
N_FR (N input through fertilization; kg/ha/y);
N_MR (N input through manure spreading; kg/ha/y);
Nfix (N input from biological fixation; kg/ha/y);
Nres (N input from root residue; kg/ha/y);
BD (Bulk Density; g/cm³);
SOC (Soil organic carbon in topsoil; mass fraction);
PH (Soil pH);
Clay (Ratio of soil clay content);
Rain (Annual precipitation; mm/y);
Tmean (Annual mean temperature; °C);
Nr (Concentration of N in rain; ppm).

Data extracted from the biogeochemical simulator DNDC-EUROPE:
2 outputs to be estimated (independently) from the inputs:
N2O fluxes (greenhouse gas);
N leaching (one major cause of water pollution).


Regression
Consider the problem where:
$Y \in \mathbb{R}$ has to be estimated from $X \in \mathbb{R}^d$;
we are given a learning set, i.e., $n$ i.i.d. observations of $(X, Y)$: $(x_1, y_1), \ldots, (x_n, y_n)$.
Example: Predict N2O fluxes from pH, climate, concentration of N in rain, fertilization for a large number of HSMU...

Basics
From the observations $(x_i, y_i)_i$, a machine $\Phi^n$ is defined such that:
$\hat{y}_{\mathrm{new}} = \Phi^n(x_{\mathrm{new}})$.
if $Y$ is numeric, $\Phi^n$ is called a regression function;
if $Y$ is a factor, $\Phi^n$ is called a classifier;
$\Phi^n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.
Desirable properties
accuracy to the observations: predictions made on known data are close to observed values;
generalization ability: predictions made on new data are also accurate.
Conflicting objectives!! [Vapnik, 1995]

Underfitting/Overfitting
[Figure: function x → y to be estimated]
[Figure: observations we might have]
[Figure: observations we do have]
[Figure: first estimation from the observations: underfitting]
[Figure: second estimation from the observations: accurate estimation]
[Figure: third estimation from the observations: overfitting]
[Figure: summary]

Errors
training error (measures the accuracy to the observations)
if $y$ is a factor: misclassification rate
$\frac{|\{i = 1, \ldots, n : \hat{y}_i \neq y_i\}|}{n}$
if $y$ is numeric: mean square error (MSE)
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
or root mean square error (RMSE) or pseudo-$R^2$: $1 - \mathrm{MSE} / \mathrm{Var}((y_i)_i)$
test error: a way to prevent overfitting (estimates the generalization error) is the simple validation
1 split the data into training/test sets (usually 80%/20%)
2 train $\Phi^n$ from the training dataset
3 calculate the test error from the remaining data
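
The error measures and the three steps of simple validation can be sketched as follows. This is only an illustration: the helper names (`simple_validation`, `fit_ols`, `predict_ols`), the synthetic data and the use of ordinary least squares as the trained machine are assumptions made for the example, not the setup used in the study.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean square error: (1/n) * sum of squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_pred - y_true) ** 2))


def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(mse(y_true, y_pred)))


def pseudo_r2(y_true, y_pred):
    """Pseudo-R2 = 1 - MSE / Var(y), with the variance using a 1/n factor."""
    return 1.0 - mse(y_true, y_pred) / float(np.var(y_true))


def simple_validation(X, y, fit, predict, test_fraction=0.2, seed=0):
    """Split at random, train on one part, report errors on both parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    model = fit(X[train_idx], y[train_idx])
    return {
        "train_mse": mse(y[train_idx], predict(model, X[train_idx])),
        "test_mse": mse(y[test_idx], predict(model, X[test_idx])),
        "test_r2": pseudo_r2(y[test_idx], predict(model, X[test_idx])),
    }


# Hypothetical stand-in machine: ordinary least squares with an intercept.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


# Synthetic data (assumption): a noisy linear relation in 3 inputs.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)
errors = simple_validation(X, y, fit_ols, predict_ols)
print(errors)
```

A training error much smaller than the test error would be the signature of overfitting described on the previous slides.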

Example
[Figure: observations]
[Figure: training/test datasets]
[Figure: training/test errors]
[Figure: summary]

Consistency in the parametric/non parametric case
Example in the parametric framework (linear methods)
an assumption is made on the form of the relation between $X$ and $Y$: $Y = \beta^T X + \epsilon$
$\beta$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method which calculates a $\beta^n$.
The estimation is said to be consistent if $\beta^n \xrightarrow{n \to +\infty} \beta$ under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.

Consistency in the parametric/non parametric case
Example in the nonparametric framework
the form of the relation between $X$ and $Y$ is unknown: $Y = \Phi(X) + \epsilon$
$\Phi$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method which calculates a $\Phi^n$.
The estimation is said to be consistent if $\Phi^n \xrightarrow{n \to +\infty} \Phi$ under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.

Consistency from the statistical learning perspective
[Vapnik, 1995]
Question: Are we really interested in estimating $\Phi$ or...
... rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine $\Phi^n$ from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error),
$\mathbb{E}\left(R(\Phi^n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right)$,
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.
Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right)$ and $L_\Phi = \mathbb{E}\left(R(\Phi(X), Y)\right)$.

Purpose of the work
We focus on methods that are universally consistent. These methods lead to the definition of machines $\Phi^n$ such that:
$\mathbb{E}\, R(\Phi^n(X), Y) \xrightarrow{n \to +\infty} L^* = \inf_{\Phi : \mathbb{R}^d \to \mathbb{R}} L_\Phi$
for any random pair $(X, Y)$.
1 multi-layer perceptrons (neural networks): [Bishop, 1995]
2 Support Vector Machines (SVM): [Boser et al., 1992]
3 random forests: [Breiman, 2001] (universal consistency is not proven in this case)

Methodology
Purpose: Comparison of several metamodeling approaches (accuracy,
computational time...).

Methodology
For every data set, every output and every method,
1 The data set was split into a training set and a test set (on an 80%/20% basis);
2 The regression function was learned from the training set (with a full validation process for the hyperparameter tuning);
3 The performance was calculated on the test set: predictions were made from the test inputs and compared to the true outputs.

Methods
2 linear models:
one with the 11 explanatory variables;
one with the 11 explanatory variables plus several nonlinear
transformations of these variables (square, log...): stepwise AIC was
used to train the model;
MLP
SVM
RF
3 approaches based on splines: ACOSSO (ANOVA splines), SDR
(improvement of the previous one) and DACE (kriging based
approach).


Multilayer perceptrons (MLP)
A "one-hidden-layer perceptron" takes the form:
$\Phi_w : x \in \mathbb{R}^d \mapsto \sum_{i=1}^{Q} w_i^{(2)} \, G\left(x^T w_i^{(1)} + w_i^{(0)}\right) + w_0^{(2)}$
where:
the $w$ are the weights of the MLP that have to be learned from the learning set;
$G$ is a given activation function: typically, $G(z) = \frac{1 - e^{-z}}{1 + e^{-z}}$;
$Q$ is the number of neurons on the hidden layer. It controls the flexibility of the MLP. $Q$ is a hyper-parameter that is usually tuned during the learning process.
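
The formula above can be evaluated directly. The sketch below is a minimal forward pass only (no training); the argument names `W1`, `w0`, `w2` and `w2_0` are hypothetical labels chosen here for the weights $w_i^{(1)}$, $w_i^{(0)}$, $w_i^{(2)}$ and $w_0^{(2)}$.

```python
import numpy as np


def G(z):
    """Activation from the slide: (1 - exp(-z)) / (1 + exp(-z))."""
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))


def mlp_predict(x, W1, w0, w2, w2_0):
    """Evaluate a one-hidden-layer perceptron with Q hidden neurons.

    x    : (d,) input vector
    W1   : (Q, d) array whose row i is w_i^(1)
    w0   : (Q,) hidden-layer biases w_i^(0)
    w2   : (Q,) output weights w_i^(2)
    w2_0 : float, output bias w_0^(2)
    """
    hidden = G(W1 @ x + w0)           # one activation per hidden neuron
    return float(w2 @ hidden + w2_0)  # weighted sum plus output bias


# Hypothetical weights for d = 3 inputs and Q = 2 hidden neurons.
x = np.array([0.5, -1.0, 2.0])
W1 = np.array([[0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]])
print(mlp_predict(x, W1, w0=np.zeros(2), w2=np.array([1.0, -2.0]), w2_0=0.7))
```

This particular activation is the hyperbolic tangent in disguise, since $(1 - e^{-z})/(1 + e^{-z}) = \tanh(z/2)$.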

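Symbolic representation of MLP
[Figure: the inputs $x_1, x_2, \ldots, x_d$ are connected to the hidden neurons 1 to $Q$ through the weights $w^{(1)}$ and the biases $w^{(0)}$; the neuron outputs are combined through the weights $w_1^{(2)}, \ldots, w_Q^{(2)}$ to give $\Phi_w(x)$.]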

Learning MLP
Learning the weights: $w$ are learned by a mean squared error minimization scheme:
$w^* = \arg\min_w \sum_{i=1}^{n} L(y_i, \Phi_w(x_i))$,
penalized by a weight decay to avoid overfitting (ensure a better generalization ability):
$w^* = \arg\min_w \sum_{i=1}^{n} L(y_i, \Phi_w(x_i)) + C \|w\|^2$.
Problem: MSE is not quadratic in $w$ and thus some solutions can be local minima.
Tuning the hyper-parameters, $C$ and $Q$: simple validation was first used to tune $C$ and $Q$.

SVM
SVM is also an algorithm based on penalized error loss minimization:
1 Basic linear SVM for regression: $\Phi_{(w,b)}$ is of the form $x \mapsto w^T x + b$ with $(w, b)$ solution of
$\arg\min_{(w,b)} \sum_{i=1}^{n} L_\epsilon\left(y_i, \Phi_{(w,b)}(x_i)\right) + \lambda \|w\|^2$
where
$\lambda$ is a regularization (hyper) parameter (to be tuned);
$L_\epsilon(y, \hat{y}) = \max\{|y - \hat{y}| - \epsilon, 0\}$ is an $\epsilon$-insensitive loss function.
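
The ε-insensitive loss and the penalized criterion of the linear SVM can be written in a few lines. This is a sketch for evaluating the objective at a given $(w, b)$ only; it does not solve the optimization problem, and the function names are chosen here for illustration.

```python
import numpy as np


def eps_insensitive_loss(y, y_hat, eps=1.0):
    """max(|y - y_hat| - eps, 0): errors smaller than eps cost nothing."""
    residual = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    return np.maximum(residual - eps, 0.0)


def linear_svr_objective(w, b, X, y, lam, eps=1.0):
    """Sum of eps-insensitive losses plus the penalty lam * ||w||^2."""
    losses = eps_insensitive_loss(y, X @ w + b, eps)
    return float(losses.sum() + lam * np.dot(w, w))


# Tiny hand-checkable example: predictions are 1 and 2, targets are 1 and 5.
X = np.array([[1.0], [2.0]])
y = np.array([1.0, 5.0])
print(linear_svr_objective(np.array([1.0]), 0.0, X, y, lam=0.5))  # 2.5
```

In the example, the first residual (0) lies inside the ε-tube and costs nothing, the second (3) costs 3 − 1 = 2, and the penalty adds 0.5 × 1² = 0.5.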

SVM
1 Basic linear SVM for regression
2 Non linear SVM for regression are the same except that a non linear (fixed) transformation of the inputs is made beforehand: $\varphi(x) \in \mathcal{H}$ is used instead of $x$.
[Figure: the original space $\mathcal{X}$ is mapped to the feature space $\mathcal{H}$ by a non linear map (labelled Ψ in the figure).]
Kernel trick: in fact, $\varphi$ is never explicit but used through a kernel, $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, which computes the inner product in the feature space: $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$.
Common kernel: the Gaussian kernel
$K_\gamma(u, v) = e^{-\gamma \|u - v\|^2}$
is known to have good theoretical properties both for accuracy and generalization.

Learning SVM
Learning $(w, b)$: $w = \sum_{i=1}^{n} \alpha_i K(x_i, \cdot)$ and $b$ are calculated by an exact optimization scheme (quadratic programming). The only step that can be time consuming is the calculation of the kernel matrix:
$K(x_i, x_j)$ for $i, j = 1, \ldots, n$.
The resulting $\Phi^n$ is known to be of the form:
$\Phi^n(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) + b$
where only a few $\alpha_i$ are non zero. The corresponding $x_i$ are called support vectors.
Tuning of the hyper-parameters, $C = 1/\lambda$, and $\gamma$: simple validation has been used. To save time, $\epsilon$ has not been tuned in our experiments but set to the default value (1), which ensured $0.5n$ support vectors at most.
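
The two computations described above, the Gaussian kernel matrix and the prediction $\sum_i \alpha_i K(x_i, x) + b$, can be sketched as follows. The coefficients `alpha` and `b` are made-up values for illustration; in practice they come out of the quadratic programming step, which is not shown here.

```python
import numpy as np


def gaussian_kernel(U, V, gamma):
    """K[i, j] = exp(-gamma * ||U[i] - V[j]||^2) for the rows of U and V."""
    U, V = np.atleast_2d(U), np.atleast_2d(V)
    sq_dist = (
        np.sum(U ** 2, axis=1)[:, None]
        + np.sum(V ** 2, axis=1)[None, :]
        - 2.0 * U @ V.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def svm_predict(X_new, X_train, alpha, b, gamma):
    """Phi(x) = sum_i alpha_i K(x_i, x) + b for every row x of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha + b


X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X_train, X_train, gamma=0.5)  # n x n kernel matrix

# Hypothetical coefficients: only the first observation is a support vector.
alpha = np.array([1.0, 0.0, 0.0])
print(svm_predict(np.array([[0.0, 0.0]]), X_train, alpha, b=0.5, gamma=0.5))
```

The kernel matrix has $n^2$ entries, which is why its calculation dominates the cost for large training sets; observations with $\alpha_i = 0$ can be dropped at prediction time.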

From regression tree to random forest
Example of a regression tree
[Figure: regression tree with splits on SOCt, PH, FR and clay (node labels: SOCt < 0.095, PH < 7.815, SOCt < 0.025, FR < 130.45, clay < 0.185, SOCt < 0.145, FR < 108.45, PH < 6.5) and leaf predictions between 2.685 and 59.330.]
Each split is made such that the two induced subsets have the greatest homogeneity possible.
The prediction of a final node is the mean of the Y values of the observations belonging to this node.
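
As an illustration of how one split can be chosen, the sketch below searches the threshold of a single variable. It assumes that homogeneity is measured by the within-node sum of squared deviations from the node mean, which is a common choice for regression trees; the slides do not state which criterion was used.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, sse) for the rule `x < threshold` that minimizes
    the total within-node sum of squared errors of the two child nodes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best_threshold, best_sse = None, np.inf
    values = np.unique(x)  # sorted distinct values
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left, right = y[x < threshold], y[x >= threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold, best_sse = threshold, float(sse)
    return best_threshold, best_sse


# Toy data (assumption): two well separated groups.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
threshold, sse = best_split(x, y)
left_prediction = y[x < threshold].mean()    # prediction of the left node
right_prediction = y[x >= threshold].mean()  # prediction of the right node
print(threshold, left_prediction, right_prediction)
```

A full tree repeats this search over all variables and then recursively inside each child node.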

Random forest
Basic principle: combination of a large number of under-efficient regression trees (the prediction is the mean prediction of all trees).
For each tree, two simplifications of the original method are performed:
1 A given number of observations are randomly chosen among the training set: this subset of the training data set is called the in-bag sample, whereas the other observations are called out-of-bag and are used to control the error of the tree;
2 For each node of the tree, a given number of variables are randomly chosen among all possible explanatory variables.
The best split is then calculated on the basis of these variables and of the chosen observations. The chosen observations are the same for a given tree whereas the variables taken into account change for each split.
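
The two sources of randomness and the final averaging can be sketched as below. Drawing the in-bag sample with replacement (a bootstrap sample, as in [Breiman, 2001]) and the values `n_in_bag=10` and `m=3` are assumptions made for this example; the tree-growing step itself is omitted.

```python
import numpy as np


def draw_in_bag(n, n_in_bag, rng):
    """Draw in-bag indices with replacement; the others are out-of-bag."""
    in_bag = rng.integers(0, n, size=n_in_bag)
    oob = np.setdiff1d(np.arange(n), in_bag)
    return in_bag, oob


def draw_split_variables(p, m, rng):
    """Choose m distinct candidate variables among p for one split."""
    return rng.choice(p, size=m, replace=False)


def forest_predict(tree_predictions):
    """Forest prediction: mean of the predictions of all trees.

    tree_predictions has one row per tree and one column per observation.
    """
    return np.mean(np.asarray(tree_predictions, float), axis=0)


rng = np.random.default_rng(0)
in_bag, oob = draw_in_bag(n=10, n_in_bag=10, rng=rng)  # drawn once per tree
variables = draw_split_variables(p=11, m=3, rng=rng)   # drawn at each split
print(sorted(set(in_bag.tolist())), oob.tolist(), variables.tolist())
```

Because the in-bag sample is fixed for a tree, its out-of-bag observations form a ready-made validation set for that tree.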

Additional tools
OOB (Out-Of-Bag) error: error based on the OOB predictions. Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable to help interpretation: for a given variable $X^j$ ($j \in \{1, \ldots, p\}$), the importance of $X^j$ is the mean decrease in accuracy obtained when the values of $X^j$ are randomized:
$I(X^j) = \mathbb{E}\left(R(\Phi^n(\mathbf{X}^{(j)}), Y)\right) - \mathbb{E}\left(R(\Phi^n(X), Y)\right)$
in which $\mathbf{X}^{(j)} = (X^1, \ldots, X^{(j)}, \ldots, X^p)$, $X^{(j)}$ being the variable $X^j$ with permuted values.
Importance is estimated with OOB observations (see next slides for details).

Learning a random forest
Random forests are not very sensitive to hyper-parameters (number of observations for each tree, number of variables for each split): the default values have been used.
The number of trees should be large enough for the mean squared error based on out-of-sample observations to stabilize:
[Figure: out-of-bag (training) error and test error as a function of the number of trees, from 0 to 500.]

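Importance estimation in random forests
OOB estimation for variable $X^j$ (here $\overline{T^b}$ denotes the out-of-bag sample of tree $b$)
1: for $b = 1 \to B$ (loop on trees) do
2: permute values for $(x_i^j)_{i :\, x_i \in \overline{T^b}}$; return $\mathbf{x}_i^{(j,b)} = (x_i^1, \ldots, x_i^{(j,b)}, \ldots, x_i^p)$, $x_i^{(j,b)}$ being the permuted values
3: predict $\Phi^b\left(\mathbf{x}_i^{(j,b)}\right)$
4: end for
5: return OOB estimation of the importance of $X^j$:
$\frac{1}{B} \sum_{b=1}^{B} \left[ \frac{1}{|\overline{T^b}|} \sum_{x_i \in \overline{T^b}} \left(\Phi^b(\mathbf{x}_i^{(j,b)}) - y_i\right)^2 - \frac{1}{|\overline{T^b}|} \sum_{x_i \in \overline{T^b}} \left(\Phi^b(x_i) - y_i\right)^2 \right]$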
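
The loop above translates almost line by line into code. In this sketch each tree is represented by an arbitrary prediction function and a list of out-of-bag indices; both the function name `oob_importance` and the toy "trees" are hypothetical and only serve to make the example self-contained.

```python
import numpy as np


def oob_importance(trees, oob_indices, X, y, j, seed=0):
    """OOB permutation importance of variable j.

    trees       : list of callables mapping an (m, p) array to m predictions
    oob_indices : list of index arrays, the out-of-bag sample of each tree
    """
    rng = np.random.default_rng(seed)
    differences = []
    for predict, oob in zip(trees, oob_indices):
        X_oob, y_oob = X[oob], y[oob]
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_oob[:, j])  # permute variable j only
        mse_perm = np.mean((predict(X_perm) - y_oob) ** 2)
        mse_ref = np.mean((predict(X_oob) - y_oob) ** 2)
        differences.append(mse_perm - mse_ref)
    return float(np.mean(differences))  # average over the B trees


# Toy setting (assumption): y depends on column 0 only, and both "trees"
# are perfect predictors that ignore column 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0]
trees = [lambda A: 3.0 * A[:, 0], lambda A: 3.0 * A[:, 0]]
oob_indices = [np.arange(0, 25), np.arange(25, 50)]
imp_0 = oob_importance(trees, oob_indices, X, y, j=0)
imp_1 = oob_importance(trees, oob_indices, X, y, j=1)
print(imp_0, imp_1)
```

Permuting a variable the model relies on degrades the predictions (large importance), whereas permuting an unused variable leaves the error unchanged (importance 0).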




Influence of the training sample size
[Figure: N2O prediction. R² (from 0.5 to 1.0) as a function of the log size of the training set (from 5 to 9) for LM1, LM2, Dace, SDR, ACOSSO, MLP, SVM and RF.]
[Figure: N leaching prediction. R² (from 0.6 to 1.0) as a function of the log size of the training set (from 5 to 9) for LM1, LM2, Dace, SDR, ACOSSO, MLP, SVM and RF.]

Computational time
Use         LM1    LM2     Dace    SDR      Acosso  MLP        SVM      RF
Train       <1 s   50 min  80 min  4 hours  65 min  2.5 hours  5 hours  15 min
Prediction  <1 s   <1 s    90 s    14 min   4 min   1 s        20 s     5 s
Time for DNDC: about 200 hours with a desktop computer and about 2
days using cluster 7!

Further comparisons
Evaluation of the different steps (time/difficulty)
Training Validation Test
LM1 ++ +
LM2 + +
ACOSSO = + -
SDR = + -
DACE = - -
MLP - - +
SVM = - -
RF + + +

Understanding which inputs are important
Example (N2O, RF):
[Figure: importance (mean decrease in MSE) of each input plotted against its rank; inputs shown: pH, Nr, N_MR, Nfix, N_FR, clay, Nres, Tmean, BD, rain, SOC.]
The variables SOC and PH are the most important for accurate predictions.

Understanding which inputs are important
Example (N leaching, SVM):
[Figure: importance (decrease in MSE, from 0 to 1500) of each input plotted against its rank; inputs shown: N_FR, Nres, pH, Nr, clay, rain, SOC, Tmean, Nfix, BD, N_MR.]
The variables N_MR, N_FR, Nres and pH are the most important for accurate predictions.

Thank you for your attention...
... questions?

References
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
Boser, B., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In 5th Annual ACM Workshop on COLT, pages 144–152. D. Haussler Editor, ACM Press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.
Villa-Vialaneix, N., Follador, M., Ratto, M., and Leip, A. (2012). A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops. Environmental Modelling and Software, 34:51–66.

ε-insensitive loss function
