The Impact of Smoothness on Model Class Selection in Nonlinear System Identification:
An Application of Derivatives in the RKHS
Y. Bhujwalla, V. Laurain, M. Gilson
6th July 2016
yusuf-michael.bhujwalla@univ-lorraine.fr
Introduction
The Data-Generating System
Measured data : $\mathcal{D}_N = \{(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)\}$.
The data describe $\mathcal{S}_o$, an unknown nonlinear system with function $f_o : \mathcal{X} \to \mathbb{R}$,
$$\mathcal{S}_o : \begin{cases} y_{o,k} = f_o(x_k) \\ y_k = y_{o,k} + e_{o,k} \end{cases}$$
where $x_k = [\, y_{k-1} \cdots y_{k-n_a} \;\; u_k \cdots u_{k-n_b} \,]^\top \in \mathcal{X} = \mathbb{R}^{n_a + n_b + 1}$.
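To make the regressor concrete, here is a minimal construction sketch in Python ; the helper name and indexing convention are illustrative assumptions, not from the paper.

```python
import numpy as np

def build_regressor(y, u, k, na, nb):
    """Hypothetical helper building x_k = [y_{k-1}, ..., y_{k-na}, u_k, ..., u_{k-nb}]^T."""
    past_outputs = [y[k - i] for i in range(1, na + 1)]   # y_{k-1} ... y_{k-na}
    inputs = [u[k - j] for j in range(0, nb + 1)]          # u_k ... u_{k-nb}
    return np.array(past_outputs + inputs)                 # length na + nb + 1
```

For example, with na = 2 and nb = 1, build_regressor(y, u, k, 2, 1) stacks [y_{k-1}, y_{k-2}, u_k, u_{k-1}].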
Parametric Models
· $N_\theta$ low (fixed) → physically interpretable
· Choice of basis function ? → combinatorially hard problem ✗
Nonparametric Models (e.g. kernel methods)
· $N_\theta$ high (∼ data) → not interpretable ✗
· Can define a general model class → flexibility
FIGURE: Example of a kernel-based estimate of a 1-D function ($y_o$ and kernel sections $k_x$).
Outline
1 Kernel Methods in Nonlinear Identification
2 The Kernel Selection Problem
3 Smoothness in the RKHS
4 Simulation Examples
1. Kernel Methods in Nonlinear Identification
Reproducing Kernel Hilbert Spaces
Hilbert Spaces
$\mathcal{H}$ is a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$, equipped with :
· a norm $\| f \|_{\mathcal{H}}$ and
· an inner product $\langle f, g \rangle_{\mathcal{H}}$.
In system identification, $\mathcal{H}$ ⇔ the model class.
Reproducing Kernels
$\mathcal{H}$ has a unique associated kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which spans the space $\mathcal{H}$.
The reproducing property states that $f(x)$ can be explicitly represented as an infinite sum in terms of the kernel function :
$$f(x) = \langle f, K_x \rangle_{\mathcal{H}} = \sum_{i=1}^{\infty} \alpha_i K(x_i, x)$$
1. Kernel Methods in Nonlinear Identification
Identification in the RKHS
For $\hat{f} \in \mathcal{H}$ close to $f_o$, $\hat{f}$ should reflect the observations :
$$\hat{f} = \arg\min_{f} \left\{\, V(f) = \mathcal{L}(x, y, f(x)) \,\right\}$$
However, there are infinitely many solutions ⇒ add a constraint to the model :
$$\hat{f} = \arg\min_{f} \left\{\, V(f) = \mathcal{L}(x, y, f(x)) + g(\| f \|_{\mathcal{H}}) \,\right\}$$
For such cost functions, $f(x)$ reduces to :
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x), \quad \alpha \in \mathbb{R}^N$$
· $f(x)$ → a finite sum over the observations.
· The Representer Theorem (Schölkopf, Herbrich and Smola, 2001).
1. Kernel Methods in Nonlinear Identification
A Widely-Used Example
As an example, minimise the squared error :
$$\mathcal{L}(x, y, f(x)) = \| y - f(x) \|_2^2,$$
and use regularisation to avoid overparameterisation :
$$g(\| f \|_{\mathcal{H}}) = \lambda \| f \|_{\mathcal{H}}^2.$$
Giving :
$$V_f : \; V(f) = \| y - f(x) \|_2^2 + \lambda_f \| f \|_{\mathcal{H}}^2 \quad\Rightarrow\quad \alpha_f = (K + \lambda_f I)^{-1} y$$
· The solution depends on : I. the kernel $K$ and II. the regularisation parameter $\lambda_f$.
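A rough numerical sketch of this closed-form estimator, assuming a 1-D Gaussian RBF kernel (defined as on the kernel-selection slide) ; the data-generating function and hyperparameter values are placeholders, not values from the paper.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """Gaussian RBF kernel K(xi, xj) = exp(-(xi - xj)^2 / sigma^2), for 1-D inputs."""
    return np.exp(-np.subtract.outer(xi, xj) ** 2 / sigma ** 2)

def fit_krr(x, y, sigma, lam):
    """Closed-form solution alpha_f = (K + lambda_f I)^{-1} y."""
    K = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(x_train, alpha, x_new, sigma):
    """Kernel expansion f(x) = sum_i alpha_i K(x_i, x)."""
    return gaussian_kernel(x_new, x_train, sigma) @ alpha

# Toy usage on a hypothetical noisy switching signal.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sign(x) * 10 + rng.normal(0, 2, x.size)
alpha = fit_krr(x, y, sigma=0.1, lam=1.0)
f_hat = predict(x, alpha, np.linspace(-1, 1, 50), sigma=0.1)
```

The estimate clearly depends on both the kernel (through $K$ and σ) and the regularisation parameter λf, which is the issue developed in the next section.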
2. The Kernel Selection Problem
Choosing a Kernel Function
$K$ defines the model class.
Let $\mathcal{X} = \mathbb{R}$, and let $K$ be the Gaussian RBF kernel :
$$K(x_i, x) = \exp\left( -\frac{\| x - x_i \|^2}{\sigma^2} \right).$$
The width σ defines the smoothness of the kernel function.
Hence σ determines the model class !
Other kernels have different hyperparameters, but they will still influence $\mathcal{H}$.
FIGURE: Gaussian kernel sections $K_x$ for two widths, σ1 and σ2 > σ1.
2. The Kernel Selection Problem
Implications of the Hyperparameter Selection
Estimation of a 1-D switching signal using $V_f = \| y - f(x) \|_2^2 + \lambda_f \| f \|_{\mathcal{H}}^2$.
· Many observations ($N = 10^3$).
· $u_k \sim \mathcal{U}(-1, 1)$.
· Significant noise disturbances (SNR = 5 dB).
· Two hyperparameters : I. σ and II. λ.
FIGURE: Estimation of a 1-D switching signal for different hyperparameter values ($f_o(u_k)$ vs. $u_k$).
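A sketch of this experimental setup, assuming a hypothetical switching nonlinearity (the paper's exact signal is not reproduced here) and arbitrary example grids for σ and λ :

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
u = rng.uniform(-1, 1, N)                       # u_k ~ U(-1, 1)
y_o = np.where(u < 0, -10.0, 20.0 * u)          # placeholder switching nonlinearity

# Add white noise at a target SNR of 5 dB : var(e) = var(y_o) / 10^(SNR/10).
snr_db = 5
noise_std = np.sqrt(np.var(y_o) / 10 ** (snr_db / 10))
y = y_o + rng.normal(0, noise_std, N)

# Sweep the two hyperparameters of V_f and inspect the resulting fits.
for sigma in (0.01, 0.1, 1.0):
    for lam in (1e-4, 1e-1, 1e2):
        K = np.exp(-np.subtract.outer(u, u) ** 2 / sigma ** 2)
        alpha = np.linalg.solve(K + lam * np.eye(N), y)   # alpha_f = (K + lam I)^{-1} y
        # ... evaluate each estimate, e.g. on a validation grid
```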
2. The Kernel Selection Problem
Implications of the Hyperparameter Selection
FIGURE: Estimates ($\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$ vs. $y_o$) for the four combinations of small/large λ and small/large σ, illustrating the trade-off between flexibility and smoothness.
2. The Kernel Selection Problem
Summary
$$V_f : \; V(f) = \| y - f(x) \|_2^2 + \lambda_f \| f \|_{\mathcal{H}}^2.$$
The kernel framework is very effective :
· flexible,
· well understood.
However, the choice of kernel is often compromised (e.g. by noise).
⇒ Trade-off between flexibility and smoothness.
So, why regularise over $\| f \|_{\mathcal{H}}$ . . .
. . . when smoothness is often a more interesting property to control ?
⇒ A desirable property in many models.
⇒ Characterises many systems.
3. Smoothness in the RKHS
Regularisation Using Derivatives
Proposition
Replace the functional regularisation :
$$V_f : \; V(f) = \| y - f(x) \|_2^2 + \lambda_f \| f \|_{\mathcal{H}}^2,$$
with the smoothness-enforcing regularisation :
$$V_D : \; V(f) = \| y - f(x) \|_2^2 + \lambda_D \| Df \|_{\mathcal{H}}^2.$$
Now :
· smoothness is controlled by the regularisation, and
· the kernel hyperparameter is removed from the optimisation problem.
3. Smoothness in the RKHS
Derivatives in the RKHS
For $f \in \mathcal{H}$, $Df \in \mathcal{H}$ (Zhou, 2008).
Hence, a derivative reproducing property can be defined :
$$Df = \langle f, DK_x \rangle_{\mathcal{H}}$$
The Representer Theorem
The representer $f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$ requires
$g(\| f \|_{\mathcal{H}})$ : a monotonically increasing function of $\| f \|_{\mathcal{H}}$.
Clearly, $\| Df \|_{\mathcal{H}}$ is not such a function of $\| f \|_{\mathcal{H}}$ ⇒ the representer is suboptimal for $V_D$.
However, if the system is well excited, $f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$ can still be used.
Moreover, it loosely preserves the bias-variance properties of $V_f$ :
$$\lim_{\lambda \to \infty} f(x) = 0, \quad \forall x \in \mathbb{R}.$$
3. Smoothness in the RKHS
Derivatives in the RKHS
A Closed-Form Solution
Using the derivative reproducing property, $\| Df \|_{\mathcal{H}}$ can be defined :
$$\| Df \|_{\mathcal{H}}^2 = \alpha^\top D^{(1,1)}K \,\alpha, \qquad \text{where} \quad D^{(1,1)}K(x_i, x_j) = \frac{\partial^2 K(x_i, x_j)}{\partial x_j \, \partial x_i}.$$
Permitting a closed-form solution :
$$\alpha_D = \left( K^\top K + \lambda_D D^{(1,1)}K \right)^{-1} K^\top y.$$
As for $V_f$ ⇒ $\alpha_f = (K + \lambda_f I)^{-1} y$.
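A minimal sketch of this closed form, assuming the 1-D Gaussian kernel used earlier, for which the cross-derivative $D^{(1,1)}K$ has the analytic expression coded below ; data and hyperparameter values are placeholders rather than the paper's exact settings.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """K(xi, xj) = exp(-(xi - xj)^2 / sigma^2), 1-D inputs."""
    return np.exp(-np.subtract.outer(xi, xj) ** 2 / sigma ** 2)

def gaussian_kernel_d11(xi, xj, sigma):
    """D^{(1,1)}K(xi, xj) = d^2 K / (dxj dxi) for the Gaussian kernel above."""
    diff = np.subtract.outer(xi, xj)
    K = np.exp(-diff ** 2 / sigma ** 2)
    return (2.0 / sigma ** 2 - 4.0 * diff ** 2 / sigma ** 4) * K

def fit_derivative_regularised(x, y, sigma, lam_d):
    """alpha_D = (K^T K + lam_D * D^{(1,1)}K)^{-1} K^T y."""
    K = gaussian_kernel(x, x, sigma)
    D11 = gaussian_kernel_d11(x, x, sigma)
    return np.linalg.solve(K.T @ K + lam_d * D11, K.T @ y)

# Usage with placeholder data; predictions reuse the standard kernel expansion.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 300)
y = np.sign(x) * 10 + rng.normal(0, 2, x.size)
alpha_d = fit_derivative_regularised(x, y, sigma=0.01, lam_d=1e-2)
f_hat = gaussian_kernel(np.linspace(-1, 1, 50), x, sigma=0.01) @ alpha_d
```

Note that σ now only needs to be small enough for flexibility; the smoothness of the estimate is steered by λD.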
4. Simulation Examples
Example 1 : Effect of the Regularisation
Estimation of a 1-D switching signal using $V_f$ and $V_D$.
· Many observations ($N = 10^3$).
· $u_k \sim \mathcal{U}(-1, 1)$.
· Significant noise disturbances (SNR = 5 dB).
· Gaussian RBF kernel, with σ = 0.01.
· Varying levels of regularisation (through λf, λD).
FIGURE: Estimation of a 1-D switching signal for different λ values ($f_o(u_k)$ vs. $u_k$).
4. Simulation Examples
Example 1 : Effect of the Regularisation
⇒ Negligible regularisation (very small λf , λD).
FIGURES: $V_f$ : $R(f)$ (left) and $V_D$ : $R(Df)$ (right), each showing $y_o$, $\hat{f}_{\mathrm{MEAN}}$ and $\hat{f}_{\mathrm{SD}}$.
⇒ Light regularisation (small λf , λD).
FIGURES: $V_f$ : $R(f)$ vs. $V_D$ : $R(Df)$.
⇒ Moderate regularisation.
FIGURES: $V_f$ : $R(f)$ vs. $V_D$ : $R(Df)$.
⇒ Heavy regularisation (large λf , λD).
FIGURES: $V_f$ : $R(f)$ vs. $V_D$ : $R(Df)$.
⇒ Excessive regularisation (very large λf , λD).
FIGURES: $V_f$ : $R(f)$ vs. $V_D$ : $R(Df)$.
4. Simulation Examples
Example 2 : 1D Structural Selection
Identification of two unknown systems ($\mathcal{X} \in [-1, 1]$, SNR = 10 dB, $N = 10^3$).
· $V_f$ : λ, σ optimised using cross-validation.
· $V_D$ : λ optimised using cross-validation, σ set based on the data.
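As an illustration of the cross-validation step, a minimal k-fold sketch over a λ grid for $V_D$, reusing the gaussian_kernel and fit_derivative_regularised helpers sketched earlier ; the grid, the number of folds and the scoring are assumptions, not the paper's exact procedure.

```python
import numpy as np

def cv_select_lambda(x, y, sigma, lam_grid, n_folds=5):
    """Pick lambda_D by k-fold cross-validated squared error (illustrative only)."""
    folds = np.array_split(np.random.default_rng(3).permutation(len(x)), n_folds)
    scores = []
    for lam in lam_grid:
        err = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.hstack([folds[j] for j in range(n_folds) if j != k])
            alpha = fit_derivative_regularised(x[train], y[train], sigma, lam)
            y_hat = gaussian_kernel(x[test], x[train], sigma) @ alpha
            err += np.mean((y[test] - y_hat) ** 2)
        scores.append(err / n_folds)
    return lam_grid[int(np.argmin(scores))]

# e.g. lam_d = cv_select_lambda(x, y, sigma=0.01, lam_grid=[1e-4, 1e-3, 1e-2, 1e-1, 1])
```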
FIGURE: $\mathcal{S}_o^1$ : smooth system $f_o^1(u_k)$ (left) and $\mathcal{S}_o^2$ : nonsmooth system $f_o^2(u_k)$ (right).
4. Simulation Examples
Example 2 : Smooth $\mathcal{S}_o^1$
Using a small kernel, $V_D$ can reconstruct a smooth function.
This is not feasible using $V_f$, which relies on the smoothing effect of the kernel.
FIGURES: $V_f$ : $R(f)$ (left) vs. $V_D$ : $R(Df)$ (right), showing $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$ and the kernel section $k_x$.
4. Simulation Examples
Example 2 : Nonsmooth $\mathcal{S}_o^2$
Using a small kernel, $V_D$ can detect the structural nonlinearity.
However, $V_f$ is too smooth, as σ must counteract the noise.
FIGURES: $V_f$ : $R(f)$ (left) vs. $V_D$ : $R(Df)$ (right), showing $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$ and the kernel section $k_x$.
Conclusions
RKHS in Nonlinear Identification
Flexible framework : attractive for nonlinear identification.
Smoothness controlled by the kernel function and the regularisation (σ and λf).
⇒ Constrained kernel function.
Derivatives in the RKHS
Smoothness controlled by the regularisation (λD).
⇒ Simpler steering of the smoothness.
Simpler hyperparameter optimisation (just λD) and increased model flexibility.
⇒ Through the use of a smaller kernel (small σ).
However, the approach relies on a suboptimal representer.
⇒ Nonetheless, promising results have been obtained.
A. Bibliography
Alternative Smoothness-Enforcing Optimisation Schemes
Sobolev Spaces (Wahba, 1990 ; Pillonetto et al, 2014)
$$\| f \|_{\mathcal{H}_k} = \sum_{i=0}^{m} \int_{\mathcal{X}} \left( \frac{d^i f(x)}{dx^i} \right)^2 dx$$
Identification using derivative observations (Zhou, 2008 ; Rosasco et al, 2010)
$$V_{\mathrm{obvs}}(f) = \| y - f(x) \|_2^2 + \gamma_1 \left\| \frac{dy}{dx} - \frac{df(x)}{dx} \right\|_2^2 + \cdots + \gamma_m \left\| \frac{d^m y}{dx^m} - \frac{d^m f(x)}{dx^m} \right\|_2^2 + \lambda \| f \|_{\mathcal{H}}$$
Regularisation Using Derivatives (Rosasco et al, 2010 ; Lauer, Le and Bloch, 2012 ; Duijkers et al, 2014)
$$V_D(f) = \| y - f(x) \|_2^2 + \lambda \| D^m f \|_p.$$
A. Bibliography
Literature Review
Kernel Methods in Machine Learning and System Identification
· Kernel methods in system identification, machine learning and function estimation : A survey, G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao and L. Ljung, 2014.
· Learning with Kernels, B. Schölkopf, R. Herbrich and A. J. Smola, 2002.
· Gaussian Processes for Machine Learning, C. Rasmussen and C. Williams, 2006.
Reproducing Kernel Hilbert Spaces
· Theory of Reproducing Kernels, N. Aronszajn, 1950.
· A Generalized Representer Theorem, B. Schölkopf, R. Herbrich and A. J. Smola, 2001.
· Derivative reproducing properties for kernel methods in learning theory, D. Zhou, 2008.