This document proposes an RKHS approach to systematic kernel selection in nonlinear system identification. It discusses:
1. Using kernel methods for nonlinear system identification by representing functions in terms of kernels.
2. The kernel selection problem: the trade-off between choosing an overly flexible and an overly constrained model class.
3. Incorporating derivative information into the problem formulation to transfer model selection from kernel optimization to explicit regularization over derivatives.
An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification
1. An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification
Y. Bhujwalla, V. Laurain, M. Gilson
55th IEEE Conference on Decision and Control
yusuf-michael.bhujwalla@univ-lorraine.fr
2. Introduction
Problem Description
Measured data:
$$\mathcal{D}_N = \{(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)\}$$
describing an unknown system $\mathcal{S}_o$:
$$y_{o,k} = f_o(x_k), \quad f_o : \mathcal{X} \to \mathbb{R}, \qquad y_k = y_{o,k} + e_{o,k}, \quad e_{o,k} \sim \mathcal{N}(0, \sigma_e^2)$$
where the regressor is
$$x_k = [\, y_{k-1} \cdots y_{k-n_a} \;\; u_{1,k} \cdots u_{1,k-n_b} \;\; u_{2,k} \cdots u_{n_u,k-n_b} \,]^\top \in \mathcal{X} = \mathbb{R}^{n_a + n_u(n_b+1)}$$
3-5. Introduction
Modelling Objective
Aim: to choose the simplest model from a candidate set of models that accurately describes the system:
$$\mathcal{M}_{\mathrm{opt}} : \text{Accuracy (Data) vs Simplicity (Model)}$$
$$V_f : \quad V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + g(f)$$
Q1: How to choose the simplest accurate model?
- Often $g(f) = \lambda \, \|f\|_{\mathcal{H}}^2$, ensuring uniqueness of the solution
- $\lambda$ controls the bias-variance trade-off
Q2: How to determine a suitable set of candidate models?
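For reference, with the common choice $g(f) = \lambda \|f\|_{\mathcal{H}}^2$ the minimiser has the standard kernel ridge regression closed form (a textbook identity, stated here for context rather than taken from the slides):

```latex
% Representer theorem + ridge penalty: finite-dimensional solution.
f(x) = \sum_{i=1}^{N} \alpha_i \, k_{x_i}(x),
\qquad
\hat{\alpha} = (K + \lambda I_N)^{-1} y,
\qquad
K_{ij} = k(x_i, x_j)
```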
6. Outline
1. Kernel Methods in Nonlinear Identification
2. Model Selection Using Derivatives
3. Smoothness-Enforcing Regularisation
4. Application: Estimation of Locally Nonsmooth Functions
7. 1. Kernel Methods in Nonlinear Identification
FIGURE: estimate $\hat{f}$ and its constituent kernels $k_x$ (output vs input)
Model:
$$\mathcal{F}_f : \quad f(x) = \sum_{i=1}^{N} \alpha_i \, k_{x_i}(x)$$
→ Nonparametric ($n_\theta \sim N$)
→ Flexible: $\mathcal{M}$ defined through the choice of $\mathcal{K}$
→ Height: $\alpha$ (model parameters)
→ Width: $\sigma$ (kernel hyperparameter)
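As a minimal illustration of this model structure (a sketch, not the paper's code), the following fits a Gaussian-kernel model of the above form by regularised least squares; the width sigma, weight lam and the toy data are all assumed values:

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Matrix K[i, j] = exp(-(x_i - c_j)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)                          # inputs
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)   # toy noisy observations

sigma, lam = 0.2, 1e-3                 # assumed width and regularisation weight
K = gaussian_kernel(x, x, sigma)       # one kernel per observation (nonparametric)

# Kernel ridge fit: alpha = (K + lam I)^{-1} y, then f(x) = sum_i alpha_i k_{x_i}(x).
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)
f_hat = K @ alpha                      # model predictions at the observations
```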
8. 1. Kernel Methods in Nonlinear Identification
Identification in the RKHS
Reproducing Kernel Hilbert Spaces
The kernel function defines the model class:
$$\mathcal{K} \leftrightarrow \mathcal{H}$$
Hence, functions can be represented in terms of kernels:
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}} \tag{1}$$
9-10. 1. Kernel Methods in Nonlinear Identification
The Kernel Selection Problem
Choosing an overly flexible model class (a small kernel):
FIGURE: Flexible Model Class
FIGURE: High Variance ($f_o$, $y$, $\hat{f}$, $k_x$ on $[-1, 1]$)
11-12. 1. Kernel Methods in Nonlinear Identification
The Kernel Selection Problem
Choosing an overly constrained model class (a large kernel):
FIGURE: Constrained Model Class
FIGURE: Model Biased
13-14. 1. Kernel Methods in Nonlinear Identification
The Kernel Selection Problem
Why not just choose the 'optimal' model class?
FIGURE: Optimal Model Class
FIGURE: Optimal Model
15. 1. Kernel Methods in Nonlinear Identification
The Kernel Selection Problem
Why not just choose the 'optimal' model class?
• This is, in general, what we try to do.
• However, $\mathcal{H}_{\mathrm{opt}}$ is unknown.
• Optimisation over one hyperparameter: not that difficult.
• Optimisation over multiple model structures, kernel functions and hyperparameters: more difficult.
FIGURE: $f_o$, $y$, $\hat{f}$, $k_x$
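To make the single-hyperparameter case concrete, here is a sketch of choosing the kernel width by simple hold-out validation over a grid; this is one plausible selection scheme, not one the slides prescribe, and it reuses gaussian_kernel, x, y and lam from the sketch above:

```python
# Hold-out selection of the kernel width sigma (grid of widths is an assumption).
x_tr, y_tr = x[::2], y[::2]      # training half
x_va, y_va = x[1::2], y[1::2]    # validation half

best_sigma, best_err = None, np.inf
for sigma in (0.05, 0.1, 0.2, 0.4, 0.8):
    K_tr = gaussian_kernel(x_tr, x_tr, sigma)
    alpha = np.linalg.solve(K_tr + lam * np.eye(x_tr.size), y_tr)
    err = np.mean((y_va - gaussian_kernel(x_va, x_tr, sigma) @ alpha) ** 2)
    if err < best_err:
        best_sigma, best_err = sigma, err
```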
17-19. 2. Model Selection Using Derivatives
But note that many properties of $\mathcal{K}$ are encoded into its derivatives, e.g.
Smoothness:
$$f(x) = ax^2 + bx + c \;\Longrightarrow\; \frac{d^3 f(x)}{dx^3} = 0 \quad \forall x$$
$$f(x) = g_1(x)\,[x < x^*] + g_2(x)\,[x > x^*] \;\Longrightarrow\; \exists\, \frac{d f(x)}{dx} \quad \forall x \neq x^*$$
Linearity:
$$f(x_1, x_2) = x_1 h_1(x_2) + h_2(x_2) \;\Longrightarrow\; \frac{\partial^2 f(x_1, x_2)}{\partial x_1^2} = 0 \quad \forall x_1$$
Separability:
$$f(x_1, x_2) = g(x_1) + h(x_2) \;\Longrightarrow\; \frac{\partial^2 f(x_1, x_2)}{\partial x_1 \, \partial x_2} = 0 \quad \forall x_1, x_2$$
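These conditions are easy to verify numerically; as an illustrative check (not from the slides), the following estimates the mixed partial in the separability condition by central finite differences:

```python
import numpy as np

def mixed_partial(f, x1, x2, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx1 dx2)."""
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h ** 2)

f_sep = lambda x1, x2: np.sin(x1) + x2 ** 2    # separable: g(x1) + h(x2)
f_mix = lambda x1, x2: np.sin(x1 * x2)         # not separable

print(mixed_partial(f_sep, 0.3, -0.7))   # ~ 0, as the condition predicts
print(mixed_partial(f_mix, 0.3, -0.7))   # clearly nonzero
```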
20. 2. Model Selection Using Derivatives
Incorporating this information into the problem formulation allows model selection to be transferred from an optimisation over $\mathcal{K}$...
...to an explicit regularisation problem over derivatives, using an a priori flexible model class definition.
22-24. 3. Smoothness-Enforcing Regularisation
Problem Formulation
Here we consider $\mathcal{X} = \mathbb{R}$, where the kernel optimisation reduces to a smoothness selection problem.
What would we like to do?
Replace the existing functional norm regularisation...
$$V_f : \quad V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \, \|f\|_{\mathcal{H}}^2$$
...with a smoothness penalty in the cost function...
$$V_D : \quad V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \, \|Df\|_{\mathcal{H}}^2$$
How?
- $\|Df\|_{\mathcal{H}}^2$ → known (D. X. Zhou, 2008)
- $f(x)$ for $V_D$ → unknown
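For a finite kernel expansion and a smooth translation-invariant kernel such as the Gaussian, the derivative penalty reduces to a quadratic form in the coefficients via the derivative reproducing property (Zhou, 2008, as cited above); a sketch of the identity, stated under those assumptions:

```latex
% Derivative penalty as a quadratic form in the kernel coefficients.
f = \sum_{i} \alpha_i \, k_{x_i}
\;\Longrightarrow\;
\|Df\|_{\mathcal{H}}^2 = \alpha^\top \mathbf{D}\, \alpha,
\qquad
\mathbf{D}_{ij}
  = \left. \frac{\partial^2 k(x, x')}{\partial x \, \partial x'} \right|_{x = x_i,\; x' = x_j}
```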
25. 3. Smoothness-Enforcing Regularisation
An Extended Representer of f(x)
A finite representer for $V_D$ does not exist.
But, by adding kernels along $\mathcal{X}$, an approximate formulation can be defined:
FIGURE: N = 2 (observations and observation kernels; $\|f\|^2$)
FIGURE: (N, P) = (2, 8) (observation kernels plus added kernels; $\|Df\|^2$)
$$\mathcal{F}_D : \quad f(x) = \sum_{i=1}^{N} \alpha_i \, k_{x_i}(x) + \sum_{j=1}^{P} \alpha_j^* \, k_{x_j^*}(x)$$
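Putting the pieces together, here is a sketch of an estimator of this extended form with the derivative penalty $\lambda \, \alpha^\top \mathbf{D} \alpha$ for a Gaussian kernel. It is an illustrative reconstruction under the assumptions above, not the authors' implementation; the uniform grid for the added centres $x_j^*$, the toy system and all numerical values are assumptions:

```python
import numpy as np

def gauss(x, c, sigma):
    """Pairwise Gaussian kernel matrix k(x_i, c_j)."""
    return np.exp(-(x[:, None] - c[None, :]) ** 2 / (2 * sigma ** 2))

def gauss_dd(c1, c2, sigma):
    """Mixed derivative d^2 k / (dx dx') at all pairs of centres (Gaussian case)."""
    d = c1[:, None] - c2[None, :]
    return gauss(c1, c2, sigma) * (1.0 / sigma ** 2 - d ** 2 / sigma ** 4)

rng = np.random.default_rng(0)
N, P, lam = 30, 50, 1e-2                        # assumed sizes and weight
x = np.sort(rng.uniform(-1.0, 1.0, N))          # observation inputs
y = np.sign(x) + 0.1 * rng.standard_normal(N)   # toy locally nonsmooth system

centers = np.concatenate([x, np.linspace(-1.0, 1.0, P)])  # x_i plus added x_j*
sigma = 0.5 * (2.0 / P)        # width from kernel density rho_k ~ 0.5 (cf. slide 26)

K = gauss(x, centers, sigma)           # N x (N + P) evaluation matrix
D = gauss_dd(centers, centers, sigma)  # (N + P) x (N + P) derivative Gram matrix

# Minimise ||y - K a||^2 + lam * a' D a  =>  (K' K + lam D) a = K' y.
a = np.linalg.solve(K.T @ K + lam * D, K.T @ y)

x_grid = np.linspace(-1.0, 1.0, 200)
f_hat = gauss(x_grid, centers, sigma) @ a       # estimate over the input range
```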
26. 3. Smoothness-Enforcing Regularisation
Choosing the Kernel Width
Examination of the kernel density allows us to make an a priori choice of kernel width:
FIGURE: $\rho_k = 0.4$
FIGURE: $\rho_k = 0.5$
FIGURE: $\rho_k = 0.6$
Hence, for a given P we can define the maximally flexible model class for a given problem.
28. 4. Application
Estimation of Locally Nonsmooth Functions
In $V_D$, smoothness ∼ regularisation.
Hence, by introducing weights into the loss function, the importance of the regularisation can be varied across $\mathcal{X}$:
$$V_w : \quad V(f) = \sum_{k=1}^{N} \big( w_k y_k - w_k f(x_k) \big)^2 + \lambda \, \|Df\|_{\mathcal{H}}^2$$
How to determine the weights?
Relative to a particular modelling objective, e.g.
• $w_k \sim \|D \hat{f}^{(0)}(x_k)\|_2^2$ for piecewise constant structures, or
• $w_k \sim \|D^2 \hat{f}^{(0)}(x_k)\|_2^2$ for piecewise linear structures.
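A plausible two-step reading of this scheme (the slides do not spell out the exact procedure): fit once without weights, set $w_k$ from the initial derivative estimates, then refit with the weighted loss. Continuing the sketch after slide 25, with its gauss, K, D, a, centers, sigma, lam, x and y in scope:

```python
def gauss_dx(x, c, sigma):
    """dk(x, c)/dx for the Gaussian kernel, evaluated pairwise."""
    d = x[:, None] - c[None, :]
    return -(d / sigma ** 2) * gauss(x, c, sigma)

# 1) Derivative of the initial (unweighted) estimate at the observations.
df0 = gauss_dx(x, centers, sigma) @ a

# 2) Weights ~ local derivative energy (piecewise-constant objective above).
w = df0 ** 2
w = w / w.max() + 1e-3   # normalise; small floor keeps every sample in play

# 3) Weighted refit: minimise sum_k (w_k y_k - w_k f(x_k))^2 + lam ||Df||^2.
Kw = w[:, None] * K      # row-scaled evaluation matrix, i.e. W K
aw = np.linalg.solve(Kw.T @ Kw + lam * D, Kw.T @ (w * y))
```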
29. 4. Application
Estimation of Locally Nonsmooth Functions
FIGURE: Noise-Free System ($y_o$)
FIGURE: Noisy System ($y$)
32-34. Conclusions
Objectives:
• To simplify model selection in nonlinear identification.
• By shifting the problem to a regularisation over functional derivatives.
→ Allowing the definition of an a priori flexible model class.
This presentation:
• First step ⇒ consider a simple example.
→ Model selection ⇔ smoothness detection.
→ Kernel selection ⇔ hyperparameter optimisation.
Current/Future Research:
• Application to dynamical, control-oriented problems (e.g. linear parameter-varying identification).
• Investigation of more complex model selection problems (e.g. detection of linearities, separability...).
35. A. Bibliography
• Sobolev Spaces (Wahba, 1990; Pillonetto et al., 2014)
$$\|f\|_{\mathcal{H}_k} = \sum_{i=0}^{m} \int_{\mathcal{X}} \left( \frac{d^i f(x)}{dx^i} \right)^2 dx$$
• Identification using derivative observations (Zhou, 2008; Rosasco et al., 2010)
$$V_{\mathrm{obs}}(f) = \|y - f(x)\|_2^2 + \gamma_1 \left\| \frac{dy}{dx} - \frac{df(x)}{dx} \right\|_2^2 + \cdots + \gamma_m \left\| \frac{d^m y}{dx^m} - \frac{d^m f(x)}{dx^m} \right\|_2^2 + \lambda \, \|f\|_{\mathcal{H}}$$
• Regularisation using derivatives (Rosasco et al., 2010; Lauer, Le and Bloch, 2012; Duijkers et al., 2014)
$$V_D(f) = \|y - f(x)\|_2^2 + \lambda \, \|D^m f\|_p$$
36. B. Choosing the Kernel Width
The Smoothness-Tolerance Parameter
$$\rho_k = \frac{\sigma}{\Delta x^*}, \qquad \Delta x^* = \frac{x^*_{\max} - x^*_{\min}}{P}, \qquad \epsilon_{\hat{f}} = 100 \times \left( 1 - \frac{\|\hat{f}\|_{\infty}}{C} \right) \%$$
FIGURE: Selecting an appropriate kernel using $\epsilon$ (smoothness tolerance $\epsilon(\rho)$ against kernel density $\rho$, with threshold $\hat{\epsilon}$)
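In code, this a priori width choice is a one-liner; a small sketch under the grid assumptions used earlier:

```python
# A priori kernel width from a target kernel density (grid values assumed).
x_star_min, x_star_max, P = -1.0, 1.0, 50
rho_k = 0.5                                  # target density, cf. slide 26
dx_star = (x_star_max - x_star_min) / P      # spacing of the added kernels
sigma = rho_k * dx_star                      # since rho_k = sigma / dx_star
```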
37. C. Effect of the Regularisation
⇒ Negligible regularisation (very small $\lambda_f$, $\lambda_D$).
FIGURE: $V_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $V_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
38. C. Effect of the Regularisation
⇒ Light regularisation (small $\lambda_f$, $\lambda_D$).
FIGURE: $V_f : R(f)$
FIGURE: $V_D : R(Df)$
39. C. Effect of the Regularisation
⇒ Moderate regularisation.
FIGURE: $V_f : R(f)$
FIGURE: $V_D : R(Df)$
40. C. Effect of the Regularisation
⇒ Heavy regularisation (large $\lambda_f$, $\lambda_D$).
FIGURE: $V_f : R(f)$
FIGURE: $V_D : R(Df)$
41. C. Effect of the Regularisation
⇒ Excessive regularisation (very large $\lambda_f$, $\lambda_D$).
FIGURE: $V_f : R(f)$
FIGURE: $V_D : R(Df)$
42. D. Further Examples: Detecting Piecewise Structures
$\mathcal{S}_o$: Noise-free and observed data
FIGURE: $y(x_1, x_2)$ on $[-1, 1]^2$
43. D. Further Examples: Detecting Piecewise Structures
Results M1: $(V_f, \mathcal{F}_f)$
FIGURE: MEDIAN | FIGURE: BIAS | FIGURE: SDEV
44. D. Further Examples: Detecting Piecewise Structures
Results M2: $(V_D, \mathcal{F}_D)$
FIGURE: MEDIAN | FIGURE: BIAS | FIGURE: SDEV
45. D. Further Examples: Detecting Piecewise Structures
Results M3: $(V_w, \mathcal{F}_D)$
FIGURE: MEDIAN | FIGURE: BIAS | FIGURE: SDEV
46-56. E. Further Examples: Enforcing Separability
$$f(x_1, x_2) \;\xrightarrow{\;\lambda\;}\; f_1(x_1) + f_2(x_2)$$
FIGURE: $V_{DX} : R(\partial_{x_1} \partial_{x_2} f)$, shown over a sequence of slides (46-56)