A DYNAMIC FORMULATION OF THE PATTERN RECOGNITION
PROBLEM
M.B. Shavlovsky^1, O.V. Krasotkina^2, V.V. Mottl^3
^1 Moscow Institute of Physics and Technology
Dolgoprudny, Moscow Region, 141700, Institutsky Pereulok, 9, shavlovsky@yandex.ru
^2 Tula State University
Tula, 300600, Lenin Ave., 92, krasotkina@uic.tula.ru
^3 Computing Center of the Russian Academy of Sciences
Moscow, 119333, Vavilov St., 40, vmottl@yandex.ru
The classical learning problem of pattern recognition in a finite-dimensional linear space of real-valued features is studied under the conditions of a non-stationary universe. The simplest statement of this problem with two classes of objects is considered under the assumption that the instantaneous property of the universe is completely expressed by a discriminant hyperplane whose parameters change sufficiently slowly in time. In this case, any object has to be considered along with the time marker which specifies when the object was selected from the universe, and the training set becomes, actually, a time series. The training criterion of non-stationary pattern recognition is formulated as a generalization of the classical Support Vector Machine. The respective numerical algorithm has computational complexity proportional to the length of the training time series.
Introduction
The aim of this study is the creation of a basic mathematical framework and the simplest algorithms for solving typical practical problems of pattern recognition learning in universes whose properties change over time. The commonly known classical statement of the pattern recognition problem is based on the tacit assumption that the properties of the universe at the moment of decision making remain the same as when the training set was formed. The more realistic assumption of the non-stationarity of the universe, which is accepted in this work, inevitably leads to the necessity to analyze a sequence of samples at some time moments and find different recognition rules for them.
This work is supported by grants of the Russian
Foundation for Basic Research No. 05-01-00679
and 06-01-00412.
The classical stationary pattern recognition problem with two classes of objects: the Support Vector Machine
Let each object $\omega$ of the universe be represented by a point in the linear space of features $\mathbf{x}(\omega) = \bigl(x^{(1)}(\omega), \ldots, x^{(n)}(\omega)\bigr) \in \mathbb{R}^n$, and its hidden membership in one of two classes be specified by the value of the class index $y(\omega) \in \{-1, 1\}$. The classical approach to the training problem developed by V. Vapnik [1] is based on treating the model of the universe in the form of a discriminant function defined by a hyperplane having an a priori unknown direction vector and threshold: $f\bigl(\mathbf{x}(\omega)\bigr) = \mathbf{a}^T \mathbf{x}(\omega) + b$ is primarily $> 0$ if $y(\omega) = 1$, and $< 0$ if $y(\omega) = -1$.
The unknown parameters of the hyperplane are to be estimated by analyzing a training set of objects $\{\omega_j,\ j = 1, \ldots, N\}$ represented by their feature vectors and class-membership indices, so that the training set as a whole is a finite set of pairs $\{(\mathbf{x}_j, y_j) \in \mathbb{R}^n \times \mathbb{R},\ j = 1, \ldots, N\}$. The commonly adopted training principle is that of the optimal discriminant hyperplane, chosen by the criterion of maximizing the number of points which are classified correctly with a guaranteed margin, conventionally taken to be equal to unity:

$$J(\mathbf{a}, b, \delta_j,\ j = 1, \ldots, N) = \mathbf{a}^T \mathbf{a} + C \sum_{j=1}^{N} \delta_j \to \min,$$
$$y_j\bigl(\mathbf{a}^T \mathbf{x}_j + b\bigr) \geq 1 - \delta_j, \quad \delta_j \geq 0, \quad j = 1, \ldots, N. \tag{1}$$
The notion of time is completely absent here.
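As an illustration, criterion (1) can be prototyped directly with a general-purpose convex solver. The sketch below is not from the original paper: it assumes the cvxpy package as a dependency and uses hypothetical synthetic arrays X, y in place of a real training set.

```python
# A minimal sketch of criterion (1) with a general-purpose convex solver.
# cvxpy is an assumed dependency; X, y are hypothetical toy data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, n, C = 40, 2, 10.0
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)    # class indices y_j in {-1, 1}
X = rng.normal(size=(N, n)) + 2.0 * y[:, None]  # feature vectors, shifted by class

a = cp.Variable(n)                    # direction vector a
b = cp.Variable()                     # threshold b
d = cp.Variable(N, nonneg=True)       # slack variables delta_j

margins = cp.multiply(y, X @ a + b)   # y_j (a^T x_j + b)
problem = cp.Problem(cp.Minimize(cp.sum_squares(a) + C * cp.sum(d)),
                     [margins >= 1 - d])
problem.solve()
print("a =", a.value, "b =", b.value)
```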
The mathematical model of a non-stationary universe and the main kinds of the training problems
The principal novelty of the concept of the non-stationary universe is involving the time factor $t$. It is assumed that the main property of the non-stationary universe is completely expressed by the time-varying discriminant hyperplane which, in its turn, is completely determined by the direction vector and the threshold, both being functions of time: $f_t\bigl(\mathbf{x}(\omega)\bigr) = \mathbf{a}_t^T \mathbf{x}(\omega) + b_t$ is primarily $> 0$ if $y(\omega) = 1$ and $< 0$ if $y(\omega) = -1$.
Any object is to be considered always along with the time mark of its appearance $(\omega, t)$. As a result, the training set gains the structure of a set of triples instead of pairs: $\{(\mathbf{x}_j, y_j, t_j),\ j = 1, \ldots, N\}$, $(\mathbf{x}_j, y_j) \in \mathbb{R}^n \times \mathbb{R}$. If we order the objects as they appear, it is appropriate to speak about a training sequence rather than a training set, and to consider it as a time series with, in the general case, varying time steps.
The hidden discriminant hyperplane has different values of the direction vector and threshold at different time moments $t_j$. So, there exists a two-component time series with one hidden and one observable component, respectively $(\mathbf{a}_j, b_j)$ and $(\mathbf{x}_j, y_j)$.
The dynamic formulation turns the training problem into that of two-component time series analysis, in which it is required to estimate the hidden component from the observable one. This is a standard signal (time series) analysis problem whose specificity boils down to the assumed model of the relationship between the hidden and the observable component. In accordance with the classification introduced by N. Wiener [2], it is natural to distinguish between at least two kinds of training problems as those of estimating the hidden component.
The problem of filtration of the training time series. Let a new object appear at the time moment $t_j$ when the feature vectors and class-membership indices of the previous objects are already registered, $\ldots, (\mathbf{x}_{j-1}, y_{j-1}), (\mathbf{x}_j, y_j)$, including the current moment $(\ldots, t_{j-1}, t_j)$. It is required to recurrently estimate the parameters of the discriminant hyperplane $(\hat{\mathbf{a}}_j, \hat{b}_j)$ at each time moment $t_j$ immediately in the process of observation.
The problem of interpolation. Let the training time series be completely registered in some time interval, $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, before its processing starts. It is required to estimate the time-varying parameters of the discriminant hyperplane in the entire observation interval, $\{(\hat{\mathbf{a}}_1, \hat{b}_1), \ldots, (\hat{\mathbf{a}}_N, \hat{b}_N)\}$.
It is assumed that the parameters of the discriminant hyperplane $\mathbf{a}_t$ and $b_t$ are changing slowly in the sense that the values

$$\frac{(\mathbf{a}_j - \mathbf{a}_{j-1})^T (\mathbf{a}_j - \mathbf{a}_{j-1})}{t_j - t_{j-1}} \quad \text{and} \quad \frac{(b_j - b_{j-1})^2}{t_j - t_{j-1}}$$

are, as a rule, sufficiently small. This assumption prevents the degeneration of the filtration and interpolation problems into a collection of independent, ill-posed two-class training problems, each with a single observation.
From the formal point of view, the interpolation-based estimate of the discriminant hyperplane parameters $(\hat{\mathbf{a}}_N, \hat{b}_N)$ obtained at the last point of the observation interval is just the solution of the filtration problem at this time moment. However, the essence of the filtration problem is the requirement of evaluating the estimates in the on-line mode, immediately as the observations come one after another, without solving, each time, the interpolation problem for a time series of increasing length.
The training criterion in the interpolation mode
We consider here only the interpolation problem. The proposed formulation of this problem differs from a collection of classical SVM-based criteria (1) for consecutive time moments only by the presence of additional terms which penalize the difference between adjacent values of the hyperplane parameters $(\mathbf{a}_{j-1}, b_{j-1})$ and $(\mathbf{a}_j, b_j)$:

$$J(\mathbf{a}_j, b_j, \delta_j,\ j = 1, \ldots, N) = \sum_{j=1}^{N} \mathbf{a}_j^T \mathbf{a}_j + C \sum_{j=1}^{N} \delta_j + \sum_{j=2}^{N} \left[ D_{\mathbf{a}} \frac{(\mathbf{a}_j - \mathbf{a}_{j-1})^T (\mathbf{a}_j - \mathbf{a}_{j-1})}{t_j - t_{j-1}} + D_b \frac{(b_j - b_{j-1})^2}{t_j - t_{j-1}} \right] \to \min,$$
$$y_j\bigl(\mathbf{a}_j^T \mathbf{x}_j + b_j\bigr) \geq 1 - \delta_j, \quad \delta_j \geq 0, \quad j = 1, \ldots, N. \tag{2}$$
The coefficients $D_{\mathbf{a}} > 0$ and $D_b > 0$ are hyper-parameters which preset the desired level of smoothing of the instantaneous parameters of the discriminant hyperplane.
The criterion (2) implements the concept of an optimal, sufficiently smooth sequence of discriminant hyperplanes, in contrast to the concept of the single optimal hyperplane in (1). The sought-for hyperplanes have to provide the correct classification of the feature vectors for as many time moments as possible with a guaranteed margin taken equal to unity, just as in (1).
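Criterion (2) can also be prototyped with the same convex solver as (1). The following sketch is ours, not the authors': cvxpy is an assumed dependency, and X, y, t are hypothetical arrays holding the training time series.

```python
# A sketch of the dynamic criterion (2): one (a_j, b_j) per time moment, with
# penalties on adjacent differences. cvxpy is assumed; X, y, t are the training
# time series (features, class indices, time marks); names are hypothetical.
import numpy as np
import cvxpy as cp

def train_interpolation(X, y, t, C=10.0, D_a=1.0, D_b=1.0):
    N, n = X.shape
    A = cp.Variable((N, n))          # direction vectors a_j, one row per moment
    b = cp.Variable(N)               # thresholds b_j
    d = cp.Variable(N, nonneg=True)  # slacks delta_j
    dt = np.diff(t)                  # time steps t_j - t_{j-1}

    cost = cp.sum_squares(A) + C * cp.sum(d)   # classical part of (2)
    cost += cp.sum(cp.multiply(cp.sum(cp.square(A[1:] - A[:-1]), axis=1), D_a / dt))
    cost += cp.sum(cp.multiply(cp.square(b[1:] - b[:-1]), D_b / dt))

    margins = cp.multiply(y, cp.sum(cp.multiply(A, X), axis=1) + b)
    cp.Problem(cp.Minimize(cost), [margins >= 1 - d]).solve()
    return A.value, b.value
```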
The training algorithm
Just as the classical training problem, the dynamic problem (2) is a quadratic programming problem, but it contains $N(n+1) + N$ variables in contrast to the $(n+1) + N$ variables in (1). It is known that the computational complexity of a quadratic programming problem of general kind is proportional to the cube of the number of variables, i.e. the dynamic problem appears, at first glance, to be essentially more complicated than the classical one. However, the goal function of the dynamic problem $J(\mathbf{a}_j, b_j, \delta_j,\ j = 1, \ldots, N)$ is pair-wise separable, i.e. representable as the sum of partial functions each of which depends only on the variables associated with one or two adjacent time moments. This circumstance makes it possible to build an algorithm which numerically solves the problem in time proportional to the length $N$ of the training time series.
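The pair-wise separability is easy to see in code: under the assumption that the slacks are eliminated at their optimal hinge values, the goal function of (2) splits into node terms (one time moment each) and edge terms (two adjacent moments each), so a single O(N) pass evaluates it. A sketch with hypothetical array names:

```python
# Evaluating the pair-wise separable goal function of (2) in one O(N) pass.
import numpy as np

def goal_value(A, b, X, y, t, C, D_a, D_b):
    margins = y * (np.sum(A * X, axis=1) + b)          # y_j (a_j^T x_j + b_j)
    slack = np.maximum(0.0, 1.0 - margins)             # optimal delta_j
    node_terms = np.sum(A * A, axis=1) + C * slack     # depend on one moment each
    dt = np.diff(t)
    edge_terms = (D_a * np.sum((A[1:] - A[:-1]) ** 2, axis=1)
                  + D_b * (b[1:] - b[:-1]) ** 2) / dt  # depend on adjacent pairs
    return node_terms.sum() + edge_terms.sum()
```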
The application of the Kuhn–Tucker theorem to the dynamic problem (2) turns it into the dual form with respect to the Lagrange multipliers $\lambda_j \geq 0$ at the inequality constraints $y_j(\mathbf{a}_j^T \mathbf{x}_j + b_j) \geq 1 - \delta_j$:

$$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{l=1}^{N} y_j y_l \bigl(\mathbf{x}_j^T \mathbf{Q}_{jl} \mathbf{x}_l + f_{jl}\bigr) \lambda_j \lambda_l \to \max,$$
$$\sum_{j=1}^{N} y_j \lambda_j = 0, \quad 0 \leq \lambda_j \leq C/2, \quad j = 1, \ldots, N. \tag{3}$$
The matrices $\mathbf{Q}_{jl}$ $(n \times n)$ and $\mathbf{F} = (f_{jl})$ $(N \times N)$ do not depend on the training time series; they are determined only by the coefficients $D_{\mathbf{a}}$ and $D_b$ which penalize the unsmoothness of the sequence of hyperplane parameters, respectively the direction vectors and the thresholds, in (2).
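Since the paper does not spell out $\mathbf{Q}_{jl}$ and $f_{jl}$ explicitly, a generic way to solve (3) is to treat $H_{jl} = y_j y_l (\mathbf{x}_j^T \mathbf{Q}_{jl} \mathbf{x}_l + f_{jl})$ as a precomputed matrix and hand the resulting box- and equality-constrained QP to an off-the-shelf solver. A hedged sketch using SciPy:

```python
# A generic solve of dual (3); H[j, l] = y_j y_l (x_j^T Q_jl x_l + f_jl) is
# assumed precomputed by the caller. Not the paper's linear-time algorithm.
import numpy as np
from scipy.optimize import minimize

def solve_dual(H, y, C):
    N = len(y)
    fun = lambda lam: -(lam.sum() - 0.5 * lam @ H @ lam)  # negate to maximize W
    jac = lambda lam: -(np.ones(N) - H @ lam)             # gradient of -W
    res = minimize(fun, np.zeros(N), jac=jac, method="SLSQP",
                   bounds=[(0.0, C / 2.0)] * N,
                   constraints={"type": "eq", "fun": lambda lam: y @ lam})
    return res.x                                          # Lagrange multipliers
```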
Theorem. The solution of the training problem (2) is completely determined by the training time series and the values of the Lagrange multipliers $(\lambda_1, \ldots, \lambda_N)$ obtained as the solution of the dual problem (3):

$$\hat{\mathbf{a}}_j = \sum_{l:\, \lambda_l > 0} y_l \lambda_l \mathbf{Q}_{jl} \mathbf{x}_l, \qquad \hat{b}_j = b + \sum_{l:\, \lambda_l > 0} y_l \lambda_l f_{jl}, \tag{4}$$
$$b = \frac{b' + b''}{2}, \qquad b' = \max_{j:\, \lambda_j = C/2} \beta_j, \qquad b'' = \min_{j:\, 0 < \lambda_j < C/2} \beta_j,$$
$$\beta_j = y_j - \sum_{l:\, \lambda_l > 0} y_l \lambda_l \bigl(\mathbf{x}_j^T \mathbf{Q}_{jl} \mathbf{x}_l + f_{jl}\bigr). \tag{5}$$
It is seen from these formulas that the solution of the dynamic training problem depends only on those elements of the training time series $(\mathbf{x}_j, y_j)$ whose Lagrange multipliers have obtained positive values $\lambda_j > 0$. It is natural to call the feature vectors of the respective objects the support vectors. So, we have come to a generalization of the Support Vector Machine [1] which follows from the concept of the optimal discriminant hyperplane (1).
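Formulas (4) translate directly into a reconstruction routine once the dual solution is known. In the sketch below, Q[j][l] and F[j, l] stand for the matrices $\mathbf{Q}_{jl}$ and $f_{jl}$, and b_base for the common term $b$ of (5); all names are hypothetical, as the paper gives no implementation.

```python
# Recovering the hyperplane sequence from formulas (4), given the multipliers.
import numpy as np

def recover_hyperplanes(lam, X, y, Q, F, b_base, tol=1e-10):
    N, n = X.shape
    sv = np.flatnonzero(lam > tol)          # support vectors: lambda_l > 0
    A_hat = np.zeros((N, n))
    b_hat = np.full(N, b_base)
    for j in range(N):
        for l in sv:
            A_hat[j] += y[l] * lam[l] * (Q[j][l] @ X[l])
        b_hat[j] += F[j, sv] @ (y[sv] * lam[sv])
    return A_hat, b_hat
```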
The classical training problem is a particular case of problem (2) when the penalties on the time variation of the hyperplane parameters grow infinitely, $D_{\mathbf{a}} \to \infty$ and $D_b \to \infty$. In this case, we have $\mathbf{Q}_{jl} = \mathbf{I}$, $f_{jl} = 0$, and the dual problem (3) turns into the classical dual problem [1] which corresponds to the initial problem (1):

$$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{l=1}^{N} y_j y_l \bigl(\mathbf{x}_j^T \mathbf{x}_l\bigr) \lambda_j \lambda_l \to \max,$$
$$\sum_{j=1}^{N} y_j \lambda_j = 0, \quad 0 \leq \lambda_j \leq C/2, \quad j = 1, \ldots, N.$$
Formulas (4) and (5) then determine the training result in accordance with the classical support vector method, $\hat{\mathbf{a}}_1 = \ldots = \hat{\mathbf{a}}_N = \hat{\mathbf{a}}$ and $\hat{b}_1 = \ldots = \hat{b}_N = \hat{b}$:

$$\hat{\mathbf{a}} = \sum_{j:\, \lambda_j > 0} y_j \lambda_j \mathbf{x}_j,$$
$$\hat{b} = \frac{b' + b''}{2}, \qquad b' = \max_{j:\, \lambda_j = C/2} \beta_j, \qquad b'' = \min_{j:\, 0 < \lambda_j < C/2} \beta_j, \qquad \beta_j = y_j - \sum_{l:\, \lambda_l > 0} y_l \lambda_l\, \mathbf{x}_j^T \mathbf{x}_l.$$
Despite the fact that the dual problem (3) is not pair-wise separable, the pair-wise separability of the initial problem (2) makes it possible to compute the gradient of the goal function $W(\lambda_1, \ldots, \lambda_N)$ at each point $(\lambda_1, \ldots, \lambda_N)$, and then to find the optimal admissible maximization direction relative to the constraints, via an algorithm of linear computational complexity with respect to the length of the training time series. In particular, the standard steepest descent method of solving quadratic programming problems [3], being applied to the function $W(\lambda_1, \ldots, \lambda_N)$, yields a generalization of the known SMO (Sequential Minimal Optimization) algorithm [4] which is typically used when solving dual problems in SVM.
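To make the SMO connection concrete, one inner step of an SMO-style method for dual (3) updates a pair of multipliers $(\lambda_p, \lambda_q)$ analytically while preserving $\sum_j y_j \lambda_j = 0$ and the box $0 \leq \lambda_j \leq C/2$. The sketch below follows Platt's classical update rules [4], not the paper's own derivation; K[j, l] stands for $\mathbf{x}_j^T \mathbf{Q}_{jl} \mathbf{x}_l + f_{jl}$ and is assumed symmetric and supplied by the caller.

```python
# One hedged SMO-style pair update for dual (3), after Platt [4].
import numpy as np

def smo_pair_step(lam, p, q, K, y, C, b=0.0):
    Cb = C / 2.0                                    # box bound from (3)
    f = (lam * y) @ K + b                           # current decision values
    E = f - y                                       # prediction errors
    eta = K[p, p] + K[q, q] - 2.0 * K[p, q]         # curvature along the pair
    if eta <= 0.0:
        return lam                                  # skip degenerate pairs
    if y[p] != y[q]:                                # feasible segment for lam_q
        lo, hi = max(0.0, lam[q] - lam[p]), min(Cb, Cb + lam[q] - lam[p])
    else:
        lo, hi = max(0.0, lam[p] + lam[q] - Cb), min(Cb, lam[p] + lam[q])
    new_q = np.clip(lam[q] + y[q] * (E[p] - E[q]) / eta, lo, hi)
    lam = lam.copy()
    lam[p] += y[p] * y[q] * (lam[q] - new_q)        # keep sum(y * lam) = 0
    lam[q] = new_q
    return lam
```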
References
1. Vapnik V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
2. Wiener N. Extrapolation, Interpolation, and Smoothing of Stationary Time Series with Engineering Applications. Technology Press of MIT and John Wiley & Sons, 1949, 163 p.
3. Bazaraa M.S., Sherali H.D., Shetty C.M. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, 1993.
4. Platt J.C. Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.