Pria 2007



A DYNAMIC FORMULATION OF THE PATTERN RECOGNITION PROBLEM

M.B. Shavlovsky (1), O.V. Krasotkina (2), V.V. Mottl (3)

(1) Moscow Institute of Physics and Technology, Institutsky Pereulok, 9, Dolgoprudny, Moscow Region, 141700
(2) Tula State University, Lenin Ave., 92, Tula, 300600
(3) Computing Center of the Russian Academy of Sciences, Vavilov St., 40, Moscow, 119333

The classical learning problem of pattern recognition in a finite-dimensional linear space of real-valued features is studied under the conditions of a non-stationary universe. The simplest statement of this problem with two classes of objects is considered under the assumption that the instantaneous property of the universe is completely expressed by a discriminant hyperplane whose parameters change sufficiently slowly in time. In this case, any object has to be considered along with the time marker which specifies when the object was selected from the universe, and the training set becomes, in effect, a time series. The training criterion of non-stationary pattern recognition is formulated as a generalization of the classical Support Vector Machine. The respective numerical algorithm has computational complexity proportional to the length of the training time series.

Introduction

The aim of this study is to create the basic mathematical framework and the simplest algorithms for solving typical practical problems of pattern recognition learning in universes whose properties change in time. The commonly known classical statement of the pattern recognition problem is based on the tacit assumption that the properties of the universe at the moment of decision making remain the same as when the training set was formed. The more realistic assumption that the universe is non-stationary, which is accepted in this work, inevitably leads to the necessity to analyze a sequence of samples at a number of time moments and to find different recognition rules for them.
This work is supported by grants of the Russian Foundation for Basic Research No. 05-01-00679 and 06-01-00412.

The classical stationary pattern recognition problem with two classes of objects: the Support Vector Machine

Let each object $\omega$ of the universe $\Omega$ be represented by a point in the linear space of features $\mathbf{x}(\omega) = \big(x^{(1)}(\omega), \ldots, x^{(n)}(\omega)\big) \in \mathbb{R}^n$, and let its hidden membership in one of two classes be specified by the value of the class index $y(\omega) \in \{1, -1\}$. The classical approach to the training problem developed by V. Vapnik [1] treats the model of the universe as a discriminant function defined by a hyperplane with an a priori unknown direction vector and threshold: $f(\mathbf{x}(\omega)) = \mathbf{a}^T \mathbf{x}(\omega) + b$ is primarily $> 0$ if $y(\omega) = 1$, and $< 0$ if $y(\omega) = -1$. The unknown parameters of the hyperplane are to be estimated by analyzing a training set of objects $\{\omega_j,\ j = 1, \ldots, N\}$ represented by their
feature vectors and class-membership indices, so that the training set as a whole is a finite set of pairs $\{(\mathbf{x}_j \in \mathbb{R}^n,\ y_j \in \mathbb{R}),\ j = 1, \ldots, N\}$. The commonly adopted training principle is that of the optimal discriminant hyperplane, chosen by the criterion of maximizing the number of points which are classified correctly with a guaranteed margin conventionally taken to be equal to unity:

$$J(\mathbf{a}, b, \delta_j,\ j = 1, \ldots, N) = \mathbf{a}^T \mathbf{a} + C \sum_{j=1}^{N} \delta_j \to \min, \qquad y_j(\mathbf{a}^T \mathbf{x}_j + b) \ge 1 - \delta_j,\ \ \delta_j \ge 0,\ \ j = 1, \ldots, N. \qquad (1)$$

The notion of time is completely absent here.

The mathematical model of a non-stationary universe and the main kinds of training problems

The principal novelty of the concept of the non-stationary universe is the involvement of the time factor $t$. It is assumed that the main property of the non-stationary universe is completely expressed by a time-varying discriminant hyperplane which, in turn, is completely determined by the direction vector and the threshold, both being functions of time: $f_t(\mathbf{x}(\omega)) = \mathbf{a}_t^T \mathbf{x}(\omega) + b_t$ is primarily $> 0$ if $y(\omega) = 1$, and $< 0$ if $y(\omega) = -1$.

Any object $\omega$ is always to be considered along with the time mark of its appearance, $(\omega, t)$. As a result, the training set gains the structure of a set of triples instead of pairs: $\{(\mathbf{x}_j \in \mathbb{R}^n,\ y_j \in \mathbb{R},\ t_j),\ j = 1, \ldots, N\}$. If we order the objects as they appear, it is more appropriate to speak of a training sequence than of a training set, and to consider it as a time series with, in the general case, varying time steps.

The hidden discriminant hyperplane has different values of the direction vector and threshold at different time moments $t_j$. So, there exists a two-component time series with one hidden and one observable component, respectively $(\mathbf{a}_j, b_j)$ and $(\mathbf{x}_j, y_j)$.

The dynamic formulation turns the training problem into that of two-component time series analysis, in which it is required to estimate the hidden component from the observable one.
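The structure of such a training time series can be illustrated by a small simulation. The following Python sketch is not part of the paper: the function name `make_drifting_series`, the random-walk drift model, the Gaussian features and all default parameter values are assumptions chosen only to produce triples $(\mathbf{x}_j, y_j, t_j)$ together with a slowly changing hidden component $(\mathbf{a}_j, b_j)$.

```python
import numpy as np

def make_drifting_series(n_obs=200, n_features=2, drift=0.02, seed=0):
    """Simulate a training time series (x_j, y_j, t_j) with a hidden,
    slowly drifting hyperplane (a_j, b_j). Illustrative assumption only."""
    rng = np.random.default_rng(seed)
    # Strictly increasing time marks with varying time steps.
    t = np.cumsum(rng.uniform(0.5, 1.5, size=n_obs))
    # Time steps t_j - t_{j-1}; the first entry is 0 (no drift before t_1).
    dt = np.diff(t, prepend=t[0])
    # Hidden component: a random walk whose increments shrink with `drift`.
    scale = drift * np.sqrt(dt)
    a0 = np.zeros(n_features)
    a0[0] = 1.0
    a = a0 + np.cumsum(rng.normal(size=(n_obs, n_features)) * scale[:, None], axis=0)
    b = np.cumsum(rng.normal(size=n_obs) * scale)
    # Observable component: feature vectors and class indices in {1, -1}.
    x = rng.normal(size=(n_obs, n_features))
    score = np.einsum("ij,ij->i", a, x) + b
    y = np.where(score >= 0, 1, -1)
    return x, y, t, a, b

if __name__ == "__main__":
    x, y, t, a, b = make_drifting_series()
    print(x.shape, y[:5], t[:3])
```

In a real application only `x`, `y` and `t` would be available; `a` and `b` are returned here solely so that estimates of the hidden component can be compared with the simulated truth.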
This is a standard signal (time series) analysis problem whose specificity boils down to the assumed model of the relationship between the hidden and the observable component. In accordance with the classification introduced by N. Wiener [2], it is natural to distinguish between at least two kinds of training problems as problems of estimating the hidden component.

The problem of filtration of the training time series. Let a new object appear at the time moment $t_j$, when the feature vectors and class-membership indices of the previous objects are already registered, $\ldots, (\mathbf{x}_{j-1}, y_{j-1}), (\mathbf{x}_j, y_j)$, up to and including the current moment, $(\ldots, t_{j-1}, t_j)$. It is required to recurrently estimate the parameters of the discriminant hyperplane $(\hat{\mathbf{a}}_j, \hat{b}_j)$ at each time moment $t_j$ immediately in the process of observation.

The problem of interpolation. Let the training time series be completely registered in some time interval, $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, before its processing starts. It is required to estimate the time-varying parameters of the discriminant hyperplane in the entire observation interval, $\{(\hat{\mathbf{a}}_1, \hat{b}_1), \ldots, (\hat{\mathbf{a}}_N, \hat{b}_N)\}$.

It is assumed that the parameters of the discriminant hyperplane $\mathbf{a}_t$ and $b_t$ change slowly in the sense that the values

$$\frac{(\mathbf{a}_j - \mathbf{a}_{j-1})^T (\mathbf{a}_j - \mathbf{a}_{j-1})}{t_j - t_{j-1}} \quad \text{and} \quad \frac{(b_j - b_{j-1})^2}{t_j - t_{j-1}}$$

are, as a rule, sufficiently small. This assumption prevents the filtration and interpolation problems from degenerating into a collection of independent ill-posed two-class training problems, each with a single observation.

From the formal point of view, the interpolation-based estimate of the discriminant hyperplane parameters $(\hat{\mathbf{a}}_N, \hat{b}_N)$ obtained at the last point of the observation interval is just the solution of the filtration problem at this time moment. However, the essence of the filtration problem is the requirement of evaluating the estimates in the on-line mode, immediately as
the observations are coming one after another, without solving the interpolation problem anew each time for a time series of increasing length.

The training criterion in the interpolation mode

We consider here only the interpolation problem. The proposed formulation of this problem differs from a collection of classical SVM-based criteria (1) for consecutive time moments only by the presence of additional terms which penalize the difference between adjacent values of the hyperplane parameters $(\mathbf{a}_{j-1}, b_{j-1})$ and $(\mathbf{a}_j, b_j)$:

$$J(\mathbf{a}_j, b_j, \delta_j,\ j = 1, \ldots, N) = \frac{1}{N} \sum_{j=1}^{N} \mathbf{a}_j^T \mathbf{a}_j + D_{\mathbf{a}} \sum_{j=2}^{N} \frac{(\mathbf{a}_j - \mathbf{a}_{j-1})^T (\mathbf{a}_j - \mathbf{a}_{j-1})}{t_j - t_{j-1}} + D_b \sum_{j=2}^{N} \frac{(b_j - b_{j-1})^2}{t_j - t_{j-1}} + C \sum_{j=1}^{N} \delta_j \to \min,$$
$$y_j(\mathbf{a}_j^T \mathbf{x}_j + b_j) \ge 1 - \delta_j,\ \ \delta_j \ge 0,\ \ j = 1, \ldots, N. \qquad (2)$$

The coefficients $D_{\mathbf{a}} > 0$ and $D_b > 0$ are hyper-parameters which preset the desired level of smoothing of the instantaneous parameters of the discriminant hyperplane.

The criterion (2) implements the concept of the optimal sufficiently smooth sequence of discriminant hyperplanes, in contrast to the concept of a single optimal hyperplane in (1). The sought-for hyperplanes have to provide the correct classification of the feature vectors for as many time moments as possible with a guaranteed margin taken equal to unity, just as in (1).

The training algorithm

Just like the classical training problem, the dynamic problem (2) is a quadratic programming problem, but it contains $N(n + 1) + N$ variables in contrast to $(n + 1) + N$ variables in (1). It is known that the computational complexity of a quadratic programming problem of the general kind is proportional to the cube of the number of variables, i.e. the dynamic problem appears, at first glance, to be essentially more complicated than the classical one. However, the goal function of the dynamic problem $J(\mathbf{a}_j, b_j, \delta_j,\ j = 1, \ldots, N)$ is pair-wise separable, i.e.
is representable as the sum of partial functions each of which depends only on one or two variables associated with one or two adjacent time moments. This circumstance makes it possible to build an algorithm which numerically solves the problem in time proportional to the length of the training time series $N$.

The application of the Kuhn-Tucker theorem to the dynamic problem (2) turns it into the dual form with respect to the Lagrange multipliers $\lambda_j \ge 0$ at the inequality constraints $y_j(\mathbf{a}_j^T \mathbf{x}_j + b_j) \ge 1 - \delta_j$:

$$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{l=1}^{N} y_j y_l \left( \mathbf{x}_j^T \mathbf{Q}_{jl} \mathbf{x}_l + f_{jl} \right) \lambda_j \lambda_l \to \max,$$
$$\sum_{j=1}^{N} y_j \lambda_j = 0,\ \ 0 \le \lambda_j \le C/2,\ \ j = 1, \ldots, N. \qquad (3)$$

The matrices $\mathbf{Q}_{jl}$ $(n \times n)$ and $\mathbf{F} = (f_{jl})$ $(N \times N)$ do not depend on the training time series and are determined only by the coefficients $D_{\mathbf{a}}$ and $D_b$, which penalize the unsmoothness of the sequence of hyperplane parameters in (2), respectively the direction vectors and the thresholds.

Theorem. The solution of the training problem (2) is completely determined by the training time series and the values of the Lagrange multipliers $(\lambda_1, \ldots, \lambda_N)$ obtained as the solution of the dual problem (3):

$$\hat{\mathbf{a}}_j = \sum_{l:\, \lambda_l > 0} y_l \lambda_l \mathbf{Q}_{jl} \mathbf{x}_l, \qquad \hat{b}_j = b + \sum_{l:\, \lambda_l > 0} y_l \lambda_l f_{jl}, \qquad (4)$$

$$b = -\frac{b' + b''}{\sum_{j:\, 0 < \lambda_j < C/2} \lambda_j}, \qquad b' = \frac{C}{2} \sum_{j:\, \lambda_j = C/2} y_j, \qquad b'' = \sum_{j:\, 0 < \lambda_j < C/2} \lambda_j \sum_{l:\, \lambda_l > 0} y_l \lambda_l \left( \mathbf{x}_l^T \mathbf{Q}_{jl} \mathbf{x}_j + f_{jl} \right). \qquad (5)$$

It is seen from these formulas that the solution of the dynamic training problem depends only on those elements of the training time series $(\mathbf{x}_j, y_j)$ whose Lagrange multipliers have obtained positive values $\lambda_j > 0$. It is natural to call the feature vectors of the respective objects the support vectors. So, we have come to a generalization of the Support Vector Machine [1] which follows from the concept of the optimal discriminant hyperplane (1).
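The pair-wise separable structure of the goal function in (2) can be made concrete by evaluating it as a sum of terms attached to single time moments and terms attached to pairs of adjacent time moments. The Python sketch below is only an illustration, not the authors' algorithm: the function name `dynamic_criterion` is hypothetical, the slack variables are set to their smallest feasible values $\delta_j = \max(0,\ 1 - y_j(\mathbf{a}_j^T \mathbf{x}_j + b_j))$, and the squared-norm term is assumed to be averaged over $N$, so that a constant sequence of hyperplanes reproduces the classical criterion (1).

```python
import numpy as np

def dynamic_criterion(a, b, x, y, t, C, D_a, D_b):
    """Evaluate the goal function of the dynamic problem for a given
    sequence of hyperplanes (a_j, b_j), using the smallest feasible slacks.
    Assumes the squared-norm term is averaged over N (illustrative choice)."""
    a = np.asarray(a, dtype=float)   # shape (N, n)
    b = np.asarray(b, dtype=float)   # shape (N,)
    x = np.asarray(x, dtype=float)   # shape (N, n)
    y = np.asarray(y, dtype=float)   # shape (N,), values in {1, -1}
    t = np.asarray(t, dtype=float)   # shape (N,), strictly increasing
    n_obs = len(y)

    # Terms depending on one time moment: norm of a_j and slack delta_j.
    margins = y * (np.einsum("ij,ij->i", a, x) + b)
    slacks = np.maximum(0.0, 1.0 - margins)
    node_terms = np.einsum("ij,ij->i", a, a) / n_obs + C * slacks

    # Terms depending on two adjacent time moments: smoothness penalties.
    dt = np.diff(t)
    da = np.diff(a, axis=0)
    db = np.diff(b)
    edge_terms = (D_a * np.einsum("ij,ij->i", da, da) + D_b * db ** 2) / dt

    return node_terms.sum() + edge_terms.sum()

if __name__ == "__main__":
    x = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
    y = [1, -1, 1]
    t = [0.0, 1.0, 3.0]
    a_const = [[1.0, 0.0]] * 3
    print(dynamic_criterion(a_const, [0.0] * 3, x, y, t, C=2.0, D_a=4.0, D_b=2.0))
```

Because every term involves at most two adjacent time moments, the cost of one evaluation grows linearly with $N$, which is the property the linear-time algorithm relies on.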
The classical training problem is a particular case of problem (2) when the penalties on the time variation of the hyperplane parameters grow infinitely, $D_{\mathbf{a}} \to \infty$ and $D_b \to \infty$. In this case, we have $\mathbf{Q}_{jl} = \mathbf{I}$, $f_{jl} = 0$, and the dual problem (3) turns into the classical dual problem [1] which corresponds to the initial problem (1):

$$W(\lambda_1, \ldots, \lambda_N) = \sum_{j=1}^{N} \lambda_j - \frac{1}{2} \sum_{j=1}^{N} \sum_{l=1}^{N} y_j y_l\, \mathbf{x}_j^T \mathbf{x}_l\, \lambda_j \lambda_l \to \max,$$
$$\sum_{j=1}^{N} y_j \lambda_j = 0,\ \ 0 \le \lambda_j \le C/2,\ \ j = 1, \ldots, N.$$

The formulas (4) and (5) then determine the training result in accordance with the classical support vector method, $\hat{\mathbf{a}}_1 = \ldots = \hat{\mathbf{a}}_N = \hat{\mathbf{a}}$ and $\hat{b}_1 = \ldots = \hat{b}_N = \hat{b}$:

$$\hat{\mathbf{a}} = \sum_{j:\, \lambda_j > 0} y_j \lambda_j \mathbf{x}_j,$$

$$\hat{b} = -\frac{b' + b''}{\sum_{j:\, 0 < \lambda_j < C/2} \lambda_j}, \qquad b' = \frac{C}{2} \sum_{j:\, \lambda_j = C/2} y_j, \qquad b'' = \sum_{j:\, 0 < \lambda_j < C/2} \lambda_j \sum_{l:\, \lambda_l > 0} y_l \lambda_l\, \mathbf{x}_l^T \mathbf{x}_j.$$

Despite the fact that the dual problem (3) is not pair-wise separable, the pair-wise separability of the initial problem (2) makes it possible to compute the gradient of the goal function $W(\lambda_1, \ldots, \lambda_N)$ at each point $(\lambda_1, \ldots, \lambda_N)$ and then to find the optimal admissible maximization direction relative to the constraints via an algorithm of linear computational complexity with respect to the length of the training time series. In particular, the standard steepest descent method of solving quadratic programming problems [3], being applied to the function $W(\lambda_1, \ldots, \lambda_N)$, yields a generalization of the known SMO algorithm (Sequential Minimal Optimization) [4], which is typically used for solving dual problems in SVM.

References

1. Vapnik V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
2. Wiener N. Extrapolation, Interpolation, and Smoothing of Stationary Random Time Series with Engineering Applications. Technology Press of MIT, John Wiley & Sons, 1949, 163 p.
3. Bazaraa M.S., Sherali H.D., Shetty C.M. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, 1993.
4. Platt J.C. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.