Inference from aging information



Published in: Technology, Education

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 6, JUNE 2010

Brief Papers

Inference From Aging Information

Evaldo Araújo de Oliveira and Nestor Caticha

Abstract—For many learning tasks the duration of the data collection can be greater than the time scale for changes of the underlying data distribution. The question we ask is how to include the information that data are aging. Ad hoc methods to achieve this include the use of validity windows that prevent the learning machine from making inferences based on old data. This introduces the problem of how to define the size of validity windows. In this brief, a new adaptive Bayesian inspired algorithm is presented for learning drifting concepts. It uses the analogy of validity windows in an adaptive Bayesian way to incorporate changes in the data distribution over time. We apply a theoretical approach based on information geometry to the classification problem and measure its performance in simulations. The uncertainty about the appropriate size of the memory windows is dealt with in a Bayesian manner by integrating over the distribution of the adaptive window size. Thus, the posterior distribution of the weights may develop algebraic tails. The learning algorithm results from tracking the mean and variance of the posterior distribution of the weights. It was found that the algebraic tails of this posterior distribution give the learning algorithm the ability to cope with an evolving environment by permitting the escape from local traps.

Index Terms—Online Bayesian algorithms, pattern classification, time-varying environment.

I. INTRODUCTION

Probability theory gives us the tools to deal with assertions in the absence of complete information. A central problem is how to update probability assignments once new information is obtained. The basic, intrinsically related methods of probability update are the Bayesian update, based on Bayes theorem, and the maximum entropy method [1]. There are many advantages in using them; an obvious one is the performance, but perhaps the main one is the unifying conceptual framework that these methods lend to the problem of inference. The fact that sometimes they are not easily implementable has led to a set of approximate techniques, which sometimes resemble particular methods of statistics derived by other means. While the need for prior distributions might be seen as a problem by some practitioners, it is a major advantage to others. The uneasy feeling felt by the former while codifying prior knowledge into probability distributions is a measure of the amount of research needed to include the different forms in which prior information might be presented in a natural form. The advantages of having useful prior probabilities have fueled the search for better methods to determine prior distributions. Far from complete, these methods do not exhaust the realm of possibilities which the form of new information may represent.

We here consider the scenario of online learning, where information arrives sequentially and learning is incremental. The central point behind Bayesian learning is that the posterior probability distribution of old data is used as the prior for the inclusion of arriving new data. In this brief, we look at the case where part of the previous knowledge is that the data can age. This brings in the problem of time. Even in the case where no new information has arrived, as time goes by, the old posterior distribution may no longer reflect faithfully what can be used as our prior knowledge. While Bayes theorem is still the way to incorporate a new datum, the knowledge that old information may not be as useful as it once was is additional prior information and thus liable to change the usual posterior as the new prior.

The question we ask is how these ideas can help in constructing a classification learning algorithm in time-dependent environments. Of course this topic has been addressed before; see, e.g., [2]–[7]. We are not after a learning model that will substitute these approaches in a practical situation. Our aim is to investigate these questions in a theoretical setup in order to be able to point out characteristics that should be present in successful algorithms and identify them from a solid theoretical perspective. We then apply the theoretical method in a simple model to construct a learning algorithm whose properties can be readily identified.

In Section II, we present the essentials of the learning method and the idea of how to modify the prior distribution in order to take into account the nonstationary behavior of the rule. Next, and in order to obtain quantitative theoretical results, we restrict to a particular model. The simplest possible is that of a single-layer perceptron. We examine the resulting learning algorithm in a perceptron for different kinds of source aging such as drifting or sudden changes (Section III). Discussions and concluding remarks appear in Section IV.

II. LEARNING SCENARIO

Consider a supervised classification problem and a learning process defined on a model parametrized by $\omega \in \mathbb{R}^N$. Information or input–output pairs, which arrive in a sequential order, are denoted by $y_\mu = (\xi_\mu, \sigma_\mu)$, composed of an input $\xi_\mu \in \mathbb{R}^N$ and a supervised classification label $\sigma_\mu \in \mathbb{R}^M$ or $\sigma_\mu \in \{-1, +1\}$. We make the simplifying assumption that each example $y_\mu$ is used only once and then discarded.

We will study this problem within a scenario where an underlying unknown rule parametrized by $\omega$ furnishes the example classification label $T_\omega : \xi \to \sigma$. The learning machine will classify the input vector according to $S_{\hat\omega} : \xi \to \sigma$, where $\hat\omega$ represents the internal state in the current state of knowledge.

The arrival of a new information datum $y_{\mu+1}$ elicits a change in the state that can be written as

$$\hat\omega_{\mu+1} = \hat\omega_\mu + \eta\, F(y_{\mu+1}; \hat\omega_\mu) \tag{1}$$

where $\eta$, a learning rate, sets the scale of changes determined by a particular algorithm coded by $F$, which will be referred to as the modulation function.

Manuscript received May 11, 2009; revised March 13, 2010; accepted March 15, 2010. Date of publication April 22, 2010; date of current version June 03, 2010. The work of E. A. de Oliveira was supported by the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP).
E. A. de Oliveira is with the Institute of Mathematics and Statistics, University of São Paulo, São Paulo 05508-090, Brazil.
N. Caticha is with the Institute of Physics, University of São Paulo, São Paulo 05508-090, Brazil.
Digital Object Identifier 10.1109/TNN.2010.2046422
1045-9227/$26.00 © 2010 IEEE
Authorized licensed use limited to: Asha Das. Downloaded on June 17,2010 at 06:08:27 UTC from IEEE Xplore. Restrictions apply.

A figure of merit for $F$ has to be introduced and the rest of this brief deals with the choice in the scenario of aging information. Different choices may reflect the modeler's particular taste and this great
variety imposes dealing with the question of what principles should be the guides in its design.

We here present the design of adaptive algorithms that track evolving rules in a Bayesian inspired way. We follow Opper's approach for static rules [8]. The main idea is to update a probabilistic model in a Bayesian way. Any predictive use of the full learning model will include either an optimization problem, e.g., the maximum posterior, or a multidimensional integration, e.g., Bayes posterior average. These are manifestly computationally expensive and approximations are needed in order to yield a manageable algorithm.

Consider the data set $D_\mu$ which consists of all examples $\{y_\nu = (\xi_\nu, \sigma_\nu)\}_{\nu=1,\ldots,\mu}$ that arrived up to time step $\mu$. Let any prior knowledge be codified by the distribution $P(\omega)$, the prior distribution of $\omega$, and call $P(D_\mu \mid \omega)$ the likelihood of observing the data, constructed from the knowledge of the model and the noise process. Using Bayes theorem, the posterior distribution

$$P(\omega \mid D_\mu) = \frac{P(\omega)\, P(D_\mu \mid \omega)}{\int d^N\omega'\, P(\omega')\, P(D_\mu \mid \omega')} \tag{2}$$

can be constructed in the usual way. An assumption in Bayesian learning is that, when a new example $y_{\mu+1}$ arrives, it makes sense to use the old posterior $P(\omega \mid D_\mu)$ as the new prior. Once we accept this, it follows that

$$P(\omega \mid y_{\mu+1}, D_\mu) = \frac{P(\omega \mid D_\mu)\, P(y_{\mu+1} \mid \omega, D_\mu)}{\int d^N\omega'\, P(\omega' \mid D_\mu)\, P(y_{\mu+1} \mid \omega', D_\mu)}. \tag{3}$$

For statistically independent examples, $P(y_{\mu+1} \mid \omega, D_\mu) = P(y_{\mu+1} \mid \omega)$. The extended data set, after the information arrival, is $D_{\mu+1} = D_\mu \cup y_{\mu+1}$. Then, the probability update is given by

$$P(\omega \mid D_{\mu+1}) = \frac{P(\omega \mid D_\mu)\, P(y_{\mu+1} \mid \omega)}{\int d^N\omega'\, P(\omega' \mid D_\mu)\, P(y_{\mu+1} \mid \omega')}. \tag{4}$$

Incorporate—the last posterior $P_G(\omega \mid D_\mu)$ is updated according to

$$P_S(\omega \mid D_{\mu+1}) = \frac{P_G(\omega \mid D_\mu)\, P(y_{\mu+1} \mid \omega)}{\int d^N\omega'\, P_G(\omega' \mid D_\mu)\, P(y_{\mu+1} \mid \omega')}. \tag{5}$$

Project—the posterior $P_S(\omega \mid D_{\mu+1})$ is projected to the parametric space $G$

$$P_S(\omega \mid D_{\mu+1}) \longrightarrow P_G(\omega \mid D_{\mu+1}) \tag{6}$$

by minimizing the Kullback-Leibler divergence

$$D_{KL}(S_{\mu+1}, G_{\mu+1}) = \int d^N\omega\, P_S(\omega \mid D_{\mu+1}) \ln \frac{P_S(\omega \mid D_{\mu+1})}{P_G(\omega \mid D_{\mu+1})}. \tag{7}$$

Of course the two steps can be seen as just a change from a given prior to a new one from the same family, and thus, the only bookkeeping that is necessary is to calculate the change in the parameters that define them. A convenient choice of $G$ will be dictated by the particular problem under consideration. For example, Solla and Winther [9] have looked at cases where the weights are restricted to a discrete set of values and in so doing have been able to introduce an efficient online strategy even when learning a problem with two state weights. In this brief, we restrict ourselves to the case where $G$ is the Gaussian family.

At this point the learning algorithm has been defined just by the changes in the average and covariance of the distributions [8], leading to the Bayesian online algorithm (BOnA) [10]

$$\hat\omega_{\mu+1} = \hat\omega_\mu - C_\mu \cdot \nabla_{\hat\omega} E_\mu \tag{8}$$

$$C_{\mu+1} = C_\mu - C_\mu \cdot \nabla_{\hat\omega} \nabla_{\hat\omega} E_\mu \cdot C_\mu \tag{9}$$

with $\hat\omega$ being the average and $C$ being the covariance matrix. This is just a type of gradient-descent learning with the energy function $E_\mu \equiv -\ln\langle P(y_{\mu+1} \mid u + \hat\omega_\mu)\rangle$, where $\langle\ldots\rangle$ means that the average is over the Gaussian variable $u \sim N(0, C_\mu)$ and $\nabla_{\hat\omega}$ is the gradient vector with respect to $\hat\omega_\mu$. No information about whether the rule is stationary has been included in constructing this algorithm. But once we receive information that we are dealing with nonstationary rules, we can modify it in a Bayesian suggested manner.

The learning rate $\eta$ in (1) is a particularly important parameter in nonstationary environments. If the dynamics are expected to converge, then $\eta$ should follow an annealing schedule, decreasing typically during the learning process with some inverse power of the data set size. Since this is not useful in changing environments, many authors have suggested mechanisms to update $\eta$ or to define a trustful time window [3], [11]–[13]. Different heuristic arguments have led to the construction of such algorithms. We now introduce Bayesian ideas to help in such a construction. We first discuss this from a heuristic energy-like point of view, trying to see how aging influences the prior distribution. Then, we present this from a Bayesian perspective. This is quite natural from the perspective of statistical mechanics, where energy plays a leading role, since this was the first fully Bayesian theory. The relevant piece of information is that the probability distribution is related to the Boltzmann factor, the exponential of (minus) the energy.

If learning algorithms are defined by energy or cost functions, we can think of the change in energy of the data set when a new data example $y_{\mu+1}$ arrives. The energy $E_\mu$ of the old examples is related to the prior distribution since it describes the current state of the system. The influence of old examples should decrease, say by a factor $\lambda$. The new example should contribute a cost or energy term $V_{\mu+1}$, so $E_{\mu+1} = \lambda E_\mu + V_{\mu+1}$. The energy $V_{\mu+1}$ of the new example is related to the likelihood. The posterior update, due to aging, should thus be changed to something like $S_{\mu+1} \propto \{G_\mu\}^{\lambda} \exp\{-V_{\mu+1}\}$, for a prior $G_\mu$. It includes both forgetting, through the decrease of the energy term, and learning, through the $V$ term, about the new example.

But what value should $\lambda$ have? Should it be fixed? These questions are better answered if we use our favorite inference method. Thus, $\lambda \ge 0$, a parameter that influences the posterior, has itself a probability density whose parameters are updated as information arrives. The prior will now be a distribution $G_\mu(\omega, \lambda)$ of both $\omega$ and $\lambda$. The influence of $\lambda$ will be felt by setting the overall scale of change of the prior, or the forgetting part of learning. Suppose that we are dealing with a Gaussian family $G$. Then, forgetting is about changing the width of the prior distribution. In principle, we could define a change in the full covariance, an $N \times N$ matrix. This means that in general we should follow the update of $O(N^2/2)$ parameters. We will study a simplification that has been seen to be efficient for static rules [14]. It consists of looking at an approximation where the covariance is set to a multiple of the unit matrix, with only $O(N)$ parameters to be updated. We will refer to the resulting algorithms as scalar. There is only one relevant scale that describes forgetting.

There are many possible choices for the $\lambda$ distribution and we have not exhausted their investigation. The research about what turns out to be a useful prior distribution is a major area in Bayesian inference. A reasonable prior distribution family follows from imposing that $P(\lambda)$ is such that $\langle \ln\lambda \rangle$ and $\langle \lambda \rangle$ have finite values, ensuring that $\lambda \ge 0$. The reason we impose that $\langle \ln\lambda \rangle$ has a finite value is that if $\lambda$ had a large probability of being zero, it would lead to a total loss of the prior information, which seems undesirable. Our algorithm at the end will choose the effective value of $\lambda$; the imposition that its distribution is zero at the origin does not rule out small values of $\lambda$, which will occur if the machine starts making many errors, as will be the case when the rule suddenly changes. These constraints are reasonable, but we do not claim to have proved that they are unique.

Given these constraints, we determine the family of the prior using the method of maximum entropy. This method permits identifying the assumptions behind the specific form of the prior family. We maximize $S = -\int d\lambda\, P(\lambda) \ln P(\lambda)$ under those constraints. This leads to the
introduction of Lagrange parameters $a$ and $b$ and the resulting prior family is the gamma distribution: $P(\lambda) = \lambda^{a-1} e^{-\lambda/b} / (\Gamma(a)\, b^a)$. The reason for imposing these two constraints is to have control of small values of $\lambda$ through $a$, and of large values of $\lambda$ through $b$. The values of $a$ and $b$ will be determined adaptively by using the arriving information, as shown below.

The family $G$ of joint distributions we consider is

$$G(\omega, \lambda) = \frac{e^{-\lambda \|\omega - \hat\omega\|^2/2}}{(2\pi/\lambda)^{N/2}}\, \frac{\lambda^{a-1} e^{-\lambda/b}}{\Gamma(a)\, b^a}. \tag{10}$$

The algorithm is then obtained by using a member of $G$ as the prior. The inclusion of a new example drives the distribution to the posterior, say $S_{\mu+1}$, which does not in general belong to $G$. Again, the central question is: At the next step, what should be the new prior distribution? It is to be answered under the constraint that it belongs to $G$. The best prior, in the sense of minimal information loss, is obtained by minimizing $D_{KL}(S_{\mu+1}, G_{\mu+1})$. But since we are looking at a parametric family, we just need to look at the update of the parameters that specify a member of $G$. This leads to the new $\{\hat\omega, a, b\}$, obtained from the set of equations

$$\hat\omega^{\,i}_{\mu+1} = \frac{\langle \lambda\, \omega_i \rangle_S}{\langle \lambda \rangle_S} \tag{11}$$

$$\Psi(a_{\mu+1}) + \ln(b_{\mu+1}) = \langle \ln(\lambda) \rangle_S \tag{12}$$

$$a_{\mu+1}\, b_{\mu+1} = \langle \lambda \rangle_S \tag{13}$$

where the Digamma function is $\Psi(x) = d\{\ln\Gamma(x)\}/dx$.

To proceed further with the analysis and interpretation of the results we reduce the number of dynamical variables. Thus, we fix $a_\mu = a$ to a constant value for all $\mu$. Large or small values of $a$ dictate whether the probability of small values of $\lambda$ is low or high. The mean and variance of $G_{\mu+1}(\lambda)$¹ are $a_{\mu+1} b_{\mu+1}$ and $a_{\mu+1} b_{\mu+1}^2$, respectively. This can be used to obtain, from (11)–(13), that the learning algorithm is given by

$$\hat\omega_{\mu+1} = \hat\omega_\mu + \frac{1}{a\, b_{\mu+1}}\, \partial_{\hat\omega} \ln\langle P(y_{\mu+1} \mid \omega, \lambda) \rangle_G \tag{14}$$

$$b_{\mu+1} = b_\mu + \frac{b_\mu^2}{a}\, \partial_b \ln\langle P(y_{\mu+1} \mid \omega, \lambda) \rangle_G \tag{15}$$

which is also a gradient-descent algorithm, along the gradient of $\ln\langle P(y_{\mu+1} \mid \omega, \lambda) \rangle_G$. This is a general algorithm, fixed by the likelihood, which is itself determined by the choice of the model, the architecture, and eventual noise corruption.

¹We can write (10) as $G(\omega, \lambda) = G(\omega \mid \lambda)\, G(\lambda)$ where the prior is $G(\lambda) = \lambda^{a-1} e^{-\lambda/b} / (\Gamma(a)\, b^a)$.

We could use it now in the analysis of a real problem and compare it to other algorithms. We are, however, more interested in investigating the properties of the algorithm itself and understanding how it adapts to varying environments. We do this by looking at how it works in a simple case, rich enough to reveal the details, yet sufficiently simple that we can work out the mathematical details. In Section III, we apply the algorithm (14)–(15) to the simplest classifier, the perceptron with no hidden units.

III. AN APPLICATION: THE PERCEPTRON

The likelihood for the single-layer Boolean perceptron can be written as $P(y_{\mu+1} \mid \omega, \lambda) = F(\sigma_{\mu+1}\, \omega \cdot \xi_{\mu+1} / \sqrt{N})$. If there is no noise, then the probability is either one or zero, depending on whether the example was correctly classified, so $F(x) = \Theta(x)$, the step function, which is the case we now study [10]. It is extremely simple at this point to include different noise processes and study their effects. This would however lead us astray from our present purpose, which is to analyze the simple algorithm that will result in a case of nonstationary rules. It is quite interesting that this exercise will point to general characteristics which successful algorithms should have when learning from aging information. We also perform numerical simulations of the learning scenario. One can measure a merit function such as the generalization error $e_g$, given by the probability that the classification of an independent example vector differs from the correct label used in the learning. Defining $\Xi \equiv \langle P(y_{\mu+1} \mid \omega, \lambda) \rangle_G$ in order to simplify the notation, we have

$$\Xi = \int_0^\infty d\lambda \int d^N\omega\, \frac{e^{-\lambda \|\omega - \hat\omega\|^2/2}}{(2\pi/\lambda)^{N/2}}\, \frac{\lambda^{a-1} e^{-\lambda/b}}{\Gamma(a)\, b^a} \int_{-\infty}^{\infty} dx\, F(x)\, \delta\!\left(x - \frac{\sigma_{\mu+1}\, \omega \cdot \xi_{\mu+1}}{\sqrt{N}}\right)$$

$$= \frac{(2\pi)^{-1/2}}{\Gamma(a)\, b^a} \int_{-\infty}^{\infty} dx\, F(x) \int_0^\infty d\lambda\, \lambda^{a-1/2} \exp\{-\lambda [1/b + (x - \phi)^2/2]\}$$

$$= \frac{\Gamma(a + 1/2)}{\sqrt{2\pi}\, \Gamma(a)\, b^a} \int_{-\infty}^{\infty} dx\, F(x) \left[\frac{1}{b} + \frac{(x - \phi)^2}{2}\right]^{-a-1/2}$$

where $\xi \cdot \xi \to N$ and $\phi = \sigma_{\mu+1}\, \hat\omega \cdot \xi_{\mu+1} / \sqrt{N}$. Since $a \ge 1$ is fixed to an integer value, the last integral can be written as the $(a-1)$th derivative with respect to $1/b$

$$\Xi = \frac{\Gamma(a + 1/2)}{\sqrt{2\pi}\, \Gamma(a)\, b^a}\, \frac{(-2)^{a-1}}{(2a-1)!!}\, \frac{\partial^{a-1}}{\partial (b^{-1})^{a-1}} \int_{-\infty}^{\infty} dx\, F(x) \left[\frac{1}{b} + \frac{(x - \phi)^2}{2}\right]^{-3/2} \tag{16}$$

which for $F(x) = \Theta(x)$ can be easily calculated

$$\Xi = \frac{\Gamma(a + 1/2)}{\sqrt{\pi}\, \Gamma(a)\, b^a}\, \frac{(-2)^{a-1}}{(2a-1)!!} \tag{17}$$

$$\qquad \times\, \frac{\partial^{a-1}}{\partial (b^{-1})^{a-1}} \left\{ b \left[ 1 + \frac{\sqrt{b}\, \phi}{\sqrt{2 + b \phi^2}} \right] \right\}. \tag{18}$$

Very small values of $a$ should represent cases where very rapid forgetting of the data should occur, since the greatest contribution to the mass of the $\lambda$ density would be at the origin. Conversely, for large values of $a$, little forgetting occurs. We choose to study now $a = 2$, for two reasons: first, because it is easy to treat, and second, because it represents an intermediate value. It results in

$$\Xi = \frac{1}{2} \left[ 1 + \frac{\sqrt{b}\, \phi}{\sqrt{2 + b \phi^2}} \left( 1 + \frac{1}{2 + b \phi^2} \right) \right] \tag{19}$$

that with (14) and (15) gives the new Bayesian online adaptive scalar (BOAS) algorithm

$$\hat\omega_{\mu+1} = \hat\omega_\mu + \frac{F_\omega \sqrt{b_\mu}}{2\, b_{\mu+1}}\, \frac{\sigma_{\mu+1}\, \xi_{\mu+1}}{\sqrt{N}} \tag{20}$$

$$b_{\mu+1} = \left( 1 + \frac{1}{4} F_b \right) b_\mu \tag{21}$$

where $F_\omega = \Xi'/\Xi$; $F_b = \sqrt{b_\mu}\, \phi_\mu\, \Xi'/\Xi$; $\Xi' = 3 (2 + b_\mu \phi_\mu^2)^{-5/2}$ is the derivative of (19) with respect to the rescaled field $\sqrt{b_\mu}\, \phi_\mu$; and $\phi_\mu = \sigma_{\mu+1}\, \hat\omega_\mu \cdot \xi_{\mu+1} / \sqrt{N}$.

Note that $\phi_\mu$ is an interesting variable since the output of the machine for an example is the sign of $\hat\omega_\mu \cdot \xi_{\mu+1}$, and so $\phi_\mu$ is large in absolute value when the sign is clear. It is positive if the classification is correct and negative if not. Fig. 1 shows the form of the changes in the weights and in $b_{\mu+1}$ as a function of a rescaled $\phi_\mu$. We now apply this algorithm in different cases. In the first case, the rule suddenly changes to a new one, uncorrelated to the previous one. This tests if the effective annealing of the algorithm is able to cope with several changes or if it is only able to learn the first few cases while growing older for subsequent changes.
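The BOAS update (20)–(21) for the noiseless perceptron can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the initial values (zero weights, $b_0 = 1$), the Gaussian inputs, and the problem size are hypothetical choices made for illustration. The denominator of the weight update uses the already updated $b_{\mu+1}$, as read from (14) with $a = 2$. The generalization error is computed as $\arccos(\rho)/\pi$, the standard expression for perceptrons with isotropic inputs, which the text does not state explicitly. For negative fields, $\Xi$ is evaluated through an algebraically equivalent form of (19) that avoids floating-point cancellation.

```python
import numpy as np


def xi_and_slope(z):
    """Return (Xi, dXi/dz) of eq. (19), with z = sqrt(b) * phi the rescaled field."""
    s = 2.0 + z * z
    slope = 3.0 * s ** -2.5                      # Xi' = 3 (2 + b phi^2)^(-5/2)
    if z >= 0.0:
        xi = 0.5 * (1.0 + z * (3.0 + z * z) / s ** 1.5)
    else:
        # Same function, rewritten so that no nearly equal numbers are subtracted.
        xi = 0.5 * (8.0 + 3.0 * z * z) / (s ** 1.5 * (s ** 1.5 - z * (3.0 + z * z)))
    return xi, slope


def boas_step(w_hat, b, xi_vec, sigma):
    """One BOAS update for the example (xi_vec, sigma); returns the new (w_hat, b)."""
    n = w_hat.size
    phi = sigma * (w_hat @ xi_vec) / np.sqrt(n)  # field: positive if classified correctly
    z = np.sqrt(b) * phi
    xi, slope = xi_and_slope(z)
    f_w = slope / xi                             # F_omega = Xi'/Xi
    f_b = z * f_w                                # F_b = sqrt(b) * phi * Xi'/Xi
    b_new = (1.0 + 0.25 * f_b) * b               # eq. (21)
    step = f_w * np.sqrt(b) / (2.0 * b_new)      # prefactor of eq. (20)
    w_new = w_hat + step * sigma * xi_vec / np.sqrt(n)
    return w_new, b_new


def generalization_error(w, w_hat):
    """e_g = arccos(rho) / pi for isotropic inputs (assumed, standard result)."""
    rho = (w @ w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
    return np.arccos(np.clip(rho, -1.0, 1.0)) / np.pi


def run_static(n=50, n_examples=2000, b0=1.0, seed=0):
    """Learn a fixed teacher perceptron; all settings here are illustrative."""
    rng = np.random.default_rng(seed)
    teacher = rng.standard_normal(n)
    w_hat, b = np.zeros(n), b0
    for _ in range(n_examples):
        xi_vec = rng.standard_normal(n)
        sigma = 1.0 if teacher @ xi_vec >= 0.0 else -1.0
        w_hat, b = boas_step(w_hat, b, xi_vec, sigma)
    return generalization_error(teacher, w_hat), b


if __name__ == "__main__":
    e_g, b = run_static()
    print(f"generalization error: {e_g:.3f}, final b: {b:.3f}")
```

Replacing `teacher` by a fresh random vector every few thousand examples inside the loop would give a rough analogue of the sudden-change scenario; correct answers let `b` grow (a narrower prior), while errors shrink it and enlarge the next steps.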
Next we model change of the rule as a random walk and see how the system is able to track the drifting rule.

Fig. 1. (a) Effective modulation function as a function of the rescaled field $\sqrt{b}\,\phi$. (b) Change in the effective window size as a function of $\sqrt{b}\,\phi$.

Fig. 2. Generalization error as a function of the number of examples ($\alpha = \mu/N$) for the BOAS learning rules under sudden abrupt changes. The simulations were performed for $N = 100$ and averaged over 100 runs (error bars are smaller than the symbol size).

A. Sudden Changes and Drift

The perceptron learns a rule defined by another perceptron with weight vector $\omega$, using algorithm (20)–(21). We study three scenarios: one of abrupt changes separated by static periods, another where there is a constant random drift, and a third where the rule tries to escape from the classifier.

An abrupt change occurs after a certain amount of $P = \alpha N$ examples have been presented. A new, uncorrelated vector $\omega$ is chosen. The performance of the BOAS algorithm, measured by the generalization error, is shown in Fig. 2, which shows a complete relearning of new rules, with the same efficiency as the first one.

Naturally, if such abrupt changes occurred at every time step, it would not be possible to learn anything, so we applied the BOAS algorithm to drifting scenarios [4], [6], [15] to observe its capacity of learning under continuous and moderate changes. We consider two cases: with $\omega$ performing a random walk on the surface of an $N$-dimensional unit sphere and $\omega$ running away from $\hat\omega$. Both cases can be written as

$$\omega_{\mu+1} = \left( 1 - \frac{\Lambda_{\mu+1}}{N} \right) \omega_\mu + \frac{1}{N}\, \zeta_{\mu+1}. \tag{22}$$

In the random walk case, the components of the drift vector $\zeta$ were drawn from a Gaussian distribution $N(0, 2D \cdot I)$, independently

$$\langle \zeta_i(\mu)\, \zeta_j(\nu) \rangle = 2D\, \delta_{ij}\, \delta_{\mu\nu} \tag{23}$$

with $I$ as an identity matrix, $\Lambda_{\mu+1} = \zeta_{\mu+1} \cdot \omega_\mu + D$, and initial condition $\omega_0 \cdot \omega_0 = 1$, to ensure normalization up to the first order in $1/N$. For the runaway behavior, following [4], we imposed

$$\zeta_{\mu+1} = -\sqrt{\frac{2D - (D/N)^2}{1 - \rho^2}}\; \frac{\hat\omega_\mu}{\|\hat\omega_\mu\|}, \qquad \Lambda_{\mu+1} = D/N - \rho\, \|\zeta_{\mu+1}\| \tag{24}$$

where $\rho = \omega \cdot \hat\omega / (\|\omega\|\, \|\hat\omega\|)$.

Fig. 3. Generalization error as a function of $D$ for the perceptron learning a random walk (circles) and a run away rule (triangles). The curves were obtained for $N = 100$ and averaged over 100 runs and they give $e_g \approx 0.36\, D^{0.24}$ for the random walk rule and $e_g \approx 0.54\, D^{0.21}$ for the run away rule, with standard deviation $O(10^{\ldots})$ in all coefficients and exponents.

The asymptotic performance of the algorithm is shown in Fig. 3, with a residual generalization error of $e_g \approx 0.36\, D^{0.24} \approx 0.36\, D^{1/4}$ for the random walk drift case and $e_g \approx 0.54\, D^{0.21} \approx 0.54\, D^{1/5}$ for the runaway case. This means that the BOAS algorithm reaches a performance similar to the tensorial BOnA [15] and the optimal algorithm (OA) obtained by a variational approach [6].

IV. DISCUSSION

For the particular case of the single-layer perceptron, we can understand heuristically the features that make the learning algorithm efficient. Fig. 1 is particularly interesting in showing how the algorithm works and which features will most likely be found in more complex learning scenarios where information ages.
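The two quantities that drive (20)–(21) can be tabulated directly from (19). The sketch below is an illustration under stated assumptions, not a reproduction of Fig. 1: it writes $z$ for the rescaled field (the field $\sigma\,\hat\omega\cdot\xi/\sqrt{N}$ multiplied by $\sqrt{b}$), takes the modulation to be $F_\omega = \Xi'/\Xi$ and the window ratio to be $b_{\mu+1}/b_\mu = 1 + F_b/4$, and the normalization used in the published figure may differ. The function names are hypothetical, and the identity $\Xi(z) + \Xi(-z) = 1$ (which follows from (19) being one half plus an odd function of $z$) is used only for numerical stability.

```python
import numpy as np


def xi_of_z(z):
    """Xi of eq. (19) as a function of the rescaled field z = sqrt(b) * phi."""
    z = np.asarray(z, dtype=float)
    s15 = (2.0 + z ** 2) ** 1.5
    p = np.abs(z) * (3.0 + z ** 2)
    # Xi(-|z|), written without subtracting nearly equal numbers.
    small = 0.5 * (8.0 + 3.0 * z ** 2) / (s15 * (s15 + p))
    # Xi(z) + Xi(-z) = 1, so the positive branch follows from the negative one.
    return np.where(z < 0.0, small, 1.0 - small)


def modulation(z):
    """F_omega = Xi'/Xi, with Xi' = 3 (2 + z^2)^(-5/2)."""
    z = np.asarray(z, dtype=float)
    return 3.0 * (2.0 + z ** 2) ** -2.5 / xi_of_z(z)


def window_ratio(z):
    """b_{mu+1} / b_mu = 1 + F_b / 4, with F_b = z * F_omega."""
    z = np.asarray(z, dtype=float)
    return 1.0 + 0.25 * z * modulation(z)


if __name__ == "__main__":
    grid = np.linspace(-6.0, 6.0, 13)
    print("    z   F_omega   b_new/b")
    for zi, f, r in zip(grid, modulation(grid), window_ratio(grid)):
        print(f"{zi:+5.1f}  {f:8.4f}  {r:8.4f}")
```

In this reading the modulation is small for confidently correct examples, largest for moderately negative fields, and decreasing again for very large errors, while the window ratio is above one for correct classifications and between zero and one for errors.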
In Fig. 1(a), we have the effective modulation function as a function of the rescaled field $\sqrt{b}\,\phi$. It determines the size of the updates of the weights in the direction of the example vector. For positive $\phi$, when the example is correctly classified, a very small change is made. As $b$ grows larger, which will tend to happen when the generalization error decays, this correction gets smaller. For negative $\phi$, when the example is misclassified, a very large modification is done. But if the error is too great, the size of the correction decreases as if the confidence in the supervision decreased. Fig. 1(b) shows the update of the effective window size $(b_{\mu+1})/(b_\mu)$ as a function of $\sqrt{b}\,\phi$. The effective width of the posterior varies inversely with $\lambda$ and so with $b$. For positive $\phi$ note that $(b_{\mu+1})/(b_\mu) \ge 1$, which means that if the predicted label is correct, then the effective width of the prior is reduced. Getting it right makes the machine surer about the rule. On the other hand, for $\phi \le 0$, $(b_{\mu+1})/(b_\mu) \le 1$, which means that the effective width of the prior is increased. An error indicates that the rule might have changed and in order to capture the new rule a wider prior will be needed. Learning occurs by changing the mean of the prior and decreasing its width. Relearning occurs by widening the prior and making larger steps when changing the mean.

The scope of the method we have studied, however, extends beyond the simple case of linear separability. We introduced a method to take into account the possibility that data lose information content due to the time evolution of the underlying problem. We discussed first the problem in terms of energy-like terms and then introduced a Bayesian method. However ill defined the ideas of energy in this context sound to the reader, it should not be forgotten that historically, probably the major example of the use of Bayesian ideas can be found in the founding works of statistical mechanics, a theory used to make predictions about thermodynamic properties of physical systems, based on the available information about expected values of energy. When the system evolves and data turn old, and old data become less important, the energy includes a forgetting effect, measured by $\lambda$, and a term to include new information: $E_{\mu+1} = \lambda E_\mu + V_{\mu+1}$. We started with a Gaussian but admitted that its width might change in an unknown fashion. This lack of knowledge is dealt with by introducing a probability density for its (inverse) width. The integration over the different possible widths leads in turn to a power law, since averaged over all possible time windows, the marginal distribution obtained from (10) is

$$G(\omega) = \int d\lambda\, G(\omega, \lambda) \propto \left[ 1 + \frac{b\, \|\omega - \hat\omega\|^2}{2} \right]^{-a - N/2}.$$

A Gaussian prior would tend to eliminate very strongly weight vectors that are too far from the current mean. The ability to escape from current knowledge when the rule changes hinges on the prior not being too close to zero for the new rule. The algorithm uses errors as signals that lead to an increase in the width of the prior, helping to decrease perseverance. Of course, if the prior is set to zero at a particular weight vector, then no amount of evidence will permit learning it. If there is a possibility of being wrong, and this is probably true for any situation, the prior should not be set arbitrarily close to zero. In our example, the willingness to accept the possibility of being wrong is manifest through the development of heavy tails.

The need to determine the form of prior distributions from different types of prior information will increase as the success of current methods becomes more evident. Every possibility of new information should be reflected in properties of the prior. If this were demonstrably not true for a particular type of information it would be an important development.

We have proposed a simple, but by no means complete, way to include aging in a Bayesian way. For a simple machine, we studied the new algorithm quite thoroughly. We showed that it learns the rule in the stationary case (Fig. 4) [16] and is able to track down changes in the environment by forgetting in a Bayesian inspired way. These theoretical results should not be judged by the application we present. The fact that on such a simple problem it performs very well should not be confused with the proof of having built a practical algorithm. The main advantage of our method is to identify desired features of adaptive algorithms for nonstationary environments which follow from quite general information theory considerations.

Fig. 4. Generalization error as a function of the number of examples ($\alpha = \mu/N$) for the perceptron learning a stationary rule. The simulations were performed for $N = 100$ and averaged over 100 runs and the solid line represents the curve $e_g = 0.88/\alpha$.

TABLE I: ASYMPTOTIC GENERALIZATION ERROR FOR THE RANDOM WALK (RW) AND RUN AWAY (RA) DYNAMICS

ACKNOWLEDGMENT

The authors would like to thank R. Alamino and M. Opper for valuable discussions.

REFERENCES

[1] A. Giffin and A. Caticha, “Updating probabilities with data and moments,” in Proc. Bayesian Inference Maximum Entropy Methods Sci. Eng. Conf., K. H. Knuth, A. Caticha, J. L. Center, A. Giffin, and C. C. Rodriguez, Eds., Nov. 2007, vol. 954, pp. 74–
[2] J. Schlimmer and R. G. Jr, “Incremental learning from noisy data,” Mach. Learn., vol. 1, no. 3, pp. 317–354, 1986.
[3] I. Koychev and R. Lothian, “Tracking drifting concepts by time window optimisation,” in Proc. Res. Develop. Intell. Syst. XXII/25th SGAI Int. Conf. Innovative Tech. Appl. Artif. Intell., M. Bramer, F. Coenen, and T. Allen, Eds., 2006, pp. 46–59.
[4] M. Biehl and H. Schwarze, “Learning drifting concepts with neural networks,” J. Phys. A, Math. Gen., vol. 26, pp. 2651–2665, June 1993.
[5] O. Kinouchi and N. Caticha, “Lower bounds on generalization errors for drifting rules,” J. Phys. A, Math. Gen., vol. 26, pp. 6161–6171, Nov. 1993.
[6] R. Vicente, O. Kinouchi, and N. Caticha, “Statistical mechanics of online learning of drifting concepts: A variational approach,” Mach. Learn., vol. 32, pp. 179–201, Aug. 1998.
[7] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge, U.K.: Cambridge Univ. Press, Mar. 2006.
[8] M. Opper, “A Bayesian approach to on-line learning,” in On-Line Learning in Neural Networks, D. Saad, Ed. Cambridge, U.K.: Cambridge Univ. Press, Apr. 1998, pp. 363–378.
[9] S. Solla and O. Winther, “Optimal perceptron learning: An online Bayesian approach,” in On-line Learning in Neural Networks, D. Saad, Ed. Cambridge, U.K.: Cambridge Univ. Press, Apr. 1998, pp. 379–398.
[10] N. Caticha and E. A. de Oliveira, “Gradient descent learning in and out of equilibrium,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 63, no. 6, p. 061905, 2001.
[11] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Mach. Learn., vol. 23, pp. 69–101, Apr. 1996.
[12] J. Gama, P. Medas, G. Castilho, and P. Rodrigues, “Learning with drift detection,” in Proc. 17th Brazilian Symp. Artif. Intell., Berlin, Germany, Nov. 2004, vol. 3171, pp. 285–295.
[13] M. Murata, M. Kawanabe, A. Ziehe, K. Müller, and S. Amari, “On-line learning in changing environments with applications in supervised and unsupervised learning,” Neural Netw., vol. 15, pp. 743–760, Jun. 2002.
[14] E. A. de Oliveira and R. C. Alamino, “Performance of the Bayesian online algorithm for the perceptron,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 902–905, May 2007.
[15] E. A. de Oliveira, “The Rosenblatt Bayesian algorithm learning in a nonstationary environment,” IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 584–588, Mar. 2007.
[16] O. Kinouchi and N. Caticha, “Optimal generalization in perceptrons,” J. Phys. A, Math. Gen., vol. 25, pp. 6243–6250, 1992.

Discriminant Analysis for Fast Multiclass Data Classification Through Regularized Kernel Function Approximation

Santanu Ghorai, Anirban Mukherjee, and Pranab K. Dutta

Abstract—In this brief we have proposed the multiclass data classification by computationally inexpensive discriminant analysis through vector-valued regularized kernel function approximation (VVRKFA). VVRKFA being an extension of fast regularized kernel function approximation (FRKFA), provides the vector-valued response at single step. The VVRKFA finds a linear operator and a bias vector by using a reduced kernel that maps a pattern from feature space into the low dimensional label space. The classification of patterns is carried out in this low dimensional label subspace. A test pattern is classified depending on its proximity to class centroids. The effectiveness of the proposed method is experimentally verified and compared with multiclass support vector machine (SVM) on several benchmark data sets as well as on gene microarray data for multi-category cancer classification. The results indicate the significant improvement in both training and testing time compared to that of multiclass SVM with comparable testing accuracy principally in large data sets. Experiments in this brief also serve as comparison of performance of VVRKFA with stratified random sampling and sub-sampling.

Index Terms—Discriminant analysis, function approximation, multiclass data classification, support vector machine (SVM).

I. INTRODUCTION

The multiclass data classification is important in many applications such as text classification, optical character recognition, speaker recognition, diagnosis of diseases in medical study, etc. The multiclass classification approaches available in literature can be principally partitioned into two groups. The first group of algorithms gets natural extension from its binary predecessor. Algorithms such as linear discriminant analysis (LDA) [21, pp. 106–119], nearest neighborhoods [21, pp. 463–483], regression and decision trees including C4.5 [36] and CART [7], neural networks [12], etc., belong to this category. The second group consists of methods that involve decomposition of multiclass problems into a number of binary classification problems [1]. The state-of-the-art support vector machine (SVM) [11], [41] being a binary classifier fits into the second group. A number of binary to multiclass extension approaches, such as one versus rest (OVR) [6], one versus one (OVO) [25], directed acyclic graph SVM (DAGSVM) [35], and all at once [10], [43], are available in literature. Hsu and Lin made a comprehensive comparison of these methods in their paper [22]. The difference among all these methods lies in the training time, testing time,

Manuscript received June 01, 2009; revised January 22, 2010 and March 16, 2010; accepted March 18, 2010. Date of publication April 22, 2010; date of current version June 03, 2010. This work was supported by SERC Fast Track project (SR/FT/ET-014/2009) of the Department of Science and Technology, Government of India.
S. Ghorai was with the Department of Electrical Engineering, Indian Institute of Technology, Kharagpur-721302, West Bengal, India. He is now with the Department of Electronics and Communication Engineering, MCKV Institute of Engineering, Liluah, Howrah-711204, West Bengal, India.
A. Mukherjee and P. K. Dutta are with the Department of Electrical Engineering, Indian Institute of Technology, Kharagpur-721302, West Bengal, India.
Color versions of one or more of the figures in this brief are available online at
Digital Object Identifier 10.1109/TNN.2010.2046646