1016 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 6, JUNE 2010

Authorized licensed use limited to: Asha Das. Downloaded on June 17, 2010 at 06:08:27 UTC from IEEE Xplore. Restrictions apply.

variety imposes dealing with the question of what principles should be the guides in its design.

We here present the design of adaptive algorithms that track evolving rules in a Bayesian-inspired way. We follow Opper's approach for static rules. The main idea is to update a probabilistic model in a Bayesian way. Any predictive use of the full learning model will involve either an optimization problem, e.g., the maximum posterior, or a multidimensional integration, e.g., the Bayes posterior average. These are manifestly computationally expensive, and approximations are needed in order to yield a manageable algorithm.

Consider the data set $D_\mu$, which consists of all examples $\{\sigma_\nu = (x_\nu, y_\nu)\}_{\nu=1,\dots,\mu}$ that arrived up to time step $\mu$. Let any prior knowledge be codified by the distribution $P(\omega)$, the prior distribution of $\omega$, and call $P(D_\mu \mid \omega)$ the likelihood of observing the data, constructed from the knowledge of the model and the noise process. Using Bayes' theorem, the posterior distribution

$$P(\omega \mid D_\mu) = \frac{P(\omega)\,P(D_\mu \mid \omega)}{\int d^N\omega'\;P(\omega')\,P(D_\mu \mid \omega')} \qquad (2)$$

can be constructed in the usual way. An assumption in Bayesian learning is that, when a new example $y_{\mu+1}$ arrives, it makes sense to use the old posterior $P(\omega \mid D_\mu)$ as the new prior. Once we accept this, it follows that

$$P(\omega \mid y_{\mu+1}, D_\mu) = \frac{P(\omega \mid D_\mu)\,P(y_{\mu+1} \mid \omega, D_\mu)}{\int d^N\omega'\;P(\omega' \mid D_\mu)\,P(y_{\mu+1} \mid \omega', D_\mu)} \qquad (3)$$

for statistically independent examples, $P(y_{\mu+1} \mid D_\mu) = P(y_{\mu+1})$. The extended data set, after the information arrival, is $D_{\mu+1} = D_\mu \cup y_{\mu+1}$. Then, the probability update is given by

$$P(\omega \mid D_{\mu+1}) = \frac{P(\omega \mid D_\mu)\,P(y_{\mu+1} \mid \omega)}{\int d^N\omega'\;P(\omega' \mid D_\mu)\,P(y_{\mu+1} \mid \omega')}. \qquad (4)$$

Incorporate: the last posterior $P_G(\omega \mid D_\mu)$ is updated according to

$$P_S(\omega \mid D_{\mu+1}) = \frac{P_G(\omega \mid D_\mu)\,P(y_{\mu+1} \mid \omega)}{\int d^N\omega'\;P_G(\omega' \mid D_\mu)\,P(y_{\mu+1} \mid \omega')}. \qquad (5)$$

Project: the posterior $P_S(\omega \mid D_{\mu+1})$ is projected onto the parametric space $G$

$$P_S(\omega \mid D_{\mu+1}) \longrightarrow P_G(\omega \mid D_{\mu+1}) \qquad (6)$$

by minimizing the Kullback–Leibler divergence

$$D_{KL}(S_{\mu+1}; G_{\mu+1}) = \int d^N\omega\;P_S(\omega \mid D_{\mu+1})\,\ln\frac{P_S(\omega \mid D_{\mu+1})}{P_G(\omega \mid D_{\mu+1})}. \qquad (7)$$

Of course, the two steps can be seen as just a change from a given prior to a new one from the same family, and thus the only bookkeeping that is necessary is to calculate the change in the parameters that define them. A convenient choice of $G$ will be dictated by the particular problem under consideration. For example, Solla and Winther have looked at cases where the weights are restricted to a discrete set of values and, in so doing, have been able to introduce an efficient online strategy even when learning a problem with two-state weights. In this brief, we restrict ourselves to the case where $G$ is the Gaussian family.

At this point, the learning algorithm has been defined just by the changes in the average and covariance of the distributions, leading to the Bayesian online algorithm (BOnA)

$$\hat\omega_{\mu+1} = \hat\omega_\mu - C_\mu \cdot \nabla_{\hat\omega} E \qquad (8)$$

$$C_{\mu+1} = C_\mu - C_\mu \cdot \nabla_{\hat\omega}\nabla_{\hat\omega} E \cdot C_\mu \qquad (9)$$

with $\hat\omega$ being the average and $C$ the covariance matrix. This is just a type of gradient-descent learning with the energy function $E \equiv -\ln\langle P(y_{\mu+1} \mid u + \hat\omega)\rangle$, where $\langle\cdots\rangle$ means that the average is over the Gaussian variable $u \sim N(0, C)$ and $\nabla_{\hat\omega}$ is the gradient vector with respect to $\hat\omega$. No information about whether the rule is stationary has been included in constructing this algorithm. But once we receive information that we are dealing with nonstationary rules, we can modify it in a Bayesian-suggested manner.

The learning rate $\eta$ in (1) is a particularly important parameter in nonstationary environments. If the dynamics are expected to converge, then $\eta$ should follow an annealing schedule, typically decreasing during the learning process with some inverse power of the data set size. Since this is not useful in changing environments, many authors have suggested mechanisms to update $\eta$ or to define a trustful time window. Different heuristic arguments have led to the construction of such algorithms. We now introduce Bayesian ideas to help in such a construction. We first discuss this from a heuristic, energy-like point of view, trying to see how aging influences the prior distribution. Then, we present it from a Bayesian perspective. This is quite natural from the perspective of statistical mechanics, where energy plays a leading role, since this was the first fully Bayesian theory. The relevant piece of information is that the probability distribution is related to the Boltzmann factor, the exponential of (minus) the energy.

If learning algorithms are defined by energy or cost functions, we can think of the change in energy of the data set when a new data example $y_{\mu+1}$ arrives. The energy $E_\mu$ of the old examples is related to the prior distribution, since it describes the current state of the system. The influence of old examples should decrease, say by a factor $\gamma$. The new example should contribute a cost or energy term $V_{\mu+1}$, so $E_{\mu+1} = \gamma E_\mu + V_{\mu+1}$. The energy $V_{\mu+1}$ of the new example is related to the likelihood. The posterior update, due to aging, should thus be changed to something like $S_{\mu+1} \propto \{G_\mu\}^{\gamma}\exp\{-V_{\mu+1}\}$, for a prior $G_\mu$. It includes both forgetting, through the $\gamma$ decrease of the energy term, and learning, through the $V$ term, about the new example.

But what value should $\gamma$ have? Should it be fixed? These questions are better answered if we use our favorite inference method. Thus $\gamma$, a parameter that influences the posterior, has itself a probability density whose parameters are updated as information arrives. The prior will now be a distribution $G(\omega, \gamma)$ of both $\omega$ and $\gamma$. The influence of $\gamma$ will be felt by setting the overall scale of change of the prior, or the forgetting part of learning. Suppose that we are dealing with a Gaussian family $G$. Then forgetting is about changing the width of the prior distribution. In principle, we could define a change in the full covariance, an $N \times N$ matrix. This means that, in general, we should follow the update of $O(N^2/2)$ parameters. We will study a simplification that has been seen to be efficient for static rules. It consists of an approximation where the covariance is set to a multiple of the unit matrix, with only $O(N)$ parameters to be updated. We will refer to the resulting algorithms as scalar. There is only one relevant scale that describes forgetting.

There are many possible choices for the distribution, and we have not exhausted their investigation. The research about what turns out to be a useful prior distribution is a major area in Bayesian inference. A reasonable prior distribution family follows from imposing that $P(\gamma)$ is such that $\langle\ln\gamma\rangle$ and $\langle\gamma\rangle$ have finite values, ensuring that $\gamma > 0$. The reason we impose that $\langle\ln\gamma\rangle$ has a finite value is that, if $\gamma$ had a large probability of being zero, it would lead to a total loss of the prior information, which seems undesirable. Our algorithm at the end will choose the effective value of $\gamma$; the imposition that its distribution vanish at the origin does not rule out small values of $\gamma$, which will occur if the machine starts making many errors, as will be the case when the rule suddenly changes. These constraints are reasonable, but we have not claimed to prove they are unique.

Given these constraints, we determine the family of the prior using the method of maximum entropy. This method permits identifying the assumptions behind the specific form of the prior family. We maximize $S = -\int d\gamma\,P(\gamma)\ln P(\gamma)$ under those constraints. This leads to the introduction of the Lagrange parameters $a$ and $b$.
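The incorporate–project cycle of (4)–(7) is easiest to see in a toy setting. The sketch below (our own illustration, not from this brief) runs the recursion for a one-dimensional Gaussian family with a Gaussian likelihood; in that conjugate case the KL-minimizing projection back onto the family is just moment matching and is exact:

```python
import numpy as np

def incorporate_project(mean, var, y, noise_var):
    """One online Bayes step for a 1-D Gaussian prior N(mean, var).

    Incorporate: multiply the prior by the Gaussian likelihood
    N(y | omega, noise_var).  Project: match mean and variance of the
    posterior -- the KL-minimizing projection onto the Gaussian family,
    exact here because the posterior is already Gaussian.
    """
    post_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    post_mean = post_var * (mean / var + y / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(0)
omega_true, noise_var = 1.5, 0.25
mean, var = 0.0, 10.0                      # broad initial prior
for _ in range(200):
    y = omega_true + rng.normal(scale=np.sqrt(noise_var))
    mean, var = incorporate_project(mean, var, y, noise_var)
```

With a stationary rule, the posterior keeps narrowing (here `var` shrinks roughly as `noise_var`/μ); this is exactly the self-annealing behavior that the $\gamma$ mechanism above is designed to counteract when the rule moves.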
The resulting prior family is the gamma distribution $P(\gamma) = \gamma^{a-1}e^{-\gamma/b}/(\Gamma(a)\,b^{a})$. The reason for imposing these two constraints is to have control of small values of $\gamma$ through $a$, and of large values of $\gamma$ through $b$. The values of $a$ and $b$ will be determined adaptively by using the arriving information, as shown below.

The family $G$ of joint distributions we consider is

$$G_\mu(\omega, \gamma) = \frac{e^{-\gamma\|\omega - \hat\omega_\mu\|^2/2}}{(2\pi/\gamma)^{N/2}}\;\frac{\gamma^{a-1}e^{-\gamma/b}}{\Gamma(a)\,b^{a}}. \qquad (10)$$

The algorithm is then obtained by using a member of $G$ as the prior. The inclusion of a new example drives the distribution to the posterior, say $S_{\mu+1}$, which does not in general belong to $G$. Again, the central question is: at the next step, what should be the new prior distribution? It is to be answered under the constraint that it belongs to $G$. The best prior, in the sense of minimal information loss, is obtained by minimizing $D_{KL}(S_{\mu+1}; G_{\mu+1})$. But since we are looking at a parametric family, we just need to look at the update of the parameters that specify a member of $G$. This leads to the new $\{\hat\omega, a, b\}$, obtained from

$$\hat\omega_{i,\mu+1} = \frac{\langle\gamma\,\omega_i\rangle_S}{\langle\gamma\rangle_S}, \qquad 1 \le i \le N \qquad (11)$$

$$\Psi(a_{\mu+1}) + \ln(b_{\mu+1}) = \langle\ln\gamma\rangle_S \qquad (12)$$

$$a_{\mu+1}\,b_{\mu+1} = \langle\gamma\rangle_S \qquad (13)$$

where the digamma function is $\Psi(x) = d\{\ln\Gamma(x)\}/dx$.

To proceed further with the analysis and interpretation of the results, we reduce the number of dynamical variables. Thus, we fix $a_\mu = a$ to a constant value for all $\mu$. Large or small values of $a$ dictate whether the probability of small values of $\gamma$ is high or low. The first and second moments of $G_\mu(\gamma)$¹ are $a_{\mu+1}b_{\mu+1}$ and $a_{\mu+1}b^2_{\mu+1}$, respectively. This can be used to obtain, from (11)–(13), that the learning algorithm is given by

$$\hat\omega_{\mu+1} = \hat\omega_\mu + \frac{1}{a\,b_{\mu+1}}\,\partial_{\hat\omega}\ln\langle P(y_{\mu+1} \mid \omega, \gamma)\rangle_G \qquad (14)$$

$$b_{\mu+1} = b_\mu + \frac{b_\mu^2}{a}\,\partial_{b_\mu}\ln\langle P(y_{\mu+1} \mid \omega, \gamma)\rangle_G \qquad (15)$$

which is also a gradient-descent algorithm, along the gradient of $\ln\langle P(y_{\mu+1} \mid \omega, \gamma)\rangle_G$. This is a general algorithm, fixed by the likelihood, which is itself determined by the choice of the model, the architecture, and eventual noise corruption.

We could use it now in the analysis of a real problem and compare it to other algorithms. We are, however, more interested in investigating the properties of the algorithm itself and understanding how it adapts to varying environments. We do this by looking at how it works in a simple case, rich enough to reveal the details, yet sufficiently simple that we can work out the mathematical details. In Section III, we apply the algorithm (14)–(15) to the simplest classifier, the perceptron with no hidden units.

III. AN APPLICATION: THE PERCEPTRON

The likelihood for the single-layer Boolean perceptron can be written as $P(y_{\mu+1} \mid \omega, \gamma) = F(y_{\mu+1}\,\omega\cdot x_{\mu+1}/\sqrt N)$. If there is no noise, then the probability is either one or zero, depending on whether the example was correctly classified, so $F(x) = \Theta(x)$, the step function, which is the case we now study. It is extremely simple at this point to include different noise processes and study their effects. This would, however, lead us astray from our present purpose, which is to analyze the simple algorithm that will result in a case of nonstationary rules. It is quite interesting that this exercise will point to general characteristics which successful algorithms should have when learning from aging information. We also perform numerical simulations of the learning scenario. One can measure a merit function such as the generalization error $e_g$, given by the probability that the classification of an independent example vector differs from the correct label. Defining $\Delta \equiv \langle P(y_{\mu+1} \mid \omega, \gamma)\rangle_G$ in order to simplify the notation, and writing $\lambda_{\mu+1} = y_{\mu+1}\,\hat\omega_\mu\cdot x_{\mu+1}/\sqrt N$, we have

$$\Delta = \int_0^\infty d\gamma \int d^N\omega\;\frac{e^{-\gamma\|\omega - \hat\omega_\mu\|^2/2}}{(2\pi/\gamma)^{N/2}}\;\frac{\gamma^{a-1}e^{-\gamma/b_\mu}}{\Gamma(a)\,b_\mu^{a}}\int_{-\infty}^{\infty} dx\,F(x)\,\delta\!\left(x - \frac{y_{\mu+1}\,\omega\cdot x_{\mu+1}}{\sqrt N}\right)$$

$$= \frac{(2\pi)^{-1/2}}{\Gamma(a)\,b_\mu^{a}}\int_{-\infty}^{\infty} dx\,F(x)\int_0^\infty d\gamma\;\gamma^{a-1/2}\exp\left\{-\gamma\left[\frac{1}{b_\mu} + \frac{(x - \lambda_{\mu+1})^2}{2}\right]\right\}$$

$$= \frac{\Gamma(a+1/2)}{\sqrt{2\pi}\,\Gamma(a)\,b_\mu^{a}}\int_{-\infty}^{\infty} dx\,F(x)\left[\frac{1}{b_\mu} + \frac{(x - \lambda_{\mu+1})^2}{2}\right]^{-a-1/2}.$$

Since $a - 1$ is fixed to an integer value, the last integral can be written as the $(a-1)$th derivative with respect to $1/b_\mu$

$$\Delta = \frac{\Gamma(a+1/2)}{\sqrt{2\pi}\,\Gamma(a)\,b_\mu^{a}}\,\frac{(-2)^{a-1}}{(2a-1)!!}\,\frac{\partial^{a-1}}{\partial(1/b_\mu)^{a-1}}\int_{-\infty}^{\infty} dx\,F(x)\left[\frac{1}{b_\mu} + \frac{(x - \lambda_{\mu+1})^2}{2}\right]^{-3/2} \qquad (16)$$

which for $F(x) = \Theta(x)$ can be easily calculated

$$\Delta = \frac{\Gamma(a+1/2)}{\sqrt{\pi}\,\Gamma(a)\,b_\mu^{a}}\,\frac{(-2)^{a-1}}{(2a-1)!!}\,\frac{\partial^{a-1}}{\partial(1/b_\mu)^{a-1}}\left\{b_\mu\left[1 + \lambda_{\mu+1}\sqrt{\frac{b_\mu}{2 + b_\mu\lambda_{\mu+1}^2}}\right]\right\}. \qquad (17{-}18)$$

Very small values of $a$ should represent cases where very rapid forgetting of the data should occur, since the greatest contribution to the mass of the density would be at the origin. Conversely, for large values of $a$, little forgetting occurs. We now choose to study $a = 2$, for two reasons: first, because it is easy to treat, and second, because it represents an intermediate value. It results in

$$\Delta = \frac{1}{2}\left[1 + \lambda_{\mu+1}\sqrt{\frac{b_\mu}{2 + b_\mu\lambda_{\mu+1}^2}}\left(1 + \frac{1}{2 + b_\mu\lambda_{\mu+1}^2}\right)\right] \qquad (19)$$

which, with (14) and (15), gives the new Bayesian online adaptive scalar (BOAS) algorithm

$$\hat\omega_{\mu+1} = \hat\omega_\mu + \frac{F_\omega\,b_\mu}{2\,b_{\mu+1}}\;\frac{y_{\mu+1}\,x_{\mu+1}}{\sqrt N} \qquad (20)$$

$$b_{\mu+1} = \left(1 + \frac{F_b}{4}\right)b_\mu \qquad (21)$$

where $F_\omega = \Delta'/\Delta$, $F_b = \lambda_{\mu+1}\,b_\mu\,\Delta'/\Delta$, and $\Delta' = 3\,(2 + b_\mu\lambda_{\mu+1}^2)^{-5/2}/\sqrt{b_\mu}$.

Note that $\lambda$ is an interesting variable: the output for an example is given by the sign of the field, and so $\lambda$ is large in absolute value when the sign is clear. It is positive if the classification is correct and negative if not. Fig. 1 shows the form of the changes in the weights and in $b_{\mu+1}$ as a function of a rescaled $\lambda$. We now apply this algorithm in different cases. In the first case, the rule suddenly changes to a new one, uncorrelated with the previous one. This tests whether the effective annealing of the algorithm is able to cope with several changes or whether it is only able to learn the first few cases while growing older for subsequent changes.

¹We can write (10) as $G_\mu(\omega, \gamma) = G(\omega \mid \gamma)\,G(\gamma)$, where the prior is $G(\gamma) = \gamma^{a-1}e^{-\gamma/b}/(\Gamma(a)\,b^{a})$.
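The BOAS step (19)–(21) amounts to a few lines of code. The sketch below transcribes the equations as reconstructed above (the helper name and the toy example are ours):

```python
import numpy as np

def boas_step(w_hat, b, x, y):
    """One BOAS update for the noiseless perceptron, eqs. (19)-(21).

    w_hat: current mean weight vector; b: scale of the gamma prior
    (a = 2 fixed); (x, y): example vector and its +-1 label.
    """
    N = w_hat.size
    lam = y * (w_hat @ x) / np.sqrt(N)             # stability field lambda
    s = 2.0 + b * lam**2
    delta = 0.5 * (1.0 + lam * np.sqrt(b / s) * (1.0 + 1.0 / s))  # eq. (19)
    dprime = 3.0 / (np.sqrt(b) * s**2.5)
    f_w = dprime / delta
    f_b = lam * b * dprime / delta
    b_new = (1.0 + f_b / 4.0) * b                  # eq. (21)
    w_new = w_hat + (f_w * b / (2.0 * b_new)) * y * x / np.sqrt(N)  # eq. (20)
    return w_new, b_new

N = 10
w_hat = np.ones(N) / np.sqrt(N)                    # unit-norm student
w_err, b_err = boas_step(w_hat, 1.0, np.ones(N), -1)  # misclassified example
w_ok,  b_ok  = boas_step(w_hat, 1.0, np.ones(N), +1)  # correctly classified
```

An error ($\lambda < 0$) lowers $b$, widening the effective prior, while a correct, confident classification raises $b$ and narrows it; this is the qualitative behavior discussed around Fig. 1.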
Fig. 1. (a) Effective modulation function as a function of the rescaled field $\lambda\sqrt{b_\mu}$. (b) Change in the effective window size as a function of $\lambda\sqrt{b_\mu}$.

Fig. 2. Generalization error as a function of the number of examples ($\alpha = \mu/N$) for the BOAS learning rules under sudden abrupt changes. The simulations were performed for $N = 100$ and averaged over 100 runs (error bars are smaller than the symbol size).

Next we model the change of the rule as a random walk and see how the system is able to track the drifting rule.

A. Sudden Changes and Drift

The perceptron learns a constant rule defined by another perceptron with a fixed weight vector $\omega$ using algorithm (20)–(21). We study three scenarios: abrupt changes separated by static periods, a constant random drift, and a rule that tries to escape from the classifier.

An abrupt change occurs after a certain number $P$ of examples has been presented. A new, uncorrelated vector $\omega$ is chosen. The performance of the BOAS algorithm, measured by the generalization error, is shown in Fig. 2, which shows a complete relearning of new rules, with the same efficiency as the first one.

Naturally, if such abrupt changes occurred at every time step, it would not be possible to learn anything, so we applied the BOAS algorithm to drifting scenarios to observe its capacity for learning under continuous and moderate changes. We consider two cases: $\omega$ performing a random walk on the surface of the $N$-dimensional unit sphere, and $\omega$ running away from $\hat\omega$. Both cases can be written as

$$\omega_{\mu+1} = \left(1 - \frac{\Lambda_{\mu+1}}{N}\right)\omega_\mu + \frac{1}{\sqrt N}\,\eta_{\mu+1}. \qquad (22)$$

In the random walk case, the components of the drift vector $\eta$ were drawn independently from a Gaussian distribution with

$$\langle\eta_i(\mu)\,\eta_j(\nu)\rangle = 2D\,\delta_{ij}\,\delta_{\mu\nu} \qquad (23)$$

with $I$ the identity matrix, $\Lambda_{\mu+1} = \eta_{\mu+1}\cdot\omega_\mu + D$, and initial condition $\omega_0\cdot\omega_0 = 1$, to ensure normalization up to first order in $1/N$. For the runaway behavior, following earlier work, we imposed

$$\eta_{\mu+1} = -\left[\frac{2D - (D/N)^2}{1 - \rho^2}\right]^{1/2}\frac{\hat\omega_\mu}{\|\hat\omega_\mu\|}, \qquad \Lambda_{\mu+1} = \frac{D}{N} - \|\eta_{\mu+1}\| \qquad (24)$$

where $\rho = \omega_\mu\cdot\hat\omega_\mu/(\|\omega_\mu\|\,\|\hat\omega_\mu\|)$.

Fig. 3. Generalization error as a function of $D$ for the perceptron learning a random-walk rule (circles) and a run-away rule (triangles). The curves were obtained for $N = 100$ and averaged over 100 runs; they give $e_g \approx 0.36\,D^{1/4}$ for the random-walk rule and $e_g \approx 0.54\,D^{1/5}$ for the run-away rule, with standard deviation $O(10^{-2})$ in all coefficients and exponents.

The asymptotic performance of the algorithm is shown in Fig. 3, with a residual generalization error of $e_g \approx 0.36\,D^{0.24} \approx 0.36\,D^{1/4}$ for the random-walk drift case and $e_g \approx 0.54\,D^{0.21} \approx 0.54\,D^{1/5}$ for the runaway case. This means that the BOAS algorithm reaches a performance similar to the tensorial BOnA and the optimal algorithm (OA) obtained by a variational approach.

IV. DISCUSSION

For the particular case of the single-layer perceptron, we can understand heuristically the features that make the learning algorithm efficient. Fig. 1 is particularly interesting in showing how the algorithm works and which features will most likely be found in more complex learning scenarios where information ages.
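The drift dynamics (22)–(23) are straightforward to simulate. A minimal sketch of the random-walk case (our own code; for simplicity, the teacher is renormalized exactly rather than only to first order in $1/N$):

```python
import numpy as np

def random_walk_step(omega, D, rng):
    """One random-walk drift step in the spirit of eq. (22): each
    component of the teacher gets an independent Gaussian kick with
    <eta_i eta_j> = 2D delta_ij, scaled by 1/sqrt(N), after which the
    vector is put back on the unit sphere."""
    N = omega.size
    eta = rng.normal(scale=np.sqrt(2.0 * D), size=N)
    omega = omega + eta / np.sqrt(N)
    return omega / np.linalg.norm(omega)

rng = np.random.default_rng(0)
N, D = 100, 0.01
omega0 = np.zeros(N)
omega0[0] = 1.0
omega = omega0.copy()
for _ in range(500):
    omega = random_walk_step(omega, D, rng)
overlap = omega @ omega0    # decays from 1 as the rule wanders
```

Feeding examples labeled by this moving $\omega$ to the BOAS update reproduces the tracking scenario of Fig. 3; the residual overlap between student and teacher sets the residual generalization error.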
Fig. 4. Generalization error as a function of the number of examples ($\alpha = \mu/N$) for the perceptron learning a stationary rule. The simulations were performed for $N = 100$ and averaged over 100 runs; the solid line represents the curve $e_g = 0.88/\alpha$.

TABLE I
ASYMPTOTIC GENERALIZATION ERROR FOR THE RANDOM WALK (RW) AND RUN AWAY (RA) DYNAMICS

In Fig. 1(a), we have the effective modulation function as a function of the rescaled field $\lambda\sqrt{b_\mu}$. It determines the size of the updates of the weights in the direction of the example vector. For positive $\lambda$, when the example is correctly classified, a very small change is made. As $\lambda\sqrt{b_\mu}$ grows larger, which will tend to happen when the generalization error decays, this correction gets smaller. For negative $\lambda$, when the example is misclassified, a very large modification is made. But if the error is too great, the size of the correction decreases, as if the confidence in the supervision decreased. Fig. 1(b) shows the update of the effective window size $b_{\mu+1}/b_\mu$ as a function of $\lambda$. The effective width of the posterior varies inversely with $\gamma$ and so with $b_\mu$. For positive $\lambda$, note that $b_{\mu+1}/b_\mu \ge 1$, which means that if the predicted label is correct, then the effective width of the prior is reduced. Getting it right makes the machine surer about the rule. On the other hand, for $\lambda < 0$, $b_{\mu+1}/b_\mu \le 1$, which means that the effective width of the prior is increased. An error indicates that the rule might have changed and, in order to capture the new rule, a wider prior will be needed. Learning occurs by changing the mean of the prior and decreasing its width. Relearning occurs by widening the prior and making larger steps when changing the mean.

The scope of the method we have studied, however, extends beyond the simple case of linear separability. We introduced a method to take into account the possibility that data lose information content due to the time evolution of the underlying problem. We discussed the problem first in terms of energy-like terms and then introduced a Bayesian method. However ill defined the ideas of energy in this context may sound to the reader, it should not be forgotten that, historically, probably the major example of the use of Bayesian ideas can be found in the founding works of statistical mechanics, a theory used to make predictions about thermodynamic properties of physical systems based on the available information about expected values of energy. When the system evolves and old data become less important, the energy includes a forgetting effect, measured by $\gamma$, and a term to include new information: $E_{\mu+1} = \gamma E_\mu + V_{\mu+1}$. We started with a Gaussian but admitted that its width might change in an unknown fashion. This lack of knowledge is dealt with by introducing a probability density for its (inverse) width. The integration over the different possible widths leads in turn to a power law, since, averaged over all possible time windows, the marginal distribution is

$$G_\mu(\omega) = \int d\gamma\,G_\mu(\omega, \gamma) \propto \left[1 + \frac{b_\mu}{2}\,\|\omega - \hat\omega_\mu\|^2\right]^{-a-1/2}.$$

A Gaussian prior would tend to eliminate very strongly weight vectors that are too far from the current mean. The ability to escape from current knowledge when the rule changes hinges on the prior not being too close to zero for the new rule. The algorithm uses errors as signals that lead to an increase of the width of the prior, helping to decrease perseverance. Of course, if the prior is set to zero at a particular weight vector, then no amount of evidence will permit learning it. If there is a possibility of being wrong, and this is probably true for any situation, the prior should not be set arbitrarily close to zero. In our example, the willingness to accept the possibility of being wrong is manifest through the development of heavy tails.

The need to determine the form of prior distributions from different types of prior information will increase as the success of current methods becomes more evident. Every possibility of new information should be reflected in properties of the prior. If this were demonstrably not true for a particular type of information, it would be an important development.

We have proposed a simple, but by no means complete, way to include aging in a Bayesian way. For a simple machine, we studied the new algorithm quite thoroughly. We showed that it learns the rule in the stationary case (Fig. 4) and is able to track changes in the environment by forgetting in a Bayesian-inspired way. These theoretical results should not be judged by the application we present here. The fact that it performs very well on such a simple problem should not be confused with proof of having built a practical algorithm. The main advantage of our method is to identify desired features of adaptive algorithms for nonstationary environments, which follow from quite general information-theory considerations.

ACKNOWLEDGMENT

The authors would like to thank R. Alamino and M. Opper for valuable discussions.

REFERENCES

[1] A. Giffin and A. Caticha, "Updating probabilities with data and moments," in Proc. Bayesian Inference Maximum Entropy Methods Sci. Eng. Conf., K. H. Knuth, A. Caticha, J. L. Center, A. Giffin, and C. C. Rodriguez, Eds., Nov. 2007, vol. 954, pp. 74–84.
[2] J. Schlimmer and R. Granger, Jr., "Incremental learning from noisy data," Mach. Learn., vol. 1, no. 3, pp. 317–354, 1986.
[3] I. Koychev and R. Lothian, "Tracking drifting concepts by time window optimisation," in Proc. Res. Develop. Intell. Syst. XXII / 25th SGAI Int. Conf. Innovative Tech. Appl. Artif. Intell., M. Bramer, F. Coenen, and T. Allen, Eds., 2006, pp. 46–59.
[4] M. Biehl and H. Schwarze, "Learning drifting concepts with neural networks," J. Phys. A, Math. Gen., vol. 26, pp. 2651–2665, Jun. 1993.