Adaptive Learning Algorithms for Bayesian Network Classifiers

This presentation is concerned with adaptive learning algorithms for Bayesian
network classifiers in on-line learning scenarios. An efficient supervised learning algorithm in such a scenario must be able to improve its predictive accuracy by incorporating the incoming data while keeping the cost of updating low. Moreover, if the underlying process is not strictly stationary, the target concept can change over time. We present a unified adaptive prequential framework for supervised learning,
called AdPreqFr4SL, which attempts to handle the
cost-performance trade-off and to cope with concept drift. We experimentally demonstrate the advantages of the AdPreqFr4SL over its non-adaptive versions for the particular class of k-Dependence Bayesian Classifiers (k-DBCs).

  1. Adaptive Algorithms for Bayesian Network Classifiers. PhD Thesis, Department of Mathematics, University of Aveiro, September 2006. Gladys Castillo Jordán. Supervisors: João Gama, Ana M. Breda.
  2. Thesis Focus
     - Machine Learning Problem: Supervised Learning
       - learn a classifier from data
       - classification is applied in medical diagnosis, user modeling, etc.
     - Model: Bayesian Network Classifiers (BNCs)
       - learning BNCs from data means building probabilistic graphical models, connecting Probability Theory, Statistical Inference, Graph Theory and Optimization
     - Learning Scenario: On-Line Learning
       - data arrives sequentially: first prediction, then updating
     - AIM: development of adaptive algorithms for BNCs capable of handling:
       - the trade-off between the cost of updating and the gain in performance
       - concept drift
  3. Outline
     - Introduction
       - The Supervised Classification Problem
       - Naïve Bayes Classifier
       - Improving Naïve Bayes
     - Bayesian Network Classifiers
       - k-Dependence Bayesian Network Classifiers
       - Learning Bayesian Network Classifiers
         - Choosing a class-model
     - The Adaptive Prequential Framework for Supervised Learning
       - Cost-Quality Management
       - Concept Drift Management
     - Conclusions and Future Directions
  4. The Supervised Classification Problem
     A classifier is a function f: X → C that assigns a class label c ∈ C = {c1, …, cm} to objects described by a set of attributes X = {X1, X2, …, Xn}.
     - Given: a dataset D with N labeled examples of <X, C>
     - Build: a classifier, a hypothesis hC: X → C that can correctly predict the class labels of new objects
     - Learning Phase: a supervised learning algorithm induces hC from D
     - Classification Phase: given the attribute values of a new example x(N+1), output the class attached to it, c(N+1) = hC(x(N+1)) ∈ C
     [Figure: a two-class example in the attribute space X1 × X2, separating "give credit" from "don't give credit" examples]
  5. Bayesian Classifiers
     - statistical classifiers: give the probability P(cj | x) that x belongs to a particular class, rather than a simple classification
     - Bayesian because the class c* attached to the example x is determined by Bayes' Theorem: posterior ∝ prior × likelihood
     - Bayesian classifiers find the MAP hypothesis given the example x
     When the input space is high-dimensional, direct estimation of P(x | cj) is hard unless we introduce some assumptions.
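For reference, the rule this slide describes can be written out as the standard Bayes/MAP formulation (reconstructed here; the slide's own formula image did not survive extraction):

```latex
P(c_j \mid \mathbf{x}) \;=\; \frac{P(c_j)\, P(\mathbf{x} \mid c_j)}{P(\mathbf{x})}
\qquad\Longrightarrow\qquad
c^{*} \;=\; \arg\max_{c_j \in C} \; P(c_j)\, P(\mathbf{x} \mid c_j)
```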
  6. Naïve Bayes (NB) Classifier
     Duda and Hart (1973); Langley (1992)
     - "Naïve" because of its very strong independence assumption: all the attributes are conditionally independent given the class
     - "Bayes" because the class c* attached to an example x is determined by Bayes' Theorem
     Applying Bayes' Theorem under this assumption, P(x | cj) decomposes into a product of n terms, one term for each attribute.
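A minimal sketch of a categorical Naïve Bayes along these lines, computing c* = arg max P(cj) ∏i P(xi | cj); this is illustrative only (the class name and the Laplace-smoothing choice are our own, not taken from the thesis):

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y):
        self.class_counts = Counter(y)
        self.n = len(y)
        # counts[j][c][v] = number of examples of class c with attribute j = v
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)           # observed values per attribute
        for xs, c in zip(X, y):
            for j, v in enumerate(xs):
                self.counts[j][c][v] += 1
                self.values[j].add(v)
        return self

    def predict(self, xs):
        best_c, best_p = None, -1.0
        for c, nc in self.class_counts.items():
            p = nc / self.n                       # prior P(c)
            for j, v in enumerate(xs):            # one likelihood factor per attribute
                k = len(self.values[j])
                p *= (self.counts[j][c][v] + 1) / (nc + k)
            if p > best_p:
                best_c, best_p = c, p
        return best_c

# usage:
# nb = NaiveBayes().fit([["sunny", "hot"], ["rainy", "cool"]], ["no", "yes"])
# nb.predict(["rainy", "cool"])   # -> "yes"
```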
  7. Naïve Bayes Performance
     - NB has HIGH BIAS (the independence assumption is violated in practice), but this is compensated by LOW VARIANCE (few parameters to learn)
     - as data increases, the variance vanishes → focus shifts to bias management
     "certain types of (very high) bias can be canceled by low variance to produce accurate classification" — Friedman (1997)
     - bias: the central tendency of the predictions of the classifiers induced from different samples
     - variance: the degree to which the predictions differ from that central tendency from sample to sample
  8. Improving Naïve Bayes
     - reducing the bias resulting from the modeling error
       - by relaxing the attribute independence assumption
       - one natural extension: Bayesian Networks
     - reducing the bias of the parameter estimates
       - by improving the probability estimates computed from the data
     Relevant works:
     - Webb and Pazzani (1998) - "Adjusted probability naive Bayesian induction", in LNCS v. 1502
     - Gama (2001, 2003) - "Iterative Bayes", in Theoretical Computer Science, v. 292
     - Friedman, Geiger and Goldszmidt (1997) - "Bayesian Network Classifiers", in Machine Learning, 29
     - Pazzani (1995) - "Searching for attribute dependencies in Bayesian Network Classifiers", in Proc. of the 5th Workshop on Artificial Intelligence and Statistics
     - Keogh and Pazzani (1999) - "Learning augmented Bayesian classifiers…", in Proc. of the 7th Workshop on Artificial Intelligence and Statistics
  9. Iterative Bayes: Improving the Parameter Estimates
     João Gama, "Iterative Bayes", Intelligent Data Analysis 4, 475-488, IOS Press, 2000
     Iterative Bayes cycles through the training examples. For each example:
     1. Compute the increment delta = 1 - P(Predicted | Example); delta > 0 if the prediction is correct.
     2. The increment is added to (or subtracted from) the entries of the corresponding column of the contingency tables.
     The confidence level of the predicted class increases while that of the other classes decreases.
     Example (Observed = Right, Predicted = Right): delta = 1 - 0.586978 = 0.413022 > 0.
     - before: P(Right|x) = 0.586978, P(Left|x) = 0.277796, P(Balanc.|x) = 0.135227 (59% confidence level)
     - after Iterative Bayes: P(Right|x) = 0.789009, P(Left|x) = 0.210816, P(Balanc.|x) = 0.000175 (79% confidence level)
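A rough sketch of one Iterative-Bayes-style pass over the contingency tables. The exact increments and normalization in Gama (2000) differ in details; the helper `predict_proba` and the positivity floor are our own assumptions:

```python
def iterative_bayes_pass(tables, examples, predict_proba):
    """One pass of an Iterative-Bayes-style update (illustrative sketch).

    tables[j][c][v]: contingency count for attribute j, class c, value v.
    predict_proba(xs): hypothetical helper returning {class: P(class | xs)}
                       under the current tables.
    """
    for xs, true_class in examples:
        probs = predict_proba(xs)
        predicted = max(probs, key=probs.get)
        delta = 1.0 - probs[predicted]           # how unsure the prediction was
        if predicted != true_class:
            delta = -delta                        # wrong prediction: weaken instead
        for j, v in enumerate(xs):
            for c in tables[j]:
                # reinforce the predicted class's entries, penalize the others
                adj = delta if c == predicted else -delta
                tables[j][c][v] = max(tables[j][c][v] + adj, 1e-6)  # keep counts positive
```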
  10. Outline
     - Introduction
       - The Supervised Classification Problem
       - Naïve Bayes Classifier
       - Improving Naïve Bayes
     - Bayesian Network Classifiers
       - k-Dependence Bayesian Network Classifiers
       - Learning Bayesian Network Classifiers
         - Choosing a class-model
     - The Adaptive Prequential Framework for Supervised Learning
       - Cost-Quality Management
       - Concept Drift Management
     - Conclusions and Future Directions
  11. Bayesian Networks
     Pearl (1988, 2000); Jensen (1996); Lauritzen (1996)
     Bayesian Networks (BNs) graphically represent the joint probability distribution of a set X of random variables in a problem domain, connecting probability theory and graph theory.
     A BN = (S, Θ_S) consists of two components:
     - the structure S, a Directed Acyclic Graph (DAG) — the qualitative part
       - nodes: random variables
       - arcs: direct dependences between variables
     - the set of parameters Θ_S — the quantitative part
       - conditional probability distributions (CPDs)
       - if the variables are discrete, the CPDs are conditional probability tables (CPTs)
     Markov condition: each node is independent of its non-descendants given its parents in S.
     [Figure: the "Wet Grass" example BN from Hugin]
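The Markov condition yields the usual factorization of the joint distribution over the network's families (a standard BN identity, stated here for completeness):

```latex
P(X_1, \dots, X_n \mid S, \Theta_S) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathbf{Pa}(X_i), \Theta_S\big)
```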
  12. Bayesian Network Classifiers
     A BN can be used as a classifier that gives the posterior probability distribution of the class node C given the values of the other attributes.
     - Given an example x, we can compute the predictive distribution P(C | x, S) by marginalizing the joint distribution P(C, x | S)
     - The class c* attached to x is: c* = h_BNC(x) = arg max_{j=1..m} P(cj, x | S)
     Two approaches to the structure:
     - Restricted: the structure contains the NB structure, e.g. a Tree Augmented NB (TAN) or a BN Augmented NB (BAN)
     - Unrestricted: the class node is treated as an ordinary node, e.g. a General BN (GBN)
  13. Learning Bayesian Network Classifiers
     Learning task: given data D of i.i.d. examples, find the BNC h_C = (S, Θ_S) that best fits the data in some sense.
     1. Choose a suitable class of BNCs (the class-model), for example BAN — OUR FOCUS
     2. Choose a structure within this class-model that best fits the data — this defines a parametric probabilistic model
     3. Estimate the set of parameters Θ_S from the data
  14. Learning the Structure as a Search Problem
     Our focus: score-based approaches.
     Discrete optimization problem: find the structure S ∈ 𝒮 that maximizes the scoring function Score(S, D).
     Must choose:
     - a class-model of BNCs that defines the space 𝒮 of possible network structures (DAGs)
     - a scoring function Score(S, D) to measure the degree of fit of each structure S
     This optimization is NP-hard (Chickering, 1994), so heuristic search algorithms are used: they explore the space of candidate networks while optimizing the score, e.g. greedy hill climbing, simulated annealing, etc.
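The slide leaves the choice of score open; as one concrete, commonly used example (our choice, not necessarily the thesis's), the log-likelihood score decomposes over the network's families and can be computed from the sufficient statistics (counts):

```latex
\mathrm{LL}(S \mid D) \;=\; \sum_{i=1}^{N} \log P\big(\mathbf{x}^{(i)} \mid S, \hat{\Theta}_S\big)
\;=\; \sum_{j} \sum_{v,\, \mathbf{pa}} N_{j,v,\mathbf{pa}} \, \log \hat{P}\big(X_j = v \mid \mathbf{Pa}(X_j) = \mathbf{pa}\big)
```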
  15. k-Dependence Bayesian Classifiers
     Sahami (1996)
     Definition: a k-DBC is a Bayesian network which:
     - contains the structure of NB
     - allows each attribute Xi to have a maximum of k attribute nodes as parents
     k-DBCs provide a unified framework for all the BNC class-models containing the structure of Naïve Bayes: NB is a 0-DBC, TAN is a 1-DBC, and a BAN with at most two attribute parents per node is a 2-DBC.
     By varying the value of k we obtain classifiers that smoothly move along the spectrum of attribute dependences; the model's complexity increases with k, so we can control the complexity of k-DBCs by choosing an appropriate value of k (see the sketch below).
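A small sketch of the structural constraint, using a hypothetical adjacency representation of our own (`parents[i]` is the set of attribute parents of attribute i; the class node, always a parent of every attribute in a k-DBC, is left implicit, and acyclicity is checked separately):

```python
def is_valid_kdbc(parents, k):
    """Check the k-DBC constraint: each attribute has at most k attribute parents.

    parents: dict mapping each attribute to the set of its *attribute* parents
             (the class node is an implicit parent of every attribute).
    """
    return all(len(ps) <= k for ps in parents.values())

# NB is a 0-DBC: no attribute-to-attribute arcs at all
assert is_valid_kdbc({"X1": set(), "X2": set(), "X3": set()}, k=0)
# A TAN structure is a 1-DBC: each attribute has at most one attribute parent
assert is_valid_kdbc({"X1": set(), "X2": {"X1"}, "X3": {"X2"}}, k=1)
```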
  16. Choosing the Class-Model
     - A classifier that is too simple (e.g. NB) underfits the data: too much bias
     - A classifier that is too complex overfits the data: too much variance
     [Figure: behavior of training and test error as model complexity varies, from high bias / low variance (underfitting) to low bias / high variance (overfitting); there is an optimal model complexity that gives the minimum test error]
     Learning algorithms should be able to select a model with the appropriate complexity for the available data.
     Controlling the complexity of BNCs:
     - the BNC class-models (NB, TAN, BAN, etc.) differ in the number of parents allowed per attribute
     - choosing the appropriate class-model of BNCs is not trivial: it depends on the chosen score and the available data
  17. What Class-Model to Choose?
     Results from an empirical study conducted with the class of k-DBCs on the nursery dataset.
     [Figure: bias and variance decomposition of the test error for 1000 vs. 12500 training examples]
     - Varying k, the score and the training-set size can have different effects on bias and variance, and consequently on the test error
     - As the data increases, it makes sense to gradually increase the value of k, adjusting the complexity of the class-model, and hence of the induced BNCs, to the available data
  18. Outline
     - Introduction
       - The Supervised Classification Problem
       - Naïve Bayes Classifier
       - Improving Naïve Bayes
     - Bayesian Network Classifiers
       - k-Dependence Bayesian Network Classifiers
       - Learning Bayesian Network Classifiers
         - Choosing a class-model
     - The Adaptive Prequential Framework for Supervised Learning
       - Cost-Quality Management
       - Concept Drift Management
     - Conclusions and Future Directions
  19. The Prequential Learning Framework
     Dawid's prequential approach = Prediction + Sequential updating
     On-line learning = first prediction, then updating
     We assume:
     - an on-line learning framework
     - a changing environment
     - data arrives in batches
     - a unique hypothesis is maintained
  20. Requirements
     1. Handle the trade-off between the cost of updating and the gain in performance
        - if the cost of updating is high, decide whether triggering the updating process is really necessary to improve the performance
     2. Handle concept drift
        - decide whether adaptation is really necessary because a concept change has occurred
        - coping with concept drift falls into two subtasks: first, detect the concept drift; then, adapt to the changes accordingly
  21. The Adaptive Prequential Framework (AdPreqFr4SL)
     At least one performance indicator is monitored over time to infer the current state of the system. The AdPreqFr4SL is provided with controlling mechanisms that try to select the best adaptive actions according to the inferred state: the current hypothesis is adapted accordingly.
  22. Indicators and States
     In the AdPreqFr4SL, two performance indicators are monitored over time:
     - ErrB, the batch error: the proportion of misclassified examples in one batch
     - ErrS, the model error: the proportion of misclassified examples among all the examples that were classified using the same structure
     These are used to estimate one of the following states:
     - S1 - IS IMPROVING: the performance is improving
     - S2 - STOP IMPROVING: the performance stops improving
     - S3 - CONCEPT DRIFT ALERT: a first alert of concept drift has been signaled
     - S4 - CONCEPT DRIFT: there is a gradual concept change
     - S5 - CONCEPT SHIFT: there is an abrupt concept change
     - S6 - STABLE PERFORMANCE: the performance has reached a plateau; this is the goal state
  23. Cost-Quality Management
     Two main policies:
     1. Bias management: improve the performance by trying to select the optimal class-model for the current data
        - MOTIVATION: reduce the error by reducing bias or variance
          - with few data: reduce variance by using simpler models
          - as data increases: variance decreases, so focus on bias management
        - RATIONALE: start with NB (set k to 0); as data increases, scale up the complexity by gradually increasing k
     2. Gradual adaptation: reduce the cost of updating by trying to adapt only the parameters
        - RATIONALE: adapt the structure only if the performance stops improving; stop adapting (both parameters and structure) if increasing the training data will not result in further improvements
  24. Adaptation Policy
     INITIAL LEVEL: a new hypothesis is built using NB (k ← 0)
     FIRST LEVEL: only parameter adaptation
     - IF the performance stops improving with the current structure THEN go to SECOND LEVEL
     SECOND LEVEL: structure adaptation
     - adapt the current structure without extending the search space, by adding new dependences or removing existing ones
     - IF the structure remains the same THEN go to THIRD LEVEL
     THIRD LEVEL: IF k < kMax THEN extend the search space and continue searching
     - k ← k + 1 and search for new dependences; continue until the performance reaches a plateau, at which point THE ADAPTATION PROCESS IS STOPPED
     A sketch of this control loop is given below.
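A schematic rendering of the levels above as one control step. All four callables and the model attributes (`k`, `structure`, `adaptation_stopped`) are hypothetical stand-ins for AdPreqFr4SL components, injected so the sketch stays self-contained:

```python
def adaptation_step(model, batch, k_max,
                    fit_parameters, search_structure,
                    stopped_improving, reached_plateau):
    """One step of the three-level adaptation policy (illustrative sketch)."""
    fit_parameters(model, batch)                      # FIRST LEVEL: always cheap
    if not stopped_improving(model):
        return model                                  # keep the current structure
    new_structure = search_structure(model, model.k)  # SECOND LEVEL: same search space
    if new_structure != model.structure:
        model.structure = new_structure
        return model
    if model.k < k_max:                               # THIRD LEVEL: extend the space
        model.k += 1
        model.structure = search_structure(model, model.k)
    elif reached_plateau(model):
        model.adaptation_stopped = True               # stop adapting entirely
    return model
```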
  25. Sequential Updating of the Structure of k-DBCs
     Hill-climbing search over:
     - search space: the B-space (𝒮, O), where 𝒮 is the space of possible k-DBCs restricted by the chosen k, and O = {addArc, deleteArc}
     - initial solution: first initialized to Naïve Bayes; afterwards, the current structure is used
     - objective function: a scoring function
     - stopping criterion: stop when the score no longer improves or when no operator can be applied
     Algorithm:
     - Input: k, score, sufficient statistics; Output: a k-DBC structure
     - Init: the current structure
     - At each search step, apply the operator that most increases the score; STOP searching when the stopping criterion is met
     If the current structure is NB, use only arc additions; otherwise also use arc deletions to correct previous errors. We assume we are able to keep all the sufficient statistics needed to compute the score for each candidate structure. (A sketch follows below.)
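A self-contained sketch of this greedy search, under our own assumptions: the structure is a dict of attribute-parent sets (as in the k-DBC sketch earlier), and `score` is a hypothetical callable that scores a whole structure from the cached sufficient statistics:

```python
from itertools import permutations

def hill_climb_kdbc(attrs, parents, k, score):
    """Greedy hill climbing over k-DBC structures (illustrative sketch)."""

    def creates_cycle(parents, child, parent):
        # adding parent -> child closes a cycle iff child is an ancestor of parent
        stack, seen = [parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    current = score(parents)
    while True:
        best_gain, best_move = 0.0, None
        for child, parent in permutations(attrs, 2):
            if parent in parents[child]:                  # deleteArc candidate
                parents[child].discard(parent)
                gain = score(parents) - current
                parents[child].add(parent)                # undo trial move
                move = ("del", child, parent)
            elif len(parents[child]) < k and not creates_cycle(parents, child, parent):
                parents[child].add(parent)                # addArc candidate
                gain = score(parents) - current
                parents[child].discard(parent)            # undo trial move
                move = ("add", child, parent)
            else:
                continue
            if gain > best_gain:
                best_gain, best_move = gain, move
        if best_move is None:                             # stopping criterion met
            return parents
        op, child, parent = best_move
        (parents[child].add if op == "add" else parents[child].discard)(parent)
        current += best_gain
```

Note that starting from NB only additions are ever tried, since there are no attribute arcs to delete, matching the slide's remark.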
  26. Control Criteria
     Monitoring of the model error ErrS by means of its learning curve, the model-LC (ErrS is the proportion of examples misclassified by the classifiers induced with the same current structure).
     The performance stops improving if, for the last observed points (we use the last 7 points):
     1. the model-LC starts behaving well: the curve is convex and monotonically non-increasing
     2. its slope is gentle
     Two situations are detected:
     - S1, the performance stops improving at the desirable tempo: met if the slope falls below a given threshold δ1 (the threshold for the gentle slope)
     - S2, the performance has reached a plateau: met if the slope falls below a given threshold δ2 (the threshold for the plateau), with δ2 < δ1
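One way to operationalize these criteria over the last seven points of the model-LC. The slope estimate and the discrete convexity test below are our own choices, not necessarily the thesis's exact formulas:

```python
def lc_state(errors, delta1, delta2, window=7):
    """Classify the tail of a learning curve (illustrative sketch).

    errors: the model-LC as a sequence of error values, most recent last.
    Returns "improving", "stop_improving" (slope < delta1) or
    "plateau" (slope < delta2), assuming delta2 < delta1.
    """
    tail = errors[-window:]
    if len(tail) < window:
        return "improving"                       # not enough evidence yet
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    non_increasing = all(d <= 0 for d in diffs)
    convex = all(d2 >= d1 for d1, d2 in zip(diffs, diffs[1:]))  # 2nd differences >= 0
    if not (non_increasing and convex):
        return "improving"                       # curve not yet "behaving well"
    slope = abs(tail[-1] - tail[0]) / (window - 1)   # mean decrease per point
    if slope < delta2:
        return "plateau"                         # candidate for state S2
    if slope < delta1:
        return "stop_improving"                  # candidate for state S1
    return "improving"
```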
  27. An Example with the Adult Dataset
  28. Experimental Results with Three UCI Datasets
     - The adaptive algorithms Adap1 and Adap2 approach the performance of the best k-DBC induced from scratch at each learning step
     - Adap2 additionally implements Iterative Bayes (Gama, 2003) to improve the parameter estimates
     - Gradually extending the search space and adding attribute dependences allows the classification results to improve over time
  29. Final Results for Three UCI Datasets
     - Adaptive algorithms can perform a more artful cost-quality trade-off than their non-adaptive versions: the final error approaches the final error of the best k-DBC while considerably reducing the cost of updating (few structure adaptations)
     - Adap2 ensures a better cost-quality trade-off: both the number of structure adaptations and the resulting error are smaller
     Results per dataset (Last Error = % error on the last batch; Add. Arcs = added arcs; Str. Adap. = # structure adaptations; Str. Chang. = # structure changes; Best = the best fixed k-DBC for each dataset):

     Balance     Last Error   Add. Arcs   Str. Adap.   Str. Chang.
       NB           8.10          -           -            -
       1-DBC        5.80         3.0        100          11.2
       2-DBC        4.60         5.0        100          17.6
       3-DBC        0.00         6.0        100          18.5
       Best         0.00         6.0        100          18.5
       Adap1        0.30         5.6         4.7          3.2
       Adap2        0.00         6.0         4.2          3.8

     Nursery     Last Error   Add. Arcs   Str. Adap.   Str. Chang.
       NB          12.00          -           -            -
       1-DBC        5.80         5.4        128           6.4
       2-DBC        7.40         8.0        128           8.8
       3-DBC        7.20         9.0        128           9.8
       Best         5.80         5.4        128           6.4
       Adap1        6.80         8.0        18.8          6.0
       Adap2        4.40         7.0        11.6          5.2

     Adult       Last Error   Add. Arcs   Str. Adap.   Str. Chang.
       NB          16.80          -           -            -
       1-DBC       14.60        12.0        160           8.6
       2-DBC       14.60        18.0        160          13.4
       3-DBC       14.60        18.0        160          13.2
       Best        14.60        12.0        160           8.6
       Adap1       13.60        16.8         4.0          3.2
       Adap2       13.20        12.2         2.6          2.4
  30. Outline
     - Introduction
       - The Supervised Classification Problem
       - Naïve Bayes Classifier
       - Improving Naïve Bayes
     - Bayesian Network Classifiers
       - k-Dependence Bayesian Network Classifiers
       - Learning Bayesian Network Classifiers
         - Choosing a class-model
     - The Adaptive Prequential Framework for Supervised Learning
       - Cost-Quality Management
       - Concept Drift Management
     - Conclusions and Future Directions
  31. The Concept Drift Problem
     Concept drift: changes in the target function over time. Concept drift (gradual changes) vs. concept shift (abrupt changes).
     Its effect is a severe degradation in accuracy; concept drift scenarios therefore require adaptive learning algorithms, able to adapt the learner quickly to these changes.
     STAGGER concepts: Schlimmer & Granger (1986) used them to test the first concept-drift tracking system: a simple blocks world defined by 3 nominal attributes (size, color, shape), with a sequence of 3 target functions (concepts).
     [Figure: accuracy over time under the STAGGER concepts — Naive Bayes (dashed) vs. an adaptive algorithm (solid)]
  32. Concept Drift Management
     Two strategies:
     1. Adapt the learner regularly, without considering whether changes have really happened
        - e.g. weighted examples (Kunisch 1996, Allan 1996, Balanovich 1996); time windows of fixed size
     2. First try to detect concept changes, and then adapt the learner accordingly
        - to detect concept changes, some indicators (performance measures, properties of the data, etc.) are monitored over time
        - if the monitoring process detects changes, actions are taken to adapt the learner
        - e.g. the FLORA algorithms (Widmer & Kubat 1996); Klinkenberg & Renz (1998, 2000); Lanquillon (2001)
  33. A P-Chart for Handling Concept Drift
     - Monitoring of the batch error ErrB by means of a Shewhart P-Chart
     - The target value used to set the center line is the minimum value of the current model-LC
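A sketch of the P-Chart logic as commonly defined for proportions. The center-line choice follows the slide, while the warning/control limits and their mapping to alert and drift signals are standard P-Chart conventions that we assume here rather than the thesis's exact thresholds:

```python
import math

def p_chart_state(batch_error, batch_size, target_p,
                  warn_sigma=2.0, control_sigma=3.0):
    """Classify one batch error with a Shewhart P-Chart (illustrative sketch).

    target_p: the center line, set (per the slide) to the minimum of the
              current model-LC.
    Returns "in_control", "alert" (above the warning limit) or
    "out_of_control" (above the upper control limit -> concept change signal).
    """
    sigma = math.sqrt(target_p * (1.0 - target_p) / batch_size)
    uwl = target_p + warn_sigma * sigma        # upper warning limit
    ucl = target_p + control_sigma * sigma     # upper control limit
    if batch_error > ucl:
        return "out_of_control"
    if batch_error > uwl:
        return "alert"
    return "in_control"

# e.g. with a center line of 0.10 and batches of 100 examples:
# p_chart_state(0.22, 100, 0.10)  # -> "out_of_control"
```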
  34. The Adaptive Algorithm for Handling Concept Drift
  35. An Example with Gradual Changes for a Generated Concept Drift Scenario
     [Figure: evolution of the complexity parameter k over time — 0 1 / 0 0 0 1 2 3 / 0 1 2 3 4 / 0 1 2 3 4 — k is reset to 0 when a change is detected and grows again as data accumulates]
  36. Results with Generated Concept Shift and Drift Scenarios
     - Significant improvements in performance are achieved by using adaptive algorithms instead of their non-adaptive versions
     - Adaptive algorithms can control the performance, improving it back to a level that even approaches that of the true generative model
  37. Conclusions
     - All the adaptive algorithms are integrated into the unified framework AdPreqFr4SL, which tries to handle the cost vs. quality trade-off and concept drift
       - the adaptive strategy, based upon bias management and adaptation control, automatically selects the appropriate class-model for the available data
       - the adaptive and control policies for bias management could be applied to any algorithm based on parametric models and discrete search with hierarchical, increasing control over the complexity
       - the drift detection method based on the P-Chart has been demonstrated to be effective for handling gradual and abrupt concept changes; it is a simple, well-argued, statistically driven method, independent of the learning algorithm, which makes it broadly applicable
  38. Future Work
     - Issues in sequential updating of Bayesian Networks:
       - include sophisticated data structures and methods for storing and computing the sufficient statistics in an incremental fashion
       - explore the implementation of backtracking strategies (e.g. floating search methods) to avoid halting at local maxima
     - Performing feature subset selection:
       - by reducing the number of attributes we reduce the number of parameters and hence the variance of the test error; use the hill climber with the restriction that not all the attributes must be added to the Naïve Bayes structure
     - Application to real-world on-line systems:
       - AdPreqFr4SL can be applied in real-world applications where the initial model needs to be refined in the light of new data and where concept drift can occur (e.g. user modeling)
  39. Thank you for your attention
