Learning Bayesian Belief Network Classifiers for Proteome Analyst

CMPUT551 Term Project Report
Zhiyong Lu, James Redford, Xiaomeng Wu
April 26, 2002
Table of Contents

1. ABSTRACT
2. INTRODUCTION
   2.1 Description of the Task
   2.2 Motivation
   2.3 The Proteome Analyst
   2.4 Our Solutions
   2.5 Problems and Challenges
3. RELATED WORK
   3.1 Proteome Analyst
   3.2 NB vs. TAN
   3.3 Discriminative Learning
4. APPROACHES
   4.1 Overview
   4.2 NB (generative vs. discriminative)
   4.3 TAN
   4.4 Neural Networks
   4.5 Wrapper (Information Content)
   4.6 Other Approaches
5. EMPIRICAL ANALYSIS
   5.1 Experimental Setup
       5.1.1 Background on the Data Set
       5.1.2 Training and Testing
   5.2 Comparison of NB, TAN, and NN
   5.3 Generative vs. Discriminative
   5.4 Feature Selection (Wrapper)
   5.5 Miscellaneous Learning Algorithms
   5.6 Computational Efficiency
6. CONCLUSIONS AND FUTURE WORK
7. REFERENCES
8. APPENDIX
1. Abstract

In this course project, we investigate several machine learning techniques for a specific task, the Proteome Analyst. Naïve Bayes has been applied to this problem with considerable success; however, it makes independence assumptions about the data distribution that are clearly not true of real-world proteome data. We empirically evaluate several variants of Naïve Bayes, varying both the method by which parameters are learned (generative vs. discriminative learning) and the belief network structure (Naïve Bayes vs. TAN). We also implement a Neural Network learner and use existing tools such as the WEKA data mining system to perform an empirical analysis of these approaches on the proteome function prediction problem.

This report is organized as follows. In Section 2 we introduce the task, our motivation, and the challenges we face. In Section 3 we review previous work on Proteome Analyst and discuss alternative solutions to the classification problem. In Section 4 we present the machine learning techniques in detail, along with our implementations. In Section 5 we examine the proteome classification application in detail and show comparative results for the different techniques. We conclude and point out future research directions in Section 6. Finally, the Appendix contains all the experimental data used in this report.
2. Introduction

2.1 Description of the Task

Recently, more than 60 bacterial genomes and 5 eukaryotic genomes have been completed. This explosion of DNA sequence data is leading to a concomitant explosion in protein sequence data. Unfortunately, the function of over half of these protein sequences is unknown. The proteome function prediction problem has therefore emerged as an interesting research topic in bioinformatics. In our project, we are given protein sequences with known classes; our goal is to apply several machine learning techniques to predict the classes of unknown protein sequences. This is a typical machine learning classification problem: learn from existing experience to perform the task better.

2.2 Motivation

Typically it takes months or even years to determine the function of even a single protein using standard biochemical approaches. A much quicker alternative is to use computational techniques to predict protein function. Although existing algorithms such as Naïve Bayes are available for proteome function prediction, they often make assumptions about the data distribution that are clearly not true of real-world proteome data. The challenge is that we need more general algorithms that do not rely on these assumptions yet still achieve high-throughput performance, in terms of both classification accuracy and execution time.

2.3 The Proteome Analyst

Proteome Analyst is an application designed by the PENCE group at the University of Alberta that carries out protein classification. The input to the Proteome Analyst is a protein sequence, and the output is a predicted classification. Figure 2.5.1 shows the architecture of the Proteome Analyst. The input protein sequence is initially fed through PsiBlast, a tool that performs sequence alignment against a database, in this case SwissProt.
The three best alignment matches, called homologues, returned by PsiBlast are in turn passed into a tokenizer. The tokenizer retrieves the text descriptions of the homologues from the SwissProt database and then extracts a number of text tokens from these descriptions. These tokens are used as input to the classifier. Currently, the PENCE classifier is implemented as a Naïve Bayesian network (NB). The features used by the NB are binary and
correspond to the tokens. If a token exists in the input sequence's description then the value of the corresponding feature is 1; otherwise the value is 0. The output of the NB is the classification of the input sequence.

[Figure 2.5.1: Data flow architecture of the Proteome Analyst. Boxes, ovals, and arrows represent data, filters, and data flow respectively. SwissProt is a database.]

For our project, we are only concerned with the classifier portion of the Proteome Analyst. We used data files that were already tokenized, with the data records already converted into classified vectors of binary features. See Table 2.5.1 for an example.
Class  F1  F2  F3  F4
A      1   0   0   0
A      0   1   0   1
B      1   0   1   0
B      0   1   1   0
B      0   0   1   1
B      0   0   1   1
B      1   1   1   1
B      0   0   1   0

Table 2.5.1: An example of the format of the data files used in our project.

2.4 Our Solutions

Naïve Bayes has been applied to proteome function prediction with considerable success by the PENCE group at the University of Alberta. We therefore focus on two areas: the method by which parameters are learned, and the structure of the belief network (BN). We also try some other machine learning techniques, such as Neural Networks and Support Vector Machines (SVMs), on this specific problem. Our goal is to find the classifier with the best performance in terms of both classification accuracy and execution time. The following is a summary of the machine learning techniques we have applied in our project:

• Naïve Bayes (generative vs. discriminative learning)
• TAN (Tree-Augmented Naïve Bayes)
• Neural Networks
• Decision Tree, Rule Learner, etc. (using the WEKA data mining system)
• Support Vector Machine

2.5 Challenges and Problems

Our evaluation criteria for these machine learning techniques are mainly two:

• Classification accuracy
• Execution time

In our experiments on real data we found that, overall, the Naïve Bayesian classifier outperforms the other techniques, even though it achieves neither the best classification accuracy nor the shortest execution time. Most of the other techniques perform better than Naïve Bayes in one respect but lose significantly in the other. For example, the Decision Tree classifier consistently achieves 5 to 10 percentage points higher accuracy than Naïve Bayes, but takes more than 5 times as long to train. On the other hand, OneR in WEKA, another classifier, is easily
trained but has an accuracy of only 30%, which makes it unsuitable for our task. Interestingly, we found an alternative approach, the SVM (Support Vector Machine), which achieves better classification accuracy with execution time comparable to Naïve Bayes.
3. Related Work

3.1 Proteome Analyst

PA (Proteome Analyst) is an application designed by the PENCE group at the University of Alberta that performs protein classification. Currently, a PA user can upload a proteome that consists of an arbitrary number of protein sequences in FastA format. A PA user can configure PA to perform several function prediction operations and can set up a workflow that applies these operations in various orders, under various conditions. PA can be configured to use homology sequence comparison to compare each protein against a database of sequences with known functions. Any sequence with high sequence identity can then be assigned the function of its homologues and, optionally, removed from further analysis. One or more classification-based function predictors (built using machine learning techniques) can also be applied to any sequence. More importantly, PA users can easily train their own custom classification-based predictors and apply them to their sequences. Many other function prediction operations are currently being developed and will be added to PA.

3.2 NB vs. TAN

The NB and TAN components of this project were primarily based on work done by Friedman, Geiger, and Goldszmidt as described in their 1997 paper "Bayesian Network Classifiers" [1]. Friedman et al. compare NBs to TANs on a variety of data sets. They found that in most cases TAN methods were more accurate than Naïve Bayesian methods. Our goal is to determine whether TANs are more accurate than NBs for the PENCE data sets. Jia You and Russ Greiner, from the University of Alberta, have also done work comparing different Bayesian classifiers, including NB and TAN classifiers [4].

3.3 Discriminative Learning

Naïve Bayes and TAN are two different types of belief net structure; both first learn a network structure and then fill in the CP table attached to each node.
Essentially, all of these learners use the parameters that maximize the likelihood of the training samples [6]. Their goal is to produce a model as close as possible to the distribution of the data, which is the core idea of generative classification.
In general, there are two approaches to making classification decisions: generative learning and discriminative learning. Generative learning builds a model of the input examples in each class and classifies based on how well the resulting class-conditional models explain a new input example. Discriminative learning views the classification problem from a quite different angle: it aims to maximize classification accuracy directly, instead of building the model closest to the underlying distribution. Thus, after fixing a structure, much more effort is put into seeking the parameters that maximize the conditional likelihood of the class label ci given the instance ei. R. Greiner and W. Zhou have done related research on discriminative parameter learning of belief net classifiers in general and found that this kind of learning works effectively over a wide variety of situations [5].
4. APPROACHES

4.1 Overview

In our project, more than one machine learning technique has been adopted, each with its own advantages and perspective. Among probabilistic learners, we have implemented the Naïve Bayesian network (generative and discriminative) and TAN. (Note: our implementation differs from the existing system the PENCE group uses. All the code is built from scratch, and the id3 file format is adopted.) For these two classifiers, various experiments have been carried out by tuning different parameters.

One important characteristic of our project is dealing with data sets that have thousands of features. All three data sets we investigate (ecoli, yeast, and fly) have more than 1500 features. Problems arise with this many features: irrelevant features provide little information, noisy features can make results worse, and standard algorithms do not scale well. The "wrapper" approach (with information content) is what we adopted to handle feature selection.

For Neural Networks, we extended existing code for our project. We have also tried several other techniques using available implementations, such as the WEKA suite and an SVM package (for multiple classes). For all of the experiments with the above algorithms, cross-validation is used to obtain a reliable testing accuracy. We also take running efficiency into account in terms of execution time. Finally, comparisons of these different algorithms are presented.

4.2 NB (generative vs. discriminative)

4.2.1 Overview

The Naïve Bayesian network is one of the most practical learning methods for classification problems. It is applicable when dealing with large training data sets, under the assumption that the attributes describing instances are conditionally independent given the classification label.
The structure of a Naïve Bayesian network is simple and elegant, based on the assumption that attributes are independent given the class label. Nodes are variables, and links between nodes represent causal dependency. In NB, the node for the class label serves as the root of the tree, all the features are child nodes of the root, and there are no edges among the feature nodes. Each node is attached
to one CP table, which holds the parameters to learn for this structure. Every entry in the CP table has the form P(child | parent). In generative learning, the CP table entries are populated with empirical frequency counts. In discriminative learning for a given fixed structure (NB here), the CP tables are updated from the labeled training queries so as to optimize the classification error score. Inference in NB is based on Bayes' theorem and is carried out by picking argmax_vj [ P(vj) Π_i P(ai | vj) ], where the ai are the attribute values and vj is a class label.

4.2.2 Learning Structure

The Naïve Bayes learning algorithm proceeds as follows:

For each target value vj:
    P'(vj) ← estimate of P(vj)
    For each value ai of each attribute A:
        P'(ai | vj) ← estimate of P(ai | vj)

Indeed, at this point we have also filled in each CP table for generative learning.

4.2.3 Discriminative Learning

As noted above, the parameters set by generative learning need not maximize classification accuracy. A good classifier is one that produces the appropriate answers to unlabeled instances as often as possible [5]. The classification error is usually defined as:

Err = P( class(e) ≠ c )    over labeled instances <e, c>

This can be approximated by the empirical score over a sample S:

Err' = (1/|S|) Σ_{<e,c> in S} [ class(e) ≠ c ]

In discriminative learning for a Naïve Bayesian network, the goal is to learn the CP table entries, for the given NB structure, that produce the smallest empirical error score above. To this end, the log conditional likelihood of the NB over the distribution of labeled instances is used:

LCL = Σ_{<e,c>} P(e, c) log P(c | e)

which can likewise be approximated over the sample S by:

LCL' = (1/|S|) Σ_{<e,c> in S} log P(c | e)
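Before turning to the gradient procedure, the generative learner just described (the counting step of Section 4.2.2 plus the argmax inference rule) can be sketched in a few lines. This is only an illustrative Python sketch, not the project's C implementation; the helper names are ours, and the init-to-1 count smoothing is an assumption borrowed from the convention this report uses in the TAN section.

```python
from collections import defaultdict

# Records from Table 2.5.1: (class label, binary feature vector)
RECORDS = [('A', (1, 0, 0, 0)), ('A', (0, 1, 0, 1)),
           ('B', (1, 0, 1, 0)), ('B', (0, 1, 1, 0)), ('B', (0, 0, 1, 1)),
           ('B', (0, 0, 1, 1)), ('B', (1, 1, 1, 1)), ('B', (0, 0, 1, 0))]

def train_nb(records):
    """Generative NB learning: just count class and (class, feature, value)
    frequencies. Feature buckets start at 1 to avoid zero probabilities."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: 1)   # (class, feature index, value) -> count
    for c, feats in records:
        class_count[c] += 1
        for i, v in enumerate(feats):
            feat_count[(c, i, v)] += 1
    return class_count, feat_count

def classify(class_count, feat_count, feats):
    """Pick argmax_c P(c) * prod_i P(a_i | c) from the smoothed counts."""
    total = sum(class_count.values())
    best, best_score = None, -1.0
    for c, nc in class_count.items():
        score = nc / total
        for i, v in enumerate(feats):
            # denominator nc + 2: the two value buckets each start at 1
            score *= feat_count[(c, i, v)] / (nc + 2)
        if score > best_score:
            best, best_score = c, score
    return best
```

Running `classify` on the Table 2.5.1 data assigns, for example, the vector (0, 0, 1, 1) to class B, since most B records set feature F3.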
To obtain the CP table entries with the optimal conditional likelihood, a simple gradient-descent algorithm is used [5]. The CP tables are initialized in the usual (generative) way, and then the empirical score is improved by changing the value of each CP table entry. In the implementation, "softmax" parameters are adopted. Their advantage is that they preserve the probability properties: each entry stays in the range between 0 and 1, and each row sums to one. Therefore, similarly to the weight updates in a Neural Net, given a set of labeled queries in the training phase, the learning algorithm descends in the direction of the total derivative, which is the sum of the individual derivatives. For a single labeled instance <e, c>, the partial derivative with respect to the softmax parameter of CP table entry θ_{r|f} is:

P(r, f | e, c) − P(r, f | e) − θ_{r|f} [ P(f | e, c) − P(f | e) ]

In the specific NB structure, the computation of this derivative is relatively inexpensive, because for each CP table considered, the parent is the class label node; this special belief network greatly reduces the computational complexity of the implementation. There are also other speed-up techniques, such as a line search to determine the learning rate, and conjugate gradient. We did not try these, but we did take advantage of the observation that when R is independent of C given E, the derivative is zero [5].

4.3 TAN

4.3.1 Overview

Tree-Augmented Naïve Bayesian networks (TAN [1]) are one approach we took in this project. TANs are similar to regular NBs, but the features of a TAN are organized into a tree structure. An example is given in Figure 4.3.1.

[Figure 4.3.1: An example of a tree-augmented Naïve Bayesian network, with a class node and feature nodes F1 through F5.]
The CP tables of a TAN are also similar to those of an NB. The difference is that all of the CP tables for the feature nodes have an extra column to account for the extra parent, except for the root node of the tree. Figure 4.3.2 shows an example.

Class  Parent  Fi = 0  Fi = 1
C1     0       .556    .444
C1     1       .200    .800
C2     0       .101    .899
C2     1       .750    .250

Figure 4.3.2: An example of a CP table for a TAN

4.3.2 Learning Structure

Below is the algorithm we used to learn the TAN structure, taken from [1].

1. Calculate the conditional mutual information Ip between every pair of features F1 and F2, given the classification C:

   Ip(F1; F2 | C) = Σ_{f1,f2,c} P(F1=f1, F2=f2 | C=c) log [ P(F1=f1, F2=f2 | C=c) / ( P(F1=f1 | C=c) P(F2=f2 | C=c) ) ]

   In our case we avoid zero probabilities by initializing the count buckets for P(f1, f2 | c) to 1 instead of 0.

2. Construct a complete undirected graph in which every feature is a node. Set the weight of each edge to the corresponding Ip value between the two features.

3. Extract the maximum weighted spanning tree from the graph. In our case, we used Kruskal's minimum weighted spanning tree algorithm [2], modified slightly to find the maximum weighted spanning tree.

4. Choose a node to be the root and direct all edges in the spanning tree away from it, creating a tree. In our case we chose the feature with the highest information gain to be the root node.

5. Add the classification node and make it a parent of all of the feature nodes.

4.3.3 Learning CP Parameters and Classification

Given a data record with m features f1, … , fm:

Class = argmax_c { P(c) Π_i P(fi | p(fi), c) }

where 1 <= i <= m, and p(fi) is the value of feature fi's parent. We consider the root node to be its own parent ( p(froot) = froot ).
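The classification rule above can be sketched in Python using its equivalent count form (derived below as Class = argmax_c { nc Π_i ncijk / ncik }). This is our illustrative sketch, not the project's C code; the tree shape and the init-to-1 count convention are the ones the worked example below arrives at (F4 as root and its own parent, F1 a child of F4, F2 and F3 children of F1), and all helper names are ours.

```python
# Records from Table 4.3.1: (class, (F1, F2, F3, F4)), 0-indexed as 0..3.
DATA = [('A', (1, 0, 0, 0)), ('A', (0, 1, 0, 1)),
        ('B', (1, 0, 1, 0)), ('B', (0, 1, 1, 0)), ('B', (0, 0, 1, 1)),
        ('B', (0, 0, 1, 1)), ('B', (1, 1, 1, 1)), ('B', (0, 0, 1, 0))]
# Parent of each feature in the learned tree; the root is its own parent.
PARENT = {0: 3, 1: 0, 2: 0, 3: 3}

def learn_counts(data, parent):
    """Count nc, ncijk, ncik with every (j, k) bucket initialized to 1,
    reproducing Tables 4.3.3-4.3.5 of the worked example."""
    n_c, n_cijk, n_cik = {}, {}, {}
    for c in {c for c, _ in data}:
        n_c[c] = 4                        # four (j, k) buckets, each starting at 1
        for i in parent:
            for k in (0, 1):
                n_cik[(c, i, k)] = 2      # sum of the two child buckets below
                for j in (0, 1):
                    n_cijk[(c, i, j, k)] = 1
    for c, f in data:
        n_c[c] += 1
        for i, j in enumerate(f):
            n_cijk[(c, i, j, f[parent[i]])] += 1
            n_cik[(c, i, f[parent[i]])] += 1
    return n_c, n_cijk, n_cik

def tan_score(c, f, parent, n_c, n_cijk, n_cik):
    """Unnormalized score nc * prod_i ncijk / ncik for record f and class c."""
    score = float(n_c[c])
    for i, j in enumerate(f):
        k = f[parent[i]]
        score *= n_cijk[(c, i, j, k)] / n_cik[(c, i, k)]
    return score
```

On the unclassified record (1, 1, 1, 0) this reproduces the example's scores of 0.296 for class A and 1.200 for class B, so the record is labeled B.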
Class = argmax_c { P(c) Π_i P(fi, p(fi), c) / P(p(fi), c) }
Class = argmax_c { nc Π_i ncijk / ncik }

where:
nc is the number of records with class c;
ncijk is the number of records with class c where fi = j and p(fi) = k;
ncik is the number of records with class c where p(fi) = k.

We simply count up the nc, ncijk, and ncik values to learn the CP table entries. Again, to avoid problems when these counts are 0, we simply initialize all entries to 1.

4.3.4 Example

Let us say we are given the data in Table 4.3.1. First we determine the structure of the TAN. Step 1 is to calculate the conditional mutual information between every pair of features.

Class  F1  F2  F3  F4
A      1   0   0   0
A      0   1   0   1
B      1   0   1   0
B      0   1   1   0
B      0   0   1   1
B      0   0   1   1
B      1   1   1   1
B      0   0   1   0

Table 4.3.1: Data for the TAN example

P(F1=0, F2=0 | C=A) = 1/6     P(F1=0, F2=1 | C=A) = 2/6
P(F1=1, F2=0 | C=A) = 2/6     P(F1=1, F2=1 | C=A) = 1/6
P(F1=0, F2=0 | C=B) = 4/10    P(F1=0, F2=1 | C=B) = 2/10
P(F1=1, F2=0 | C=B) = 2/10    P(F1=1, F2=1 | C=B) = 2/10

P(F1=0 | C=A) = 3/6     P(F1=1 | C=A) = 3/6
P(F1=0 | C=B) = 6/10    P(F1=1 | C=B) = 4/10
P(F2=0 | C=A) = 3/6     P(F2=1 | C=A) = 3/6
P(F2=0 | C=B) = 6/10    P(F2=1 | C=B) = 4/10

Note that we started each of the above count buckets at 1 instead of 0 before counting. That explains why the denominators are 6 and 10, instead of 2 and 6 respectively.
Using the above values we get:

Ip(F1; F2 | C) = 1/6 log( (1/6) / (3/6 × 3/6) ) + 2/6 log( (2/6) / (3/6 × 3/6) )
              + 2/6 log( (2/6) / (3/6 × 3/6) ) + 1/6 log( (1/6) / (3/6 × 3/6) )
              + 4/10 log( (4/10) / (6/10 × 6/10) ) + 2/10 log( (2/10) / (6/10 × 4/10) )
              + 2/10 log( (2/10) / (4/10 × 6/10) ) + 2/10 log( (2/10) / (4/10 × 4/10) )
            = −0.0293485 + 0.0416462 + 0.0416462 − 0.0293485
              + 0.0183030 − 0.0158362 − 0.0158362 + 0.0193820
            = 0.0306079

(The logarithms here are base 10.) And similarly we get:

Ip(F1; F3 | C) = 0.0022286
Ip(F1; F4 | C) = 0.0245954
Ip(F2; F3 | C) = 0.0022286
Ip(F2; F4 | C) = 0.0245954
Ip(F3; F4 | C) = 0

Step 2 is to create a complete undirected graph where the features are the nodes and the Ip values are the edge weights. A graphical representation of this graph is shown in Figure 4.3.3. In our implementation we represent the graph as an array of triplets <n1, n2, w>, where n1 and n2 are the nodes that the edge connects and w is the weight of the edge.

Graph = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246>,
          <2, 3, 0.0022>, <2, 4, 0.0246>, <3, 4, 0> }

[Figure 4.3.3: The conditional mutual information graph for the TAN example.]

Step 3 is to extract a maximum weighted spanning tree from the graph. Our algorithm generates the following maximum spanning tree, also shown in Figure 4.3.4.
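Steps 1 through 3 can be sketched as follows, using the example's conventions (count buckets initialized to 1, base-10 logarithms, ties in Kruskal's algorithm broken by node order). This is our illustrative Python, not the project's C implementation, and the helper names are ours.

```python
import math

# Records from Table 4.3.1: (class, (F1, F2, F3, F4)), features 0-indexed.
DATA = [('A', (1, 0, 0, 0)), ('A', (0, 1, 0, 1)),
        ('B', (1, 0, 1, 0)), ('B', (0, 1, 1, 0)), ('B', (0, 0, 1, 1)),
        ('B', (0, 0, 1, 1)), ('B', (1, 1, 1, 1)), ('B', (0, 0, 1, 0))]

def cond_mutual_info(data, i, j):
    """Ip(Fi; Fj | C) with joint-count buckets initialized to 1 and
    base-10 logs, matching the worked example's numbers."""
    total = 0.0
    for c in sorted({c for c, _ in data}):
        joint = {(a, b): 1 for a in (0, 1) for b in (0, 1)}   # init to 1
        for cl, f in data:
            if cl == c:
                joint[(f[i], f[j])] += 1
        n = sum(joint.values())
        for (a, b), cnt in joint.items():
            p_ab = cnt / n
            p_a = (joint[(a, 0)] + joint[(a, 1)]) / n
            p_b = (joint[(0, b)] + joint[(1, b)]) / n
            total += p_ab * math.log10(p_ab / (p_a * p_b))
    return total

def max_spanning_tree(n, weight):
    """Kruskal's algorithm on negated weights yields the maximum
    weighted spanning tree; ties are broken by node order."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    edges = sorted((-weight(i, j), i, j)
                   for i in range(n) for j in range(i + 1, n))
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Run on the Table 4.3.1 data, this reproduces Ip(F1; F2 | C) = 0.0306079 and the spanning tree edges {<1, 2>, <1, 3>, <1, 4>} (0-indexed in the code).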
MaxSpanTree = { <1, 2, 0.0306>, <1, 3, 0.0022>, <1, 4, 0.0246> }

It is easy to verify that this is indeed a maximum weighted spanning tree.

[Figure 4.3.4: A maximum weighted spanning tree for the TAN example.]

In step 4 we choose the feature with the highest information content to be the root node. The information contents of the features are given in Table 4.3.2. The following formula was used to calculate them:

Gain(F) = − P(F=0) log2( P(F=0) ) − P(F=1) log2( P(F=1) )

Feature  Information Content
F1       0.954434
F2       0.954434
F3       0.811278
F4       1.000000

Table 4.3.2: The information content of the features for the TAN example

We see that feature F4 has the highest information content, so it becomes the root node. Step 5 involves simply adding the classification node as a parent of all the other nodes. Figure 4.3.5 shows the final TAN structure.

Now that the structure is set, we need to learn the CP table entries. The parameters required are the ncijk, ncik, and nc values described in Section 4.3.3. Tables 4.3.3, 4.3.4, and 4.3.5 show the CP tables that contain the ncijk, ncik, and nc entries respectively for our example. Again, remember that the ncijk entries were initialized to 1, not 0.
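The Gain formula and the Table 4.3.2 values can be checked with a few lines. This is our sketch (raw, unsmoothed frequencies and base-2 logs, as the formula specifies); the function name is ours.

```python
import math

def information_content(values):
    """Gain(F) = -P(F=0) log2 P(F=0) - P(F=1) log2 P(F=1),
    computed from raw frequencies of a binary feature column."""
    n = len(values)
    h = 0.0
    for v in (0, 1):
        p = values.count(v) / n
        if p > 0:                 # 0 * log 0 is taken as 0
            h -= p * math.log2(p)
    return h

# Feature columns read off Table 4.3.1
F1 = [1, 0, 1, 0, 0, 0, 1, 0]
F3 = [0, 0, 1, 1, 1, 1, 1, 1]
F4 = [0, 1, 0, 0, 1, 1, 1, 0]
```

These columns give 0.954434, 0.811278, and 1.000000 respectively, matching Table 4.3.2 and confirming F4 as the root.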
[Figure 4.3.5: The final TAN structure for the TAN example: F4 is the root, F1 is a child of F4, F2 and F3 are children of F1, and the class node is a parent of every feature node.]

P(F1 | F4, Class):            P(F2 | F1, Class):
Class  F4  F1=0  F1=1         Class  F1  F2=0  F2=1
A      0   1     2            A      0   1     2
A      1   2     1            A      1   2     1
B      0   3     2            B      0   4     2
B      1   3     2            B      1   2     2

P(F3 | F1, Class):            P(F4 | F4, Class):
Class  F1  F3=0  F3=1         Class  F4  F4=0  F4=1
A      0   2     1            A      0   2     1
A      1   2     1            A      1   1     2
B      0   1     5            B      0   4     1
B      1   1     3            B      1   1     4

Table 4.3.3: The ncijk CP table entries for the TAN example

Class  F1=0  F1=1         Class  F2=0  F2=1
A      3     3            A      3     3
B      6     4            B      6     4

Class  F3=0  F3=1         Class  F4=0  F4=1
A      4     2            A      3     3
B      2     8            B      5     5

Table 4.3.4: The ncik CP table entries for the TAN example

Class = A:  6
Class = B:  10
Table 4.3.5: The nc CP table entries for the TAN example

Now that both the structure and the CP table entries have been learned, we can attempt to classify new instances. Consider the following unclassified record:

Class  F1  F2  F3  F4
?      1   1   1   0

P(Class = A) ∝ nA × (nA110 × nA211 × nA311 × nA400) / (nA40 × nA11 × nA11 × nA40)
            = 6 × (2 × 1 × 1 × 2) / (3 × 3 × 3 × 3) = 0.296

P(Class = B) ∝ nB × (nB110 × nB211 × nB311 × nB400) / (nB40 × nB11 × nB11 × nB40)
            = 10 × (2 × 2 × 3 × 4) / (5 × 4 × 4 × 5) = 1.200

Therefore we classify this example as 'B', since P(Class = B) > P(Class = A).

4.3.5 Validation

We validated our TAN implementation by running it on the above example and analyzing the verbose debugging output. We verified that the results from that run were identical to the results given above.

4.4 Neural Nets

4.4.1 Overview

Artificial neural network learning provides a practical method for learning real-valued and vector-valued functions over continuous and discrete-valued attributes, in a way that is robust to noise in the training data. The Backpropagation algorithm [3] is the most common network learning method and has been successfully applied to a variety of learning tasks, such as handwriting recognition and robot control. Neural Nets were one of the major techniques covered in the class.

4.4.2 Implementation

As opposed to Naïve Bayes and TAN, which we implemented from scratch, our implementation of Neural Nets is based on our assignment 4 from class. We modified the Backpropagation code that was originally written for a face recognition problem. The major modification was to the input: instead of input nodes representing images, each input node now represents one of the features of our protein sequences. The output nodes, instead of representing the user's head position or user id, now represent the different class labels. Lastly, we changed the code for
estimating the classification accuracy, since the two problems are quite different in this respect. Our strategy for the initial value of each input node is: if a feature appears in the given sequence, the value of that input node is 1; otherwise it is set to 0. Correspondingly, for the output, we set the one output node (out of 14) that represents the correct class of the current sequence to 1, and the other 13 output nodes to 0. The unit weights are set randomly at the beginning.

4.4.3 Example

For a specific protein sequence, the number of input nodes is the number of features. The number of hidden nodes can be specified as a parameter. The number of output nodes in all experiments is 14, since we have 14 different classes in every data set. Each output node represents one of the classes in {a, b, c, d, e, f, g, h, i, j, k, l, m, n}.

[Figure 4.4.1: Learned hidden layer representation (input, hidden, and output layers).]

4.5 Wrapper (Information Content)

4.5.1 Overview

For our particular task, the data sets scale up to thousands of features. Even worse, some of these features are irrelevant and provide little to no information, and features can be noisy. Standard algorithms do not scale well with the number of features, so the approach we use is the "wrapper": try different subsets of features on the learner, estimate the performance of the algorithm with respect to each subset, and keep the subset that performs best. Before selecting subsets, we preprocess (weight) each feature according to its mutual information content, given by the formula below.
Wj = Σv Σc P(y = c, fj = v) log [ P(y = c, fj = v) / ( P(y = c) P(fj = v) ) ]

We can see that this formula treats all the features independently.

4.5.2 Implementation

Step 1: Calculate the information content of each feature. We read in all of the training records first and then use the above formula to compute the mutual information content of each feature. When this preprocessing step is finished, we can begin to train the classifier in the next step.

Step 2: Try different subsets of features. We begin by using all of the features to train the classifier. Then, in each round, we remove the 5% of features that have the lowest information contents and retrain the classifier. After 20 rounds, there are no features remaining. We compare the classification accuracies of these 20 rounds and choose the subset of features that produced the highest prediction accuracy. If two features have the same information content, we choose between them arbitrarily.

4.5.3 Example

Consider the following problem. Suppose we have eight protein sequences in total, each with exactly eight features, belonging to four classes: {C, P, R, M}. In the following table, for a particular protein sequence, the entry for feature i is 1 if feature i appears and 0 otherwise. For example, in the first protein sequence the features I, II, and III appear and the others do not.

Seq.  Class  I  II  III  IV  V  VI  VII  VIII
1     C      1  1   1    0   0  0   0    0
2     C      1  1   1    1   0  0   0    0
3     P      0  0   1    0   0  1   0    1
4     P      0  0   1    0   0  0   1    1
5     R      1  1   1    0   0  0   0    0
6     R      0  1   1    0   0  1   0    0
7     M      0  0   1    0   1  1   0    0
8     M      0  1   1    1   0  0   0    0
Info.        0.352  0.352  0  0.156  0.147  0.102  0.147  0.406

Table 4.5.1: The information content of the eight features over the eight sequences

The last row shows the information content of each feature, computed by the formula given above. As we can see, feature III's information content is 0, which shows that this feature carries the least information about the data. This is expected, since it appears in all eight sequences. On the other hand, whenever feature VIII appears, the corresponding class is P in our example. It is therefore a significant discriminating feature in the data, and accordingly its information content is the highest.

The wrapper works by training a classifier using all the features in the first round. In the following rounds, it removes a fixed number of features per round, starting with those of lowest information content. For example, if we remove one feature at a time in this example, then we iterate through 8 rounds, first removing feature III since its information content is 0, then feature VI, since 0.102 is the smallest among the remaining features. In the last round, only feature VIII remains. We then choose the subset of features with the highest accuracy over the eight rounds.

4.6 Other Approaches

4.6.1 Overview

Besides the primary techniques we implemented (Naïve Bayes, TAN, and Neural Nets), we also applied some others, including both traditional techniques, such as decision trees and rule learners, and a more recent approach, SVMs.

• A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class. Each non-leaf node of the tree contains a split point, a test on one or more features that determines how the data is partitioned. It was the first classifier we learned in our class.
• A rule learner is an alternative classifier that can be built directly by reading off a decision tree: generating a rule for each leaf and making a conjunction of all the tests encountered on the path from the root to that leaf. The advantage of a rule learner is that it is easy to understand, but it sometimes becomes more complex than necessary.
• An SVM (Support Vector Machine) is a method for creating functions from a set of labeled training data. The function can be a classification function or a general regression function. For classification, an SVM operates by finding a hyper-surface in the space of possible inputs. This hyper-surface attempts to split the positive examples from the negative examples, and the split is chosen to have the largest distance from the hyper-surface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for test data that is near, but not identical to, the training data. SVMs are widely used in NLP (Natural Language Processing) problems such as text categorization.

4.6.2 Existing Tools

Instead of implementing all of the classifiers ourselves, we chose to use some existing machine learning tools to make life easier.

• WEKA: Both the Decision Tree and Rule Learner classifiers are used through WEKA. WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. It includes almost all of the standard classification schemes, among them decision trees, rule learners, and Naïve Bayes. However, we will show in the next section that WEKA does not seem capable of dealing with our data sets very well.

• Libsvm: Libsvm is simple, easy-to-use, and efficient software for SVM classification and regression. Although WEKA has an SVM classifier, it only handles binary classification, which is inappropriate for our task since we have 14 classes in our data sets. The most appealing feature of Libsvm is that it supports multi-class classification. In addition, it can solve C-SVM classification, nu-SVM classification, one-class SVM, epsilon-SVM regression, and nu-SVM regression.
5. Empirical Analysis

5.1 Experimental Setup

5.1.1 Background on the Data Set

Our three data sets were provided by the PENCE group at the University of Alberta. Each data set contains thousands of protein sequences with known classes, and each sequence has more than a thousand features. For example, the Ecoli data set has more than two thousand sequences with about 1500 features. See Table 5.1.1.

Data Set   # of classes   # of sequences   # of features
Ecoli      14             2370             1504
Yeast      14             2539             1555
Fly        14             3823             1906

Table 5.1.1: the three data sets: Ecoli, Yeast, and Fly

5.1.2 Training and Testing

We train a classifier on each of the three datasets separately with the different techniques, and use 5-fold cross-validation to compute the validation accuracy. We implemented Naïve Bayes, TAN, and the Neural Nets in C; WEKA is written in Java; and although Libsvm has both C and Java versions, we simply use the C version in our experiments. All experiments ran on a machine in our graduate office, an i686 machine running Linux 7.0 with 415MB of swap memory.

5.2 Comparison of NB, TAN, and NN

Figure 5.1.1 shows a comparison of the Naïve Bayes, Tree Augmented Naïve Bayes, and Neural Net classifiers. The accuracies given were obtained using 5-fold cross validation.
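The 5-fold protocol above can be sketched as follows; `train_fn` and `eval_fn` are hypothetical stand-ins for whichever classifier and accuracy measure are being validated:

```python
def kfold_accuracy(examples, train_fn, eval_fn, k=5):
    """Split the data into k folds; train on k-1 folds, test on the
    held-out fold, and average the k accuracies."""
    fold = len(examples) // k
    scores = []
    for i in range(k):
        held_out = examples[i * fold:(i + 1) * fold]
        training = examples[:i * fold] + examples[(i + 1) * fold:]
        model = train_fn(training)
        scores.append(eval_fn(model, held_out))
    return sum(scores) / k

# Demo with a majority-class "classifier" on made-up (features, label) pairs.
def train_fn(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def eval_fn(model, data):
    return sum(y == model for _, y in data) / len(data)

data = [(x, "a") for x in range(8)] + [(x, "b") for x in range(2)]
acc = kfold_accuracy(data, train_fn, eval_fn)   # the last fold holds both "b"s
```

Because every example is held out exactly once, the averaged score is less sensitive to a lucky or unlucky single train/test split.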
[Figure 5.1.1 contains three bar charts comparing the validation accuracy of NB, TAN, and NN on the Ecoli, Yeast, and Fly datasets: accuracy without the wrapper, best accuracy with the wrapper, and the percentage accuracy improvement from the wrapper.]

Figure 5.1.1: A comparison of the accuracy of NB, TAN, and NN. The first graph shows the comparative accuracies without using the wrapper. The second graph shows the maximum accuracy of each method using the wrapper. The third graph shows the increase in accuracy when the wrapper is used.

We see that the accuracies of the NB and TAN classifiers are roughly equal, both with and without the wrapper, for all three data sets. Given that TANs are more complicated to implement and take longer to train than NBs
[1], it is likely more practical to use NBs for the PENCE data rather than TANs. Neural networks perform noticeably better than both NBs and TANs in terms of accuracy on all three datasets. This suggests that neural network classifiers could be a promising area of future research for the Proteome Analyst tool. The third graph shows the percentage improvement in accuracy obtained by using the wrapper. We note that the wrapper has a similar effect on the NB and TAN classifiers, while it does not help at all in the NN case for the Yeast and Fly datasets.

5.3 Generative vs. Discriminative

The first observation is that discriminative learning enhances classification accuracy. R. Greiner and W. Zhou have shown that discriminative learning is more robust to incorrect independence assumptions than generative learning [5]. The second observation is that discriminative learning is more computationally intensive than generative learning, since it updates every entry of every CPTable on each pass and must cope with the high dimensionality of our data.

5.4 Feature Selection—Wrapper

As observed before, each protein sequence in our data set has more than a thousand features; therefore, we use the "wrapper" feature selection technique to remove less relevant features. The following graphs show how the wrapper works on our three implementations: Naïve Bayes, TAN, and Neural Nets. From Figure 5.4.1, we can see that:

• For both Naïve Bayes and TAN, the wrapper helps a lot. When 75%–85% of the features are removed, both achieve their best classification accuracy. For example, with only 25% of the features remaining, the Naïve Bayes classifier achieves an accuracy close to 80%, about 15% higher than when it is trained on all of the features.

• For the Neural Nets, the wrapper helps only on Ecoli, and even there the gain is not as significant as for Naïve Bayes or TAN. For the other two data sets (Yeast and Fly), the wrapper does not help at all.
The accuracy consistently decreases as the number of features goes down.
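The wrapper loop itself is simple to sketch. Here `score_feature` stands in for the information-content ranking and `validate` for a cross-validation run of the underlying classifier; both functions, and the toy accuracy curve peaking at five features, are invented for illustration:

```python
def wrapper_select(features, score_feature, validate, steps=(0, 25, 50, 75, 90)):
    """Rank features, drop increasing percentages of the low-ranked ones,
    revalidate each time, and keep the subset with the best accuracy."""
    ranked = sorted(features, key=score_feature, reverse=True)
    best_acc, best_subset = -1.0, ranked
    for pct in steps:
        keep = ranked[: max(1, len(ranked) * (100 - pct) // 100)]
        acc = validate(keep)
        if acc > best_acc:
            best_acc, best_subset = acc, keep
    return best_subset, best_acc

# Toy demo: validation accuracy peaks when only 5 of 20 features remain.
features = list(range(20))
subset, acc = wrapper_select(features,
                             score_feature=lambda f: f,
                             validate=lambda keep: 1 - abs(len(keep) - 5) / 20)
```

The cost is one full retraining-and-validation run per step, which is why the wrapper is expensive for slow learners such as the neural nets.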
[Figure 5.4.1 contains three line graphs, one per classifier (NB, TAN, NN), each showing the 5-fold validation accuracy on Ecoli, Yeast, and Fly as the percentage of tokens removed ranges from 0 to 100.]

Figure 5.4.1: The effect of using the wrapper for feature selection. The first graph shows how the wrapper works on NB, the second how it works on TAN, and the third how it works on the neural nets.
5.5 Miscellaneous Learning Algorithms

We experiment with four other approaches using the existing tools WEKA and Libsvm, and record the 5-fold cross-validation accuracy. Additionally, since WEKA includes a Naïve Bayes classifier, we compare their tool with our own implementation. In the following table, some entries are empty, for two reasons. The first is that the training time is too long: for example, the Rule Learner takes nearly 6 hours to train the Ecoli classifier, and since the Yeast dataset has more records and more features, it becomes impractical to continue.

Tech. \ Data    Ecoli      Yeast      Fly
Decision Tree   81.9%      79.4%      --
Rule Learner    82.66%     --         --
Naïve Bayes     67.85%     69.16%     --
SVM             85.3165%   82.4734%   78.0016%

Table 5.5.1: the validation accuracy of some other techniques

The second reason for a blank entry is that when running the WEKA code on Fly, we ran out of memory. Since we cannot modify the WEKA code, we skipped those experiments. Although we could not complete some of the tests with the existing tools, we can still gain useful insight from the results we did get.

• The accuracy of WEKA's Naïve Bayes not only validates the correctness of our implementation, but also illustrates its strength: our implementation handles all three data sets without running out of memory.

• Naïve Bayes is the worst classifier if we consider only accuracy. The other three techniques all achieve close to 80% accuracy, about 10% higher than NB. However, their execution times are much higher, as shown in a later section.

• The SVM (Support Vector Machine) not only handles all three data sets, but is also the winner among these techniques with respect to accuracy. For Ecoli, it achieves the highest accuracy, 85.3%, which is about 20% better than Naïve Bayes.
This makes it a potential alternative to the Naïve Bayes classifier, though it still consumes more execution time than Naïve Bayes.
5.6 Computational Efficiency

As we saw before, the Naïve Bayes classifier is not as accurate as the other methods, but we believe it is the most practical classifier for our task. The reason can be seen from the following table:

Classifier   Naïve Bayes   TAN       Neural Nets   Decision Tree   Rule Learner   SVM
Time         5 mins        15 mins   30 mins       1 hr            6 hrs          12 mins

Table 5.6.1: the approximate execution times of the different techniques on Ecoli

We concluded in the last section that nearly all of the other classifiers outperform Naïve Bayes with respect to accuracy. The table above suggests an interesting tradeoff: more accuracy, longer training time. Classifiers that take more than half an hour, like the Decision Tree, can hardly be considered practical for our task. For the others, if our goal is classification accuracy, then our study shows that both TAN and SVM are good choices. In particular, the SVM, as we saw before, is about 20% more accurate than Naïve Bayes, but it also takes more than twice as long to train. Overall, when we consider both criteria, Naïve Bayes currently still seems to be the most practical classifier for our task. However, TANs and SVMs look to be excellent areas for future research, especially research aimed at improving their training speed.
6. Conclusions and Future Work

6.1 Conclusions

In this course project, we have explored several machine learning techniques for classification in a specific application domain, PENCE. Though our main focus is on Bayesian network classifiers (Naïve Bayes, TAN), we have also tried several other approaches (Decision Tree, Neural Network, SVM, etc.), and we tested discriminative parameter learning for Naïve Bayes as well. Comparisons of both classification accuracy and running efficiency, in terms of execution time, are drawn from many combinations of experiments and from different angles. Based on our experimental results, we found that the harder a learner works (in terms of execution time), the better the results it tends to produce (in terms of classification accuracy); this is the trade-off between efficiency and accuracy. Taking all factors into account, we think NB plus the wrapper is a suitable solution for this application, though we are impressed by the accuracy the SVM achieves.

6.2 Future Work

One line of future work concerns feature selection, since the wrapper works quite effectively. There are many other algorithms for scaling up supervised learning; several were introduced in the last class this term and could be tried, such as the "RELIEF-F" algorithm, which draws samples at random and then adjusts the weights of features that discriminate instances from neighbors of different classes, and "VSM", which integrates feature weighting into the learning algorithm. Another possible way to reduce the feature dimensionality is to use statistical metrics and clustering techniques to cluster the feature set first, and then do the learning. Finally, given the long execution times of almost all the algorithms except the Naïve Bayesian network, speeding up the learning phase of the different algorithms is another aspect of future work.
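As a sketch of the Relief idea mentioned above (the basic two-class variant; RELIEF-F extends it to multiple classes and k nearest neighbors), with made-up data in which only the first feature discriminates the classes:

```python
import random

def relief(examples, n_samples=50, seed=0):
    """Basic Relief: sample an instance, find its nearest hit (same class)
    and nearest miss (other class), then reward features that differ on
    the miss and penalize features that differ on the hit."""
    rng = random.Random(seed)
    n_feats = len(examples[0][0])
    w = [0.0] * n_feats

    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(examples))
        x, y = examples[i]
        hit = min((e for j, e in enumerate(examples) if j != i and e[1] == y),
                  key=lambda e: dist(e[0], x))[0]
        miss = min((e for e in examples if e[1] != y),
                   key=lambda e: dist(e[0], x))[0]
        for f in range(n_feats):
            w[f] += (abs(x[f] - miss[f]) - abs(x[f] - hit[f])) / n_samples
    return w

# Feature 0 separates the classes; feature 1 is noise.
examples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "b"), ((1, 1), "b")]
w = relief(examples)
```

Features whose weight stays high discriminate between classes locally, so thresholding the weights gives a cheap filter-style alternative to the wrapper.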
Acknowledgments

The authors are grateful to Dr. Russ Greiner for his valuable comments on our project and for useful discussions relating to this work. Jie Cheng and Wei Zhou's previous work on Bayesian networks and discriminative learning helped our work a lot. We also thank Dr. Duane Szafron and Dr. Paul Lu for their support with regard to the PENCE code and data, and Roman Eisner for helping us with a number of detailed problems. And perhaps most of all, we would like to thank the good people at Wendy's for providing us with tasty hamburgers at a reasonable price during the ungodly hours of the night while we worked late.
7. References

1. N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29:131–163, 1997.
2. G. Brassard and P. Bratley. Fundamentals of Algorithmics. Prentice Hall, 1996.
3. T. Mitchell. Machine Learning. McGraw Hill, 1997.
4. J. Cheng and R. Greiner. Comparing Bayesian Network Classifiers. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-99), Sweden, August 1999.
5. R. Greiner and W. Zhou. Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. In AAAI-02, Canada, 2002.
6. D. E. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, 1998.
8. Appendix

                               Naïve Bayes (Generative)     Naïve Bayes (Discriminative)
Percentage of tokens removed   Ecoli    Yeast    Fly        Ecoli    Yeast    Fly
  0                            67.8     69.1     68.3
  5                            68.5     69.3     69.0
 10                            69.0     69.5     69.2
 15                            69.4     69.3     69.4
 20                            70.1     70.0     69.8
 25                            70.7     70.3     70.1
 30                            71.3     70.7     70.4
 35                            71.8     70.9     70.9
 40                            72.4     70.8     70.9
 45                            73.9     71.1     71.4
 50                            74.5     71.2     71.2
 55                            75.0     71.4     71.1
 60                            75.6     71.6     71.1
 65                            76.1     71.5     70.9
 70                            76.6     71.3     70.0
 75                            77.3     71.6     69.5
 80                            77.1     71.1     69.1
 85                            77.0     71.2     68.1
 90                            75.9     69.2     65.7
 95                            71.7     66.1     61.7
 99                            40.4     60.2     39.78
100                            0        0        0

Table 1: empirical results (accuracy) of two approaches to learning the classifier with the wrapper, over the three datasets.
                               TAN                          Neural Nets
Percentage of tokens removed   Ecoli    Yeast    Fly        Ecoli    Yeast    Fly
  0                            67.8     69.4     68.7       85.7     87.1     76.3
  5                            68.3     69.7     69.0       89.4     86.7     73.9
 10                            69.0     70.1     69.3       88.8     84.5     72.0
 15                            69.4     70.3     69.8       86.6     83.4     74.9
 20                            70.0     70.7     70.2       82.7     82.4     68.3
 25                            70.6     70.8     70.4       84.5     78.8     69.0
 30                            71.3     71.0     70.6       78.9     78.4     67.7
 35                            71.7     71.3     71.0       78.7     77.4     64.3
 40                            72.4     71.3     71.4       68.3     75.1     58.5
 45                            73.8     71.6     71.7       72.1     73.7     56.7
 50                            74.4     71.6     71.5       68.2     71.8     54.8
 55                            74.8     71.9     71.4       64.7     69.2     51.2
 60                            75.5     72.1     71.2       68.5     63.1     45.7
 65                            76.1     72.2     71.2       65.3     63.3     50.8
 70                            76.6     71.7     70.3       63.4     58.8     43.1
 75                            77.2     72.0     70.0       59.4     51.8     42.7
 80                            77.0     71.4     69.3       55.8     47.5     37.5
 85                            77.0     71.1     68.8       49.3     42.6     28.0
 90                            75.9     69.4     66.3       33.2     30.2     25.6
 95                            71.7     66.6     62.3       20.1     26.6     24.3
 99                            40.9     60.3     40.1       13.8     16.3     11.0
100                            0        0        0          0        0        0

Table 2: empirical results (accuracy) of two approaches to learning the classifier with the wrapper, over the three datasets.