
Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Louis Eisenberg
Stanford University M.S. student
PO Box 18199, Stanford, CA 94309
650-269-9444

ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper seeks to compare the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task that has been refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive emails continues to grow. Though classification of desired messages by content is still quite rare, many users are the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering – a binary decision – filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system. My personal experience with POPFile suggests that it can achieve respectable results, but it leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering the more than 2000 non-spam messages that I received in the first quarter of 2004 as candidates. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve "buckets" that I arbitrarily enumerated (see Table 1). I then split each bucket and allocated half of the messages to the training set and half to the test set.
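The bucket-by-bucket half/half split described above can be sketched as follows. This is a hypothetical helper, not the author's actual tooling; the bucket codes follow Table 1 and the message strings are placeholders:

```python
import random

def stratified_half_split(messages_by_bucket, seed=0):
    """Split each bucket's messages into equal train/test halves."""
    rng = random.Random(seed)
    train, test = {}, {}
    for bucket, msgs in messages_by_bucket.items():
        msgs = msgs[:]            # copy so the caller's list is untouched
        rng.shuffle(msgs)         # avoid ordering bias (e.g. by date)
        mid = len(msgs) // 2
        train[bucket] = msgs[:mid]
        test[bucket] = msgs[mid:]
    return train, test

# Toy example with two of the paper's bucket codes
data = {"p": [f"personal-{i}" for i in range(6)],
        "s": [f"sports-{i}" for i in range(4)]}
train, test = stratified_half_split(data)
```

Splitting within each bucket (rather than over the pooled messages) keeps the class proportions identical in the training and test sets.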
As input to POPFile, I kept the messages in Eudora mailbox format. For TiMBL, I had to convert each message to a feature vector, as described in section 3.

Code  Size*  Description
ae      86   academic events, talks, seminars, etc.
bslf    63   buy, sell, lost, found
c      145   courses, course announcements, etc.
hf      43   humorous forwards
na      37   newsletters, articles
p      415   personal
pa      53   politics, advocacy
se     134   social events, parties
s      426   sports, intramurals, team-related
ua      13   University administrative
w      164   websites, accounts, e-commerce, support
wb      36   work, business

* training and test combined

Table 1. Classification buckets

POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which are consequences of the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schutze [4]): 1. each document can be represented as a bag of words, i.e. the order and syntax of words is completely ignored; 2. in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of appropriately capturing any conditional dependencies between words, guaranteeing a certain level of imprecision; however, in many cases this flaw is relatively minor and does not prevent the classifier from performing well.

To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations. To train the classifier I fed the mbx files (separated by category) directly to the provided utility script. For testing, I split each test set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided classification script, which reads in a message and outputs the same message with POPFile's classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran another Java program, popfilescore, to tabulate the results and generate a confusion matrix.

k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL. I installed and ran the software on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking "votes" from the k existing instances that are closest/most similar to the new instance. The TiMBL reference guide [5] explains:

  Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments; the from, to, and subject headers; and the body. For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers. I also ignored any words of length less than 3 letters or greater than 20 letters and any words that appeared on POPFile's brief stopwords list. Altogether this resulted in each document in the data set being represented as a vector of 15,981 features. For the attachments, subject, and body, I used tf.idf weighting according to the equation:

  weight(i,j) = (1 + log(tf(i,j))) * log(N / df(i))   if tf(i,j) >= 1

where i is the term index, j is the document index, N is the number of documents, tf(i,j) is the frequency of term i in document j, and df(i) is the number of documents containing term i. For the to and from fields, each feature was a binary value indicating the presence or absence of a word or email address.

The Java program mbx2featurevectors parses the training or test set and generates a file containing all of the feature vectors, represented in TiMBL's Sparse format.

TiMBL processes the training and test data in response to a single command. It has a number of command-line options with which I experimented in an attempt to extract better accuracy. Among them:

  • k, the number of neighbors to consider when classifying a test point: the literature suggests that anywhere between one and a handful of neighbors may be optimal for this type of task
  • w, the feature weighting scheme: the classifier attempts to learn which features have more relative importance in determining the classification of an instance; this can be absent (all features get equal weight) or based on information gain or other slight variations such as gain ratio and shared variance
  • m, the distance metric: how to calculate the nearness of two points based on their features; options that I tried included overlap (basic equals or not equals for each feature), modified value difference metric (MVDM), and Jeffrey divergence
  • d, the class vote weighting scheme for neighbors: this can be simple majority (all have equal weight) or various alternatives, such as Inverse Linear and Inverse Distance, that assign higher weight to those neighbors that are closer to the instance

For distance metrics, MVDM and Jeffrey divergence are similar and, on this task with its numeric feature vectors, both are clearly preferable to basic overlap, which draws no distinction between two values that are almost but not quite equivalent and two values that are very far apart. The other options have no clearly superior setting a priori, so I relied on the advice of the TiMBL reference guide and the results of my various trial runs.

RESULTS/CONCLUSIONS

The confusion matrices for the most successful TiMBL run and for POPFile are reproduced in Tables 2 and 3. Figure 1 compares the accuracy scores of the two algorithms on each category. Table 4 lists accuracy scores for various combinations of TiMBL options. The number of TiMBL runs possible was limited considerably by the length of time that each run takes – up to several hours even on a fast machine, depending greatly on the exact options specified.
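As a concrete illustration of the bag-of-words model described in the POPFILE section, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. This is a generic textbook version, not POPFile's actual implementation; the toy documents and labels are invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify_nb(tokens, class_docs, word_counts, vocab):
    total_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for label, ndocs in class_docs.items():
        lp = math.log(ndocs / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:                  # conditional independence assumption
            # add-one (Laplace) smoothing over the vocabulary
            lp += math.log((word_counts[label][tok] + 1) /
                           (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented toy data using two of the paper's bucket codes (s = sports, p = personal)
docs = [("meet at the gym tonight".split(), "s"),
        ("intramural game score".split(), "s"),
        ("dinner with family".split(), "p"),
        ("call mom about dinner".split(), "p")]
model = train_nb(docs)
print(classify_nb("game at the gym".split(), *model))  # prints: s
```

Word order never enters the computation, which is exactly the bag-of-words simplification the paper describes.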
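The k-NN side can be illustrated the same way: tf.idf feature vectors built with the weighting formula above, plus inverse-distance voting among the k nearest neighbors. This is a generic sketch using Euclidean distance over sparse vectors, not TiMBL's optimized implementation, and the toy corpus is invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. weight = (1 + log tf) * log(N/df) for tf >= 1."""
    N = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vecs = [{t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf}
            for tf in (Counter(tokens) for tokens in docs)]
    return vecs, df

def vectorize(tokens, df, N):
    """Vectorize a query with the training df; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in df)
    return {t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf}

def distance(u, v):
    """Euclidean distance over sparse dict vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def knn_classify(query_vec, train_vecs, labels, k=3, eps=1e-9):
    """Inverse-distance voting among the k nearest training vectors."""
    neighbors = sorted(((distance(query_vec, v), lab)
                        for v, lab in zip(train_vecs, labels)))[:k]
    votes = Counter()
    for d, lab in neighbors:
        votes[lab] += 1.0 / (d + eps)   # closer neighbors vote more strongly
    return votes.most_common(1)[0][0]

# Invented toy corpus (s = sports, p = personal)
train_docs = ["goal match team".split(), "team match score".split(),
              "dinner family mom".split(), "mom dinner call".split()]
labels = ["s", "s", "p", "p"]
vecs, df = tfidf_vectors(train_docs)
q = vectorize("match team goal".split(), df, len(train_docs))
print(knn_classify(q, vecs, labels, k=3))  # prints: s
```

Unlike Naïve Bayes, no model is abstracted at training time: classification defers entirely to the stored instances, which is the memory-based behavior the TiMBL quote describes.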
      ae  bs   c  hf  na    p  pa  se    s  ua   w  wb
ae     3   0   0   0   0    1   0  25   14   0   0   0
bs     0   5   0   0   0    3   0   4   19   0   0   0
c      0   1  38   0   0   12   0   8   13   0   0   0
hf     0   1   0   5   0   10   0   0    5   0   0   0
na     1   1   0   0   5   11   0   0    0   0   0   0
p      0   0   0   2   0  189   0   0   15   0   1   0
pa     0   0   0   0   0    2  13   6    5   0   0   0
se     0   2   0   1   0    8   0  27   29   0   0   0
s      0   1   0   0   0   28   0   6  178   0   0   0
ua     0   0   0   0   0    1   0   0    0   5   0   0
w      2   0   0   0   0   41   0   0   12   0  27   0
wb     0   0   0   0   0   18   0   0    0   0   0   0

Table 2. Confusion matrix for best TiMBL run

      ae  bs   c  hf  na    p  pa  se    s  ua   w  wb
ae    38   0   1   0   0    0   0   0    2   0   2   0
bs     0  10   0   0   0    0   0   0   21   0   0   0
c      8   3  51   0   0    4   1   0    2   1   0   0
hf     0   0   0   7   0    7   1   1    4   0   0   0
na     0   0   0   1  32    0   0   0    0   0   0   0
p      0  10   3   8   0  140   2   7   20   0   4   4
pa     3   1   0   0   0    0  18   0    2   0   1   0
se     0   5   2   1   0    3   0  33   20   0   0   0
s      0  14   3   2   0   15   0   2  173   0   0   3
ua     0   0   0   0   0    0   0   0    0   6   0   0
w      1   0   7   0   0    4   1   2    4   2  59   0
wb     0   0   0   1   0    2   0   0    0   0   0  14

Table 3. Confusion matrix for POPFile

As the tables and figure indicate, POPFile clearly outperformed even the best run by TiMBL. POPFile's overall accuracy was 72.7%, compared to only 61.1% for the best TiMBL trial. In addition, POPFile's accuracy was well over 60% in almost all of the categories; by contrast, the k-NN system only performed well in three categories. Interestingly, it performed best in the two largest categories, personal and sports – in fact, in those two categories it was more accurate than POPFile. Apparently it succeeded in distinguishing those categories from the rest of the buckets and from each other, but failed to pick up on most of the other important differences across buckets.
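The tabulation step that popfilescore performs – building a confusion matrix and computing overall accuracy from it – can be sketched generically as follows (this is not the author's Java program; the example labels and predictions are invented):

```python
def confusion_matrix(actual, predicted, labels):
    """Rows = actual bucket, columns = predicted bucket."""
    cm = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

def accuracy(cm):
    """Fraction of messages on the diagonal (correctly classified)."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

labels = ["p", "s"]
cm = confusion_matrix(["p", "p", "s", "s"], ["p", "s", "s", "s"], labels)
print(accuracy(cm))  # prints: 0.75
```

Reading across a row shows where messages from one bucket were misfiled, which is how the heavy p and s columns in Table 2 reveal TiMBL's tendency to dump messages into the two largest buckets.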
m        w           k   d            accuracy
MVDM     gain ratio   9  inv. dist.   51.0%
overlap  none         1  majority     54.9%
overlap  inf. gain   15  inv. dist.   53.7%
MVDM     shared var   3  inv. linear  61.1%
Jeffrey  shared var   5  inv. linear  60.2%
overlap  shared var   9  inv. linear  58.9%
MVDM     gain ratio  21  inv. dist.   49.4%
MVDM     inf. gain    7  inv. linear  57.4%
MVDM     shared var   1  inv. dist.   61.0%
MVDM     shared var   5  majority     54.6%

Table 4. Sample of TiMBL trials

[Figure 1. Accuracy by category: bar chart comparing TiMBL and POPFile accuracy (0% to 100%) for each of the twelve buckets.]

The various TiMBL runs provide evidence for a few minor insights about how to get the most out of the k-NN algorithm. The overwhelming conclusion is that shared variance is far superior to the other weighting schemes for this task. Based on the explanation given in the TiMBL documentation, this performance disparity is likely a reflection of the ability of shared variance (and chi-squared, which is very similar) to avoid a bias toward features with more values – a significant problem with gain ratio. The results also suggest that k should be a small number – the highest values of k gave the worst results. The effect of the m and d options is unclear, though simple majority voting seems to perform worse than inverse distance and inverse linear.

OTHER RESEARCH

A vast amount of research already exists on this and similar topics. Some people, e.g. Rennie et al. [6], have investigated ways to overcome the faulty Naïve Bayesian assumption of conditional independence. Kiritchenko and Matwin [7] found that support vector machines are superior to Naïve Bayesian systems when much of the training data is unlabeled. Other researchers have attempted to use semantic information to improve accuracy [8]. In addition to the two models discussed in this paper, there exist many other options for text classification: support vector machines, maximum entropy and logistic models, decision trees and neural networks, for example.
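The option sweep summarized in Table 4 amounts to a small grid search over TiMBL's m, w, k, and d settings. A generic sketch follows; the scoring function here is an invented stand-in, whereas in practice each combination would require training and evaluating the classifier:

```python
from itertools import product

def grid_search(evaluate, grid):
    """evaluate(settings) -> accuracy; grid: option name -> candidate values."""
    names = list(grid)
    best, best_acc = None, -1.0
    for combo in product(*(grid[n] for n in names)):
        settings = dict(zip(names, combo))
        acc = evaluate(settings)
        if acc > best_acc:
            best, best_acc = settings, acc
    return best, best_acc

grid = {"m": ["overlap", "MVDM", "Jeffrey"],
        "w": ["none", "gain ratio", "shared var"],
        "k": [1, 3, 5, 9],
        "d": ["majority", "inv. dist.", "inv. linear"]}

# Stand-in scorer that simply prefers one fixed combination; a real
# evaluate() would run the classifier on the test set (hours per run).
def fake_eval(s):
    target = {"m": "MVDM", "w": "shared var", "k": 3, "d": "inv. linear"}
    return sum(s[o] == target[o] for o in s) / 4

best, acc = grid_search(fake_eval, grid)
```

The combinatorial cost is the point: with runs taking hours each, only a sample of the full grid (as in Table 4) is feasible.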
It is also important to recognize the impact of the original construction of the feature vectors. Perhaps the k-NN system's poor performance was a result of unwise choices in mbx2featurevectors: focusing on the wrong headers, not parsing symbols and numbers as elegantly as possible, not trying a bigram or trigram model on the message body, choosing a poor tf.idf formula, etc.

REFERENCES

[1] POPFile.
[2] SpamAssassin.
[3] TiMBL.
[4] Manning, Christopher and Hinrich Schutze. Foundations of Statistical Natural Language Processing. 2000.
[5] TiMBL reference guide.
[6] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.
[7] Svetlana Kiritchenko and Stan Matwin. Email classification with co-training. Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative Research. 2001.
[8] Nicolas Turenne. Learning Semantic Classes for Improving Email Classification. Biométrie et Intelligence. 2003.