A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization




Thorsten Joachims
Universität Dortmund, Fachbereich Informatik, Lehrstuhl 8
Baroper Str. 301, 44221 Dortmund, Germany
thorsten@ls8.informatik.uni-dortmund.de

Abstract

The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.

1 Introduction

Text categorization is the process of grouping documents into different categories or classes. With the amount of online information growing rapidly, the need for reliable automatic text categorization has increased. Text categorization techniques are used, for example, to build personalized netnews filters which learn about the news-reading preferences of a user [Lang, 1995]. They are used to index news stories [Hayes et al., 1988] or to guide a user's search on the World Wide Web [Joachims et al., 1997].

One of the most widely applied learning algorithms for text categorization is the Rocchio relevance feedback method [Rocchio, 1971] developed in information retrieval. Originally designed for optimizing queries from relevance feedback, the algorithm can be adapted to text categorization and routing problems. Although the algorithm is intuitive, it has a number of problems which - as I will show - lead to comparably low classification accuracy: (1) The objective of the Rocchio algorithm is to maximize a particular functional introduced in section 3.2.1. Nevertheless Rocchio does not show why maximizing this functional should lead to a high classification accuracy. (2) Heuristic components of the algorithm offer many design choices and there is little guidance when applying this algorithm to a new domain. (3) The algorithm was developed and optimized for relevance feedback in information retrieval; it is not clear which heuristics will work best for text categorization.

The major heuristic component of the Rocchio algorithm is the TFIDF (term frequency / inverse document frequency) [Salton, Buckley, 1988] word weighting scheme. Different flavors of this heuristic lead to a multitude of different algorithms. Due to this heuristic, this class of algorithms will be called TFIDF classifiers in the following.

A more theoretically founded approach to text categorization is provided by naive Bayes classifiers. These algorithms use probabilistic models for classification and allow the explicit statement of simplifying assumptions.

The contribution of this paper is a probabilistic analysis of a TFIDF classifier. This analysis makes the implicit assumptions of the TFIDF classifier as explicit as for the naive Bayes classifier. Furthermore it provides insight into how the TFIDF algorithm can be improved, leading to a probabilistic version of the TFIDF algorithm, called PrTFIDF. PrTFIDF optimizes the different design choices of the TFIDF algorithm as a whole and gives clear recommendations on how to set the parameters involved. Empirical results on six categorization tasks show that PrTFIDF not only enables a better theoretical understanding of the TFIDF algorithm, but also performs better in practice, without being conceptually or computationally more complex.
This paper is structured as follows. Section 2 introduces the definition of text categorization used throughout this paper. A TFIDF classifier and a naive Bayes classifier are described in section 3. Section 4 presents the probabilistic analysis of the TFIDF classifier and states its implications. Empirical results and conclusions can be found in sections 5 and 6.

2 Text Categorization

The goal of text categorization is the classification of documents into a fixed number of predefined categories. The working definition used throughout this paper assumes that each document d is assigned to exactly one category. To put it more formally, there is a set of classes C and a set of training documents D. Furthermore, there is a target concept T: D -> C which maps documents to a class. T(d) is known for the documents in the training set. Through supervised learning, the information contained in the training examples can be used to find a model or hypothesis H: D -> C which approximates T. H(d) is the function defining the class to which the learned hypothesis assigns document d; it can be used to classify new documents. The objective is to find a hypothesis which maximizes accuracy, i.e. the percentage of times H and T agree.

3 Learning Methods for Text Categorization

This section describes the general framework for the experiments presented in this paper and defines the particular TFIDF classifier and the naive Bayes classifier used. The TFIDF classifier provides the basis for the analysis in section 4.

3.1 Representation

The representation of a problem has a strong impact on the generalization accuracy of a learning system. For categorization, a document, which typically is a string of characters, has to be transformed into a representation which is suitable for the learning algorithm and the classification task. IR research suggests that words work well as representation units and that their ordering in a document is of minor importance for many tasks. This leads to a representation of documents as bags of words.

This bag-of-words representation is equivalent to an attribute-value representation as used in machine learning. Each distinct word corresponds to a feature, with the number of times the word occurs in the document as its value. Figure 1 shows an example feature vector for a particular document. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least m (e.g. m = 3) times. The set of considered features (i.e. words) will be called F.

[Figure 1: Bag-of-words representation in an attribute-value style. A comp.graphics posting asking for QuickTime specs is mapped to word counts, e.g. specs: 3, quicktime: 2, references: 1, unix: 1, and baseball: 0, hockey: 0, car: 0, clinton: 0, space: 0, computer: 0, graphics: 0.]

3.2 Learning Algorithms

3.2.1 TFIDF Classifier

This type of classifier is based on the relevance feedback algorithm originally proposed by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991]. Due to its heuristic components, there are a number of similar algorithms corresponding to the particular choice of those heuristics. The three main design choices are

- the word weighting method,
- the document length normalization, and
- the similarity measure.

An overview of some heuristics is given in [Salton, Buckley, 1988]. In the following, the most popular combination will be used, known as "tfc": "tf" word weights [Salton, Buckley, 1988], document length normalization using Euclidian vector length, and cosine similarity.

Originally developed for information retrieval, the algorithm returns a ranking of documents without providing a threshold to define a decision rule for class membership. Therefore the algorithm has to be adapted to be used for text categorization. The variant presented here seems to be the most straightforward adaptation of the Rocchio algorithm to text categorization and to domains with more than two categories. The algorithm builds on the following representation of documents.
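As a concrete illustration, the bag-of-words mapping described in section 3.1 can be sketched in a few lines of Python. The function name, the tokenizer, and the toy feature set below are mine, not part of the paper:

```python
from collections import Counter
import re

def bag_of_words(text, features):
    """Map a document (a string) to term-frequency counts over a fixed feature set F."""
    words = re.findall(r"[a-z]+", text.lower())          # crude word tokenizer
    counts = Counter(w for w in words if w in features)  # count only feature words
    return {w: counts.get(w, 0) for w in features}

# Hypothetical feature set F (in the paper: all words occurring
# at least m times in the training data).
F = {"specs", "quicktime", "unix", "baseball", "hockey"}
vec = bag_of_words("Need specs on Apple QT. I need the specs for QuickTime "
                   "on a Unix system.", F)
```

Words outside F (here "apple", "system", ...) are simply dropped, mirroring the feature selection by minimum frequency described above.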
Each document d is represented as a vector d = (d_1, ..., d_|F|), so that documents with similar content have similar vectors (according to a fixed similarity metric). Each element d_i represents a distinct word w_i. The value d_i for a document d is calculated as a combination of the statistics TF(w_i, d) and DF(w_i) [Salton, 1991]. The term frequency TF(w_i, d) is the number of times word w_i occurs in document d, and the document frequency DF(w_i) is the number of documents in which word w_i occurs at least once. The inverse document frequency IDF(w_i) can be calculated from the document frequency.

    IDF(w_i) = log( |D| / DF(w_i) )    (1)

Here, |D| is the total number of documents. Intuitively, the inverse document frequency of a word is low if it occurs in many documents and is highest if the word occurs in only one. The so-called weight d_i of word w_i in document d is then

    d_i = TF(w_i, d) * IDF(w_i)    (2)

This word weighting heuristic says that a word w_i is an important indexing term for document d if it occurs frequently in it (the term frequency is high). On the other hand, words which occur in many documents are rated less important indexing terms due to their low inverse document frequency.

Learning is achieved by combining document vectors into a prototype vector c_j for each class C_j. First, both the normalized document vectors of the positive examples for a class and those of the negative examples for the class are summed up. The prototype vector is then calculated as a weighted difference of each.

    c_j = alpha * (1/|C_j|) * sum_{d in C_j} d/||d||  -  beta * (1/|D - C_j|) * sum_{d in D - C_j} d/||d||    (3)

alpha and beta are parameters that adjust the relative impact of positive and negative training examples. As recommended in [Buckley et al., 1994], alpha = 16 and beta = 4 will be used in the following. C_j is the set of training documents assigned to class j, and ||d|| denotes the Euclidian length of a vector d. Additionally, Rocchio requires that negative elements of the vector c_j are set to 0. Using the cosine as a similarity metric and alpha = beta = 1, Rocchio shows that each prototype vector maximizes the mean similarity of the positive training examples with the prototype vector c_j minus the mean similarity of the negative training examples with the prototype vector c_j:

    (1/|C_j|) * sum_{d in C_j} cos(c_j, d)  -  (1/|D - C_j|) * sum_{d in D - C_j} cos(c_j, d)    (4)

Nevertheless, it is unclear if or how maximizing this functional connects to the accuracy of the resulting classifier.

The resulting set of prototype vectors, one vector for each class, represents the learned model. This model can be used to classify a new document d'. Again the document is represented as a vector d' using the scheme described above. To classify d', the cosines of the prototype vectors c_j with d' are calculated. d' is assigned to the class with which its document vector has the highest cosine.

    H_TFIDF(d') = argmax_{C_j in C} cos(c_j, d')    (5)

argmax_x f(x) returns the argument x for which f(x) is maximum, and H_TFIDF(d') is the category to which the algorithm assigns document d'. The algorithm can be summarized in the following decision rule:

    H_TFIDF(d') = argmax_{C_j in C} (c_j / ||c_j||) . (d' / ||d'||)    (6)
                = argmax_{C_j in C} ( sum_{i=1}^{|F|} c_{ji} * d'_i ) / sqrt( sum_{i=1}^{|F|} c_{ji}^2 )    (7)

In (7) the normalization with the length of the document vector is left out, since it does not influence the argmax.

3.2.2 Naive Bayes Classifier

The classifier presented in this section uses a probabilistic model of text. Although this model is a strong simplification of the true process by which text is generated, the hope is that it still captures most of the important characteristics. In the following, word-based unigram models of text will be used, i.e. words are assumed to occur independently of the other words in the document. There are |C| such models, one for each category. All documents assigned to a particular category are assumed to be generated according to the model associated with this category.

The following describes one approach to estimating Pr(C_j|d'), the probability that a document d' is in class C_j. Bayes' rule says that to achieve the highest classification accuracy, d' should be assigned to the class for which Pr(C_j|d') is highest.

    H_BAYES(d') = argmax_{C_j in C} Pr(C_j|d')    (8)

Pr(C_j|d') can be split up by considering documents separately according to their length l.

    Pr(C_j|d') = sum_{l=1}^{inf} Pr(C_j|d', l) * Pr(l|d')    (9)

Pr(l|d') equals one for the length l' of document d'
and is zero otherwise. After applying Bayes' theorem to Pr(C_j|d', l), we can therefore write:

    Pr(C_j|d') = ( Pr(d'|C_j, l') * Pr(C_j|l') ) / ( sum_{C' in C} Pr(d'|C', l') * Pr(C'|l') )    (10)

Pr(d'|C_j, l') is the probability of observing document d' in class C_j given its length l'. Pr(C_j|l') is the prior probability that a document of length l' is in class C_j. In the following we will assume that the category of a document does not depend on its length, so Pr(C_j|l') = Pr(C_j). An estimate for Pr(C_j) can be calculated from the fraction of training documents that is assigned to class C_j.

    Pr(C_j) = |C_j| / sum_{C' in C} |C'| = |C_j| / |D|    (11)

|C_j| denotes the number of training documents in class C_j, and |D| is the total number of documents.

The estimation of Pr(d'|C_j, l') is more difficult. Pr(d'|C_j, l') is the probability of observing a document d' in class C_j given that we consider only documents of length l'. Since there is - even for a simplifying representation as used here - a huge number of different documents, it is impossible to collect a sufficiently large number of training examples to estimate this probability without prior knowledge or further assumptions. In our case the estimation becomes possible due to the way documents are assumed to be generated. The unigram models introduced above imply that a word's occurrence is only dependent on the class the document comes from, but that it occurs independently [1] of the other words in the document and that it is not dependent on the document length. So Pr(d'|C_j, l') can be written as:

    Pr(d'|C_j, l') = prod_{i=1}^{|d'|} Pr(w_i|C_j)    (12)

w_i ranges over the sequence of words in document d' which are elements of the feature set F. |d'| is the number of words in document d'. The estimation of Pr(d'|C_j) is reduced to estimating each Pr(w_i|C_j) independently. A Bayesian estimate is used for Pr(w_i|C_j):

    Pr(w_i|C_j) = ( 1 + TF(w_i, C_j) ) / ( |F| + sum_{w' in F} TF(w', C_j) )    (13)

TF(w, C_j) is the overall number of times word w occurs within the documents in class C_j. This estimator, which is often called the Laplace estimator, is suggested in [Vapnik, 1982] (pages 54-55). It assumes that the observation of each word is a priori equally likely. I found that this Bayesian estimator works well in practice, since it does not falsely estimate probabilities to be zero.

The following is the resulting decision rule if equations (8), (10) and (12) are combined:

    H_BAYES(d') = argmax_{C_j in C} ( Pr(C_j) * prod_{i=1}^{|d'|} Pr(w_i|C_j) ) / ( sum_{C' in C} Pr(C') * prod_{i=1}^{|d'|} Pr(w_i|C') )    (14)
                = argmax_{C_j in C} ( Pr(C_j) * prod_{w in F} Pr(w|C_j)^TF(w,d') ) / ( sum_{C' in C} Pr(C') * prod_{w in F} Pr(w|C')^TF(w,d') )    (15)

If Pr(C_j|d') is not needed as a measure of confidence, the denominator can be left out, since it does not change the argmax.

[1] The weaker assumption of "linked dependence" is actually sufficient [Cooper, 1991], but is not considered here for simplicity.

4 PrTFIDF: A Probabilistic Classifier Derived from TFIDF

In the following I will analyze the TFIDF classifier in a probabilistic framework. I will propose a classifier, called PrTFIDF, and then show its relationship to the TFIDF algorithm. In terms of the design choices listed above, I will show that the PrTFIDF algorithm is equivalent to a TFIDF classifier using the following settings:

- the word weighting mechanism uses a refined IDF weight especially adapted to text categorization,
- document length normalization is done using the number of words, and
- the similarity measure is the inner product.

Other researchers have already proposed theoretical interpretations of the vector space retrieval model [Bookstein, 1982] [Wang et al., 1992] and of the TFIDF word weighting scheme [Wong, Yao, 1989] [Wu, Salton, 1981]. However, their work analyzes only parts of the TFIDF algorithm and is based on information retrieval instead of on text categorization.

4.1 The PrTFIDF Algorithm

The naive Bayes classifier proposed in the previous section provided an estimate of the probability Pr(C_j|d') that document d' is in class C_j, making the simplifying assumption of word independence. The PrTFIDF algorithm uses a different way of approximating Pr(C_j|d'), inspired by the "retrieval with probabilistic indexing" (RPI) approach proposed in [Fuhr, 1989].
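Before turning to the PrTFIDF derivation, the naive Bayes classifier of section 3.2.2 can be made concrete in code. The following Python sketch implements the estimates (11) and (13) and the decision rule (15) in log space (the denominator is omitted since it does not change the argmax); the function names and the toy training data are hypothetical, not from the paper:

```python
import math
from collections import Counter

def train_bayes(docs):
    """docs: list of (word_list, label) pairs. Returns class priors (11)
    and Laplace-smoothed word probabilities (13)."""
    F = {w for words, _ in docs for w in words}
    classes = {c for _, c in docs}
    prior = {c: sum(1 for _, c2 in docs if c2 == c) / len(docs) for c in classes}
    tf = {c: Counter() for c in classes}            # TF(w, C_j)
    for words, c in docs:
        tf[c].update(w for w in words if w in F)
    pw = {c: {w: (1 + tf[c][w]) / (len(F) + sum(tf[c].values())) for w in F}
          for c in classes}
    return prior, pw, F

def classify_bayes(words, prior, pw, F):
    """Decision rule (15) in log space, denominator left out."""
    def score(c):
        return math.log(prior[c]) + sum(math.log(pw[c][w]) for w in words if w in F)
    return max(prior, key=score)

# Toy illustration (hypothetical two-class data).
docs = [("ball game score".split(), "sport"),
        ("stock market price".split(), "finance"),
        ("game win".split(), "sport")]
prior, pw, F = train_bayes(docs)
label = classify_bayes("ball game".split(), prior, pw, F)
```

Thanks to the Laplace estimator, words unseen in a class still receive non-zero probability, so a single unknown word cannot veto a class.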
In this approach, a set of descriptors X is used to represent the content of documents. A descriptor x is assigned to a document d with a certain probability Pr(x|d). Using the theorem of total probability in line (16) and Bayes' theorem in line (17), we can write:

    Pr(C_j|d) = sum_{x in X} Pr(C_j|x, d) * Pr(x|d)    (16)
              = sum_{x in X} ( Pr(d|C_j, x) / Pr(d|x) ) * Pr(C_j|x) * Pr(x|d)    (17)

To make the estimation tractable, the simplifying assumption that Pr(d|C_j, x) = Pr(d|x) is made now.

    Pr(C_j|d) ≈ sum_{x in X} Pr(C_j|x) * Pr(x|d)    (18)

The validity of the assumption depends on the classification task and the choice of the set of descriptors X. It states that descriptor x provides enough information about d, so that no information about document d is gained by taking its category C_j into account.

As mentioned above, the set of descriptors X is part of the design. A pragmatic choice for X used in the following is to consider all bags with n words from the feature set F as potential descriptors; e.g., for n = 3 these are all bags containing three words from F. The number n of words is a parameter which controls the quality of the approximation versus the complexity of the estimation.

Another way of looking at equation (18), especially suited for the choice of X considered here, is the following. Pr(C_j|d) is approximated by the expectation of Pr(C_j|x), where x consists of a sequence of n words drawn randomly from document d. For both interpretations the underlying assumption is that text documents are highly redundant with respect to the classification task and that any sequence of n words from the document is equally sufficient for classification. For example, when classifying documents according to whether they are cooking recipes or not, it is probably equally sufficient to know either of the sentences from the document. For n = |d|, Pr(C_j|d) equals Pr(C_j|x), but with decreasing n this simplifying assumption (like the independence assumption for the naive Bayes classifier) will be violated in practice. Nevertheless this simplification is worth trying as a starting point.

In the following the simplest case, namely n = 1, will be used and will lead to a TFIDF classifier like the one introduced in section 3.2.1. For n = 1, line (18) can be written as

    Pr(C_j|d) ≈ sum_{w in F} Pr(C_j|w) * Pr(w|d)    (19)

It remains to estimate the two probabilities from line (19). Pr(w|d) can be estimated from the representation of document d.

    Pr(w|d) = TF(w, d) / sum_{w' in F} TF(w', d) = TF(w, d) / |d|    (20)

|d| denotes the number of words in document d. Pr(C_j|w), the remaining part of equation (19), is the probability that C_j is the correct category of d given that we only know the randomly drawn word w from d. Bayes' formula can be used to rewrite Pr(C_j|w):

    Pr(C_j|w) = ( Pr(w|C_j) * Pr(C_j) ) / ( sum_{C' in C} Pr(w|C') * Pr(C') )    (21)

As in the previous section, Pr(C_j) can be estimated from the fraction of the training documents that are assigned to class C_j.

    Pr(C_j) = |C_j| / sum_{C' in C} |C'| = |C_j| / |D|    (22)

Finally, Pr(w|C_j) can be estimated as

    Pr(w|C_j) = (1/|C_j|) * sum_{d in C_j} Pr(w|d)    (23)

The resulting decision rule for PrTFIDF is

    H_PrTFIDF(d') = argmax_{C_j in C} sum_{w in F} ( Pr(w|C_j) * Pr(C_j) / sum_{C' in C} Pr(w|C') * Pr(C') ) * Pr(w|d')    (24)

4.2 The Connection between TFIDF and PrTFIDF

This section will show the relationship of the PrTFIDF classification rule to the TFIDF algorithm from section 3.2.1. In the following I will start with the decision rule for PrTFIDF and then transform it into the shape of a TFIDF classifier. From equation (24) we have

    H_PrTFIDF(d') = argmax_{C_j in C} sum_{w in F} ( Pr(C_j) * Pr(w|C_j) / sum_{C' in C} Pr(C') * Pr(w|C') ) * Pr(w|d')    (25)

The term sum_{C' in C} Pr(C') * Pr(w|C') in equation (25) can be re-expressed using a modified version of the inverse document frequency IDF(w). The definition of inverse document frequency as stated in section 3.2.1 was

    IDF(w) = log( |D| / DF(w) )    (26)
    DF(w)  = sum_{d in D} [ 1 if d contains w, 0 otherwise ]    (27)

I now introduce a refined version of IDF(w) suggested by the PrTFIDF algorithm:

    IDF'(w) = sqrt( |D| / DF'(w) )    (28)
    DF'(w)  = sum_{d in D} TF(w, d) / |d|    (29)

There are two differences between this definition and the usual one. First, DF'(w) is not the number of documents with an occurrence of word w, but rather is the sum of the relative frequencies of w in each document. So DF'(w) can make use of frequency information instead of just considering binary occurrence information. Nevertheless, the dynamics of DF(w) and DF'(w) are similar: the more often a word w occurs throughout the corpus, the higher DF(w) and DF'(w) will be. The dynamics differ only in case there is a small fraction of documents in which the word w occurs very frequently; then DF'(w) will rise faster than DF(w). The second difference is that the square root is used to dampen the effect of the document frequency instead of the logarithm. Nevertheless, both functions are similar in shape and reduce the impact of high document frequencies.

Replacing probabilities with their estimators, the expression sum_{C' in C} Pr(C') * Pr(w|C') can be reduced to a function of IDF'(w):

    sum_{C' in C} Pr(C') * Pr(w|C')    (30)
      = sum_{C' in C} ( |C'| / |D| ) * (1/|C'|) * sum_{d in C'} TF(w, d) / |d|    (31)
      = sum_{C' in C} (1/|D|) * sum_{d in C'} TF(w, d) / |d|    (32)
      = (1/|D|) * sum_{d in D} TF(w, d) / |d|    (33)
      = DF'(w) / |D|    (34)
      = 1 / IDF'(w)^2    (35)

Using this and again substituting probabilities with their estimators, the decision rule can be rewritten as

    H_PrTFIDF(d') = argmax_{C_j in C} ( |C_j| / |D| ) * sum_{w in F} IDF'(w)^2 * ( (1/|C_j|) * sum_{d in C_j} TF(w, d) / |d| ) * ( TF(w, d') / |d'| )    (36)
                  = argmax_{C_j in C} ( |C_j| / |D| ) * sum_{w in F} ( (1/|C_j|) * sum_{d in C_j} TF(w, d) * IDF'(w) / |d| ) * ( TF(w, d') * IDF'(w) / |d'| )    (37)

Extracting the prototype vector component and the document representation component, we get to the decision rule

    H_PrTFIDF(d') = argmax_{C_j in C} c_j . ( d' / |d'| )    (38)
    c_j = ( |C_j| / |D| ) * (1/|C_j|) * sum_{d in C_j} d / |d|    (39)
    d_i = TF(w_i, d) * IDF'(w_i)    (40)

From the form of the decision rule in the previous lines it is easy to see that the PrTFIDF decision rule is equivalent with the TFIDF decision rule using the modified inverse document frequency weight IDF'(w), the number of words as document length normalization, and the inner product for measuring similarity. Furthermore it suggests how to set the parameters alpha and beta: for each category, alpha_j = |C_j| / |D| whereas beta = 0.

4.3 Implications of the Analysis

The analysis shows how and under which preconditions the TFIDF classifier fits into a probabilistic framework. The PrTFIDF classifier offers a new view on the vector space model and the TFIDF word weighting heuristic for text categorization and advances the theoretical understanding of their interactions. The analysis also suggests improvements to the TFIDF algorithm, namely that the following changes should lead to a better classifier. PrTFIDF is an implementation of TFIDF incorporating these changes:

- incorporation of prior probabilities Pr(C_j) through alpha_j,
- use of IDF'(w) for word weighting instead of IDF(w),
- use of the number of words for document length normalization instead of the Euclidian length, and
- use of the inner product for computing similarity.
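The estimates (20), (22) and (23) and the decision rule (24) translate directly into code. The following Python sketch is a minimal illustration of PrTFIDF under the n = 1 assumption; the function names and toy data are mine, not code from the paper:

```python
from collections import Counter

def train_prtfidf(docs):
    """docs: list of (word_list, label) pairs.
    Estimates Pr(C_j) as in (22) and Pr(w|C_j) as in (23)."""
    F = {w for words, _ in docs for w in words}
    classes = {c for _, c in docs}
    n_class = {c: sum(1 for _, c2 in docs if c2 == c) for c in classes}
    prior = {c: n_class[c] / len(docs) for c in classes}          # (22)
    pwc = {c: {w: 0.0 for w in F} for c in classes}
    for words, c in docs:
        tf = Counter(w for w in words if w in F)
        n = sum(tf.values())                                      # |d|
        for w, k in tf.items():
            pwc[c][w] += (k / n) / n_class[c]                     # (20) averaged as in (23)
    return prior, pwc, F

def classify_prtfidf(words, prior, pwc, F):
    """Decision rule (24): expectation of Pr(C_j|w) over the words of d'."""
    tf = Counter(w for w in words if w in F)
    n = sum(tf.values())
    def score(c):
        s = 0.0
        for w, k in tf.items():
            denom = sum(prior[c2] * pwc[c2][w] for c2 in prior)   # (21) denominator
            if denom > 0:
                s += (prior[c] * pwc[c][w] / denom) * (k / n)
        return s
    return max(prior, key=score)

# Toy illustration (hypothetical two-class data).
docs = [("ball game score".split(), "sport"),
        ("stock market price".split(), "finance"),
        ("game win".split(), "sport")]
prior, pwc, F = train_prtfidf(docs)
label = classify_prtfidf("ball game".split(), prior, pwc, F)
```

Note that no alpha/beta tuning appears anywhere: the class prior plays the role of alpha_j and beta is implicitly zero, exactly as the derivation above suggests.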
5 Experiments

The following experiments were performed to find out in how far the implications of the theoretical analysis lead to an improved classification algorithm in practice. The performances of PrTFIDF, TFIDF, and the naive Bayes classifier (BAYES) are compared on six categorization tasks.

5.1 Data Sets

5.1.1 Newsgroup Data

This data set consists of Usenet articles Ken Lang collected from 20 different newsgroups (table 1). 1000 articles were taken from each of the newsgroups, which makes a total of 20000 documents in this collection. Except for a small fraction of the articles, each document belongs to exactly one newsgroup. The task is to learn which newsgroup an article was posted to [2]. The results reported on this dataset are averaged over a number of random test/training splits, using binomial sign tests to estimate significance. In each experiment, 33% of the data was used for testing.

Table 1: Usenet newsgroups used in newsgroup data.

    comp.graphics               sci.electronics
    comp.windows.x              sci.crypt
    comp.os.ms-windows.misc     sci.space
    comp.sys.mac.hardware       sci.med
    comp.sys.ibm.pc.hardware    misc.forsale
    talk.politics.guns          alt.atheism
    talk.politics.mideast       rec.sport.baseball
    talk.politics.misc          rec.sport.hockey
    talk.religion.misc          rec.autos
    soc.religion.christian      rec.motorcycles

[2] About 4% of the articles were cross-posted among two of the newsgroups. In these cases, predicting either of the two newsgroups is counted as a correct prediction.

5.1.2 Reuters Data

The Reuters-22173 data was collected by the Carnegie Group from the Reuters newswire in 1987. Instead of averaging over all 135 categories, the following presents a more detailed analysis of five categories - namely the three most frequent categories "earn", "acq", and "cbond", and two categories with special properties, "wheat" and "crude".

The "wheat" and the "crude" category have very narrow definitions. Classifying according to whether a document contains the word "wheat" yields an accuracy of 99.7% for the "wheat" category. The category "acq" (corporate acquisitions), for example, does not have such an obvious definition. Its concept is more abstract, and a number of words are reasonable predictors.

In the following experiments, articles which appeared on April 7, 1987 or before are in the training set. Articles which appeared later are in the test set. This results in a corpus of 14,704 training examples and 6,746 test examples. Since the TFIDF classifier does not have a principled way of dealing with uneven class distributions, to allow a fair comparison the data is subsampled randomly so that there is an equal number of positive and negative examples. The results presented here are averaged over a number of trials, and binomial sign tests are used to estimate significance.

5.2 Experimental Results

Table 2 shows the maximum accuracy each learning method achieves.

Table 2: Maximum accuracy in percentages.

                  PrTFIDF   BAYES   TFIDF
    Newsgroups      91.8     89.6    86.3
    "acq"           88.9     88.5    84.5
    "wheat"         93.9     94.8    90.9
    "crude"         90.2     95.5    85.4
    "earn"          90.5     90.9    90.6
    "cbond"         91.9     90.9    87.7

On the newsgroup data, PrTFIDF performs significantly better than BAYES, and BAYES is significantly better than TFIDF. Compared to TFIDF, PrTFIDF leads to a reduction of error of about 40%.

[Figure 2: Accuracy versus the number of training examples on the newsgroup data.]
[Figure 3: Accuracy versus the number of training examples on the Reuters category "acq".]
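For reference, the heuristic TFIDF baseline entering this comparison, i.e. equations (1)-(3) with alpha = 16, beta = 4, negative prototype components clipped to zero, and cosine classification as in (5), might be sketched as follows. This is my illustration with toy data, not the experimental code used in the paper:

```python
import math
from collections import Counter

def train_tfidf(docs, alpha=16, beta=4):
    """Rocchio prototypes (3) over length-normalized TFIDF vectors (1)-(2).
    alpha/beta follow the [Buckley et al., 1994] recommendation used in the paper."""
    F = sorted({w for words, _ in docs for w in words})
    df = Counter(w for words, _ in docs for w in set(words))      # DF(w)
    idf = {w: math.log(len(docs) / df[w]) for w in F}             # (1)
    def vec(words):
        tf = Counter(words)
        v = [tf[w] * idf[w] for w in F]                           # (2)
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]                              # Euclidian normalization
    protos = {}
    for c in {c for _, c in docs}:
        pos = [vec(w) for w, c2 in docs if c2 == c]
        neg = [vec(w) for w, c2 in docs if c2 != c]
        proto = [alpha * sum(v[i] for v in pos) / len(pos)
                 - (beta * sum(v[i] for v in neg) / len(neg) if neg else 0.0)
                 for i in range(len(F))]                          # (3)
        protos[c] = [max(x, 0.0) for x in proto]                  # negative weights set to 0
    return protos, idf, F

def classify_tfidf(words, protos, idf, F):
    """Decision rule (5): highest cosine between prototype and document vector."""
    tf = Counter(words)
    d = [tf[w] * idf[w] for w in F]
    def cos(c):
        p = protos[c]
        num = sum(p[i] * d[i] for i in range(len(F)))
        den = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in d))
        return num / den if den else 0.0
    return max(protos, key=cos)

# Toy illustration (hypothetical two-class data).
docs = [("ball game score".split(), "sport"),
        ("stock market price".split(), "finance"),
        ("game win".split(), "sport")]
protos, idf, F = train_tfidf(docs)
label = classify_tfidf("ball game".split(), protos, idf, F)
```

Contrasting this with the PrTFIDF sketch of section 4 makes the design differences tangible: log versus square-root damping of document frequency, Euclidian length versus word count for normalization, and tuned alpha/beta versus class priors.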
[Figure 4: Accuracy versus the number of training examples on the Reuters category "wheat".]
[Figure 5: Accuracy versus the number of training examples on the Reuters category "crude".]
[Figure 6: Accuracy versus the number of training examples on the Reuters category "earn".]
[Figure 7: Accuracy versus the number of training examples on the Reuters category "cbond".]

PrTFIDF and BAYES outperform TFIDF on the Reuters categories "acq", "wheat", "crude", and "cbond" as well. Comparing PrTFIDF and BAYES, BAYES tends to work better on the tasks where certain single keywords have very high prediction accuracy - namely the tasks "wheat" and "crude". The opposite is true for the PrTFIDF classifier. It achieves comparable performance or performance gains over BAYES on the categories "acq" and "cbond" as well as on the newsgroup data. This behaviour is interesting, since it is plausible given the different simplifying assumptions PrTFIDF and BAYES make. All classifiers perform approximately the same on the category "earn".

Figures 2 to 7 show accuracy in relation to the number of training examples. As expected, the accuracy increases with the number of training examples. This holds for all learning methods and categorization tasks. Nevertheless, there are differences in how quickly the accuracy increases. In contrast to BAYES, PrTFIDF does particularly well in the newsgroup experiment (figure 2) for small numbers of training examples. The performance of BAYES approaches the one of PrTFIDF for high numbers, but stays below TFIDF for small training sets. The accuracy of the TFIDF classifier increases less steeply with the number of training examples compared to the probabilistic methods.

For the Reuters category "acq", BAYES and PrTFIDF show nearly identical curves (figure 3). TFIDF is significantly below the two probabilistic methods over the whole spectrum. For the tasks "wheat" (figure 4), "crude" (figure 5), and "cbond" (figure 7), all classifiers perform similarly for small training sets, and the difference generally increases with an increasing number of training examples.
6 Conclusions

This paper shows the relationship between text classifiers using the vector space model with TFIDF word weighting and probabilistic classifiers. It presents a probabilistic analysis of a particular TFIDF classifier and describes the algorithm using the same basic techniques from statistical pattern recognition that are used in probabilistic classifiers like BAYES. The analysis offers a theoretical explanation for the TFIDF word weighting heuristic in combination with the vector space retrieval model for text categorization and gives insight into the underlying assumptions.

Conclusions drawn from the analysis lead to the PrTFIDF classifier, which eliminates the inefficient parameter tuning and design choices of the TFIDF method. This makes the PrTFIDF classifier easy to use, and empirical results on six classification tasks support its applicability on real-world classification problems. Although the TFIDF method showed reasonable accuracy on all classification tasks, the two probabilistic methods BAYES and PrTFIDF showed performance improvements of up to a 40% reduction of error rate on five of the six tasks. These empirical results suggest that a probabilistically founded modelling is preferable to the heuristic TFIDF modelling. The probabilistic methods are preferable from a theoretical viewpoint, too, since a probabilistic framework allows the clear statement and easier understanding of the simplifying assumptions made. The relaxation as well as the combination of those assumptions provide promising starting points for future research.

Acknowledgements

I would like to thank Tom Mitchell for his inspiring comments on this work. Many thanks also to Sebastian Thrun, Phoebe Sengers, Sean Slattery, Ralf Klinkenberg, and Peter Brockhausen for their suggestions regarding this paper, and to Ken Lang for the dataset and parts of the code used in the experiments. This research is supported by ARPA under grant number F33615-93-1-1330 at Carnegie Mellon University.

References

[Bookstein, 1982] A. Bookstein, "Explanation and Generalization of Vector Models in Information Retrieval", in G. Salton, H. Schneider: Research and Development in Information Retrieval, Berlin, 1982.

[Buckley et al., 1994] C. Buckley, G. Salton, J. Allan, "The Effect of Adding Relevance Information in a Relevance Feedback Environment", International ACM SIGIR Conference, pages 292-300, 1994.

[Cooper, 1991] W. Cooper, "Some Inconsistencies and Misnomers in Probabilistic Information Retrieval", International ACM SIGIR Conference, pages 57-61, 1991.

[Fuhr, 1989] N. Fuhr, "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25(1), pages 55-72, 1989.

[Hayes et al., 1988] P. Hayes, L. Knecht, M. Cellio, "A News Story Categorization System", Second Conference on Applied Natural Language Processing, pages 9-17, 1988.

[Joachims et al., 1997] T. Joachims, D. Freitag, T. Mitchell, "WebWatcher: A Tour Guide for the World Wide Web", International Joint Conference on Artificial Intelligence (IJCAI), 1997.

[Lang, 1995] K. Lang, "NewsWeeder: Learning to Filter Netnews", International Conference on Machine Learning, 1995.

[Rocchio, 1971] J. Rocchio, "Relevance Feedback in Information Retrieval", in Salton: The SMART Retrieval System: Experiments in Automatic Document Processing, Chapter 14, pages 313-323, Prentice-Hall, 1971.

[Salton, 1991] G. Salton, "Developments in Automatic Text Retrieval", Science, Vol. 253, pages 974-979, 1991.

[Salton, Buckley, 1988] G. Salton, C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, pages 513-523, 1988.

[Vapnik, 1982] V. Vapnik, "Estimation of Dependencies Based on Empirical Data", Springer, 1982.

[Wang et al., 1992] Z. Wang, S. Wong, Y. Yao, "An Analysis of Vector Space Models Based on Computational Geometry", International ACM SIGIR Conference, 1992.

[Wong, Yao, 1989] S. Wong, Y. Yao, "A Note on Inverse Document Frequency Weighting Scheme", Technical Report 89-990, Department of Computer Science, Cornell University, 1989.

[Wu, Salton, 1981] H. Wu, G. Salton, "A Comparison of Search Term Weighting: Term Relevance vs. Inverse Document Frequency", Technical Report 81-457, Department of Computer Science, Cornell University, 1981.