A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Universität Dortmund, Fachbereich Informatik, Lehrstuhl 8
Baroper Str. 301
44221 Dortmund, Germany
Abstract

The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.

1 Introduction

Text categorization is the process of grouping documents into different categories or classes. With the amount of online information growing rapidly, the need for reliable automatic text categorization has increased. Text categorization techniques are used, for example, to build personalized netnews filters which learn about the news-reading preferences of a user [Lang, 1995]. They are used to index news stories [Hayes et al., 1988] or to guide a user's search on the World Wide Web [Joachims et al., 1997].

One of the most widely applied learning algorithms for text categorization is the Rocchio relevance feedback method [Rocchio, 1971] developed in information retrieval. Originally designed for optimizing queries from relevance feedback, the algorithm can be adapted to text categorization and routing problems. Although the algorithm is intuitive, it has a number of problems which - as I will show - lead to comparably low classification accuracy: (1) The objective of the Rocchio algorithm is to maximize a particular functional introduced in section 3.2.1; nevertheless, Rocchio does not show why maximizing this functional should lead to high classification accuracy. (2) The heuristic components of the algorithm offer many design choices, and there is little guidance when applying the algorithm to a new domain. (3) The algorithm was developed and optimized for relevance feedback in information retrieval; it is not clear which heuristics will work best for text categorization.

The major heuristic component of the Rocchio algorithm is the TFIDF (term frequency / inverse document frequency) word weighting scheme [Salton, Buckley, 1988]. Different flavors of this heuristic lead to a multitude of different algorithms. Due to this heuristic, this class of algorithms will be called TFIDF classifiers in the following.

A more theoretically founded approach to text categorization is provided by naive Bayes classifiers. These algorithms use probabilistic models for classification and allow the explicit statement of simplifying assumptions.

The contribution of this paper is a probabilistic analysis of a TFIDF classifier. This analysis makes the implicit assumptions of the TFIDF classifier as explicit as those of the naive Bayes classifier. Furthermore, it provides insight into how the TFIDF algorithm can be improved, leading to a probabilistic version of the TFIDF algorithm, called PrTFIDF. PrTFIDF optimizes the different design choices of the TFIDF algorithm as a whole and gives clear recommendations on how to set the parameters involved. Empirical results on six categorization tasks show that PrTFIDF not only enables a better theoretical understanding of the TFIDF algorithm, but also performs better in practice without being conceptually or computationally more complex.
This paper is structured as follows. Section 2 introduces the definition of text categorization used throughout this paper. A TFIDF classifier and a naive Bayes classifier are described in section 3. Section 4 presents the probabilistic analysis of the TFIDF classifier and states its implications. Empirical results and conclusions can be found in sections 5 and 6.

2 Text Categorization

The goal of text categorization is the classification of documents into a fixed number of predefined categories. The working definition used throughout this paper assumes that each document d is assigned to exactly one category. To put it more formally, there is a set of classes C and a set of training documents D. Furthermore, there is a target concept T: D -> C which maps documents to a class. T(d) is known for the documents in the training set. Through supervised learning, the information contained in the training examples can be used to find a model or hypothesis H: D -> C which approximates T. H(d) is the class to which the learned hypothesis assigns document d; it can be used to classify new documents. The objective is to find a hypothesis which maximizes accuracy, i.e. the percentage of times H and T agree.

3 Learning Methods for Text Categorization

This section describes the general framework for the experiments presented in this paper and defines the particular TFIDF classifier and the naive Bayes classifier used. The TFIDF classifier provides the basis for the analysis in section 4.

3.1 Representing Text

The representation of a problem has a strong impact on the generalization accuracy of a learning system. For categorization, a document, which typically is a string of characters, has to be transformed into a representation which is suitable for the learning algorithm and the classification task. IR research suggests that words work well as representation units and that their ordering in a document is of minor importance for many tasks. This leads to a representation of documents as bags of words.

[Figure 1: a netnews posting ("Subject: Need specs on Apple QT ... I need to get the specs, or at least a very verbose interpretation of the specs, for QuickTime. ...") shown next to its feature vector, with word counts such as 0 for "hockey" and 0 for "car".]

Figure 1: Bag-of-words representation in an attribute-value style.

This bag-of-words representation is equivalent to an attribute-value representation as used in machine learning. Each distinct word corresponds to a feature, with the number of times the word occurs in the document as its value. Figure 1 shows an example feature vector for a particular document. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least m (e.g. m = 3) times. The set of considered features (i.e. words) will be called F.

3.2 Learning Algorithms

3.2.1 TFIDF Classifier

This type of classifier is based on the relevance feedback algorithm originally proposed by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991]. Due to its heuristic components, there are a number of similar algorithms corresponding to the particular choice of those heuristics. The three main design choices are:

- the word weighting method,
- the document length normalization, and
- the similarity measure.

An overview of some heuristics is given in [Salton, Buckley, 1988]. In the following, the most popular combination will be used, known as "tfc": "tfidf" word weights [Salton, Buckley, 1988], document length normalization using the Euclidian vector length, and the cosine similarity.

Originally developed for information retrieval, the algorithm returns a ranking of documents without providing a threshold to define a decision rule for class membership. Therefore the algorithm has to be adapted to be used for text categorization. The variant presented here seems to be the most straightforward adaptation of the Rocchio algorithm to text categorization and domains with more than two categories.
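As a concrete illustration of the bag-of-words representation from section 3.1, the feature set F and the term-frequency vector of a document can be computed as in the following minimal sketch (the function names and toy documents are illustrative, not from the paper; tokenization is simplified to whitespace splitting):

```python
from collections import Counter

def build_feature_set(training_docs, m=3):
    """Collect the feature set F: words occurring at least m times
    in the training data (section 3.1)."""
    counts = Counter()
    for doc in training_docs:
        counts.update(doc.lower().split())
    return sorted(w for w, c in counts.items() if c >= m)

def term_frequencies(doc, features):
    """Attribute-value representation: TF(w, d) for each w in F."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in features]

docs = ["the specs for quicktime", "the specs the specs", "need the specs"]
F = build_feature_set(docs, m=3)                 # only "specs" and "the" survive
print(term_frequencies("the specs the manual", F))   # prints [1, 2]
```

Words below the threshold m (here "quicktime", "need", ...) are dropped, which keeps the feature vectors small at the cost of ignoring rare terms.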
The algorithm builds on the following representation of documents. Each document d is represented as a vector d = (d^(1), ..., d^(|F|)) so that documents with similar content have similar vectors (according to a fixed similarity metric). Each element d^(i) represents a distinct word w_i. d^(i) for a document d is calculated as a combination of the statistics TF(w_i, d) and DF(w_i) [Salton, 1991]. The term frequency TF(w_i, d) is the number of times word w_i occurs in document d, and the document frequency DF(w_i) is the number of documents in which word w_i occurs at least once. The inverse document frequency IDF(w_i) can be calculated from the document frequency:

    IDF(w_i) = log( |D| / DF(w_i) )                                        (1)

Here, |D| is the total number of documents. Intuitively, the inverse document frequency of a word is low if it occurs in many documents and is highest if the word occurs in only one. The so-called weight d^(i) of word w_i in document d is then

    d^(i) = TF(w_i, d) * IDF(w_i)                                          (2)

This word weighting heuristic says that a word w_i is an important indexing term for document d if it occurs frequently in it (the term frequency is high). On the other hand, words which occur in many documents are rated less important indexing terms due to their low inverse document frequency.

Learning is achieved by combining document vectors into a prototype vector c_j for each class C_j. First, both the normalized document vectors of the positive examples for a class and those of the negative examples are summed up. The prototype vector is then calculated as a weighted difference of each:

    c_j = alpha * (1/|C_j|)     * sum_{d in C_j}     d / ||d||
        - beta  * (1/|D - C_j|) * sum_{d in D - C_j} d / ||d||             (3)

alpha and beta are parameters that adjust the relative impact of positive and negative training examples. As recommended in [Buckley et al., 1994], alpha = 16 and beta = 4 will be used in the following. C_j is the set of training documents assigned to class j, and ||d|| denotes the Euclidian length of a vector d. Additionally, Rocchio requires that negative elements of the vector c_j are set to 0. Using the cosine as a similarity metric and alpha = beta = 1, Rocchio shows that each prototype vector maximizes the mean similarity of the positive training examples with the prototype vector c_j minus the mean similarity of the negative training examples with c_j:

    (1/|C_j|)     * sum_{d in C_j}     cos(c_j, d)
    - (1/|D - C_j|) * sum_{d in D - C_j} cos(c_j, d)                       (4)

Nevertheless, it is unclear if or how maximizing this functional connects to the accuracy of the resulting classifier.

The resulting set of prototype vectors, one vector for each class, represents the learned model. This model can be used to classify a new document d'. Again the document is represented as a vector d' using the scheme described above. To classify d', the cosines of the prototype vectors c_j with d' are calculated. d' is assigned to the class with which its document vector has the highest cosine:

    H_TFIDF(d') = argmax_{C_j in C} cos(c_j, d')                           (5)

argmax_x f(x) returns the argument x for which f(x) is maximum, and H_TFIDF(d') is the category to which the algorithm assigns document d'. The algorithm can be summarized in the following decision rule:

    H_TFIDF(d') = argmax_{C_j in C} (c_j . d') / ( ||c_j|| ||d'|| )        (6)

                = argmax_{C_j in C} [ sum_{i=1}^{|F|} c_j^(i) d'^(i) ]
                                    / sqrt( sum_{i=1}^{|F|} (c_j^(i))^2 )  (7)

In (7) the normalization with the length of the document vector is left out, since it does not influence the argmax.

3.2.2 Naive Bayes Classifier

The classifier presented in this section uses a probabilistic model of text. Although this model is a strong simplification of the true process by which text is generated, the hope is that it still captures most of the important characteristics.

In the following, word-based unigram models of text will be used, i.e. words are assumed to occur independently of the other words in the document. There are |C| such models, one for each category. All documents assigned to a particular category are assumed to be generated according to the model associated with this category.

The following describes one approach to estimating Pr(C_j|d'), the probability that a document d' is in class C_j. Bayes' rule says that to achieve the highest classification accuracy, d' should be assigned to the class for which Pr(C_j|d') is highest:

    H_BAYES(d') = argmax_{C_j in C} Pr(C_j|d')                             (8)

Pr(C_j|d') can be split up by considering documents separately according to their length l:

    Pr(C_j|d') = sum_{l} Pr(C_j|d', l) * Pr(l|d')                          (9)

Pr(l|d') equals one for the length l' of document d'
and is zero otherwise. After applying Bayes' theorem to Pr(C_j|d', l), we can therefore write:

    Pr(C_j|d') = Pr(d'|C_j, l') Pr(C_j|l')
                 / sum_{C' in C} Pr(d'|C', l') Pr(C'|l')                   (10)

Pr(d'|C_j, l') is the probability of observing document d' in class C_j given its length l'. Pr(C_j|l') is the prior probability that a document of length l' is in class C_j. In the following we will assume that the category of a document does not depend on its length, so Pr(C_j|l') = Pr(C_j). An estimate Pr^(C_j) for Pr(C_j) can be calculated from the fraction of training documents that is assigned to class C_j:

    Pr^(C_j) = |C_j| / sum_{C' in C} |C'| = |C_j| / |D|                    (11)

|C_j| denotes the number of training documents in class C_j, and |D| is the total number of documents.

The estimation of Pr(d'|C_j, l') is more difficult. Pr(d'|C_j, l') is the probability of observing a document d' in class C_j given that we consider only documents of length l'. Since there is - even for a simplifying representation as used here - a huge number of different documents, it is impossible to collect a sufficiently large number of training examples to estimate this probability without prior knowledge or further assumptions. In our case the estimation becomes possible due to the way documents are assumed to be generated. The unigram models introduced above imply that a word's occurrence is only dependent on the class the document comes from, but that it occurs independently of the other words in the document and that it is not dependent on the document length. (The weaker assumption of "linked-dependence" is actually sufficient [Cooper, 1991], but it is not considered here for simplicity.) So Pr(d'|C_j, l') can be written as:

    Pr(d'|C_j, l') = prod_{i=1}^{|d'|} Pr(w_i|C_j)                         (12)

w_i ranges over the sequence of words in document d' which are elements of the feature set F. |d'| is the number of words in document d'. The estimation of Pr(d'|C_j) is reduced to estimating each Pr(w_i|C_j) independently. A Bayesian estimate is used for Pr(w_i|C_j):

    Pr^(w_i|C_j) = ( 1 + TF(w_i, C_j) )
                   / ( |F| + sum_{w' in F} TF(w', C_j) )                   (13)

TF(w, C_j) is the overall number of times word w occurs within the documents in class C_j. This estimator, which is often called the Laplace estimator, is suggested in [Vapnik, 1982], pages 54-55. It assumes that the observation of each word is a priori equally likely. I found that this Bayesian estimator works well in practice, since it does not falsely estimate probabilities to be zero.

The following is the resulting decision rule if equations (8), (10) and (12) are combined:

    H_BAYES(d') = argmax_{C_j in C}
                  [ Pr(C_j) * prod_{i=1}^{|d'|} Pr(w_i|C_j) ]
                  / [ sum_{C' in C} Pr(C') * prod_{i=1}^{|d'|} Pr(w_i|C') ]         (14)

                = argmax_{C_j in C}
                  [ Pr(C_j) * prod_{w in F} Pr(w|C_j)^TF(w,d') ]
                  / [ sum_{C' in C} Pr(C') * prod_{w in F} Pr(w|C')^TF(w,d') ]      (15)

If Pr(C_j|d') is not needed as a measure of confidence, the denominator can be left out, since it does not change the argmax.

4 PrTFIDF: A Probabilistic Classifier Derived from TFIDF

In the following I will analyze the TFIDF classifier in a probabilistic framework. I will propose a classifier, called PrTFIDF, and then show its relationship to the TFIDF algorithm. In terms of the design choices listed above, I will show that the PrTFIDF algorithm is equivalent to a TFIDF classifier using the following settings:

- the word weighting mechanism uses a refined IDF weight especially adapted to text categorization,
- document length normalization is done using the number of words, and
- the similarity measure is the inner product.

Other researchers have already proposed theoretical interpretations of the vector space retrieval model [Bookstein, 1982][Wang et al., 1992] and the TFIDF word weighting scheme [Wong, Yao, 1989][Wu, Salton, 1981]. However, their work analyzes only parts of the TFIDF algorithm and is based on information retrieval instead of on text categorization.

4.1 The PrTFIDF Algorithm

The naive Bayes classifier proposed in the previous section provided an estimate of the probability Pr(C_j|d') that document d' is in class C_j, making the simplifying assumption of word independence. The PrTFIDF algorithm uses a different way of approximating Pr(C_j|d'), inspired by the "retrieval with probabilistic indexing" (RPI) approach proposed in [Fuhr, 1989]. In this approach a set of descriptors X is used to represent the content of documents. A descriptor x is assigned to a document d with a certain probability Pr(x|d). So, using the theorem of total probability in line (16) and Bayes' theorem in line (17), we can write
    Pr(C_j|d) = sum_{x in X} Pr(C_j|x, d) * Pr(x|d)                        (16)

              = sum_{x in X} Pr(C_j|x) * [ Pr(d|C_j, x) / Pr(d|x) ]
                            * Pr(x|d)                                      (17)

To make the estimation tractable, the simplifying assumption that Pr(d|C_j, x) = Pr(d|x) is made now:

    Pr(C_j|d) ~ sum_{x in X} Pr(C_j|x) * Pr(x|d)                           (18)

The validity of the assumption depends on the classification task and the choice of the set of descriptors X. It states that descriptor x provides enough information about d so that no information about document d is gained by taking its category C_j into account.

As mentioned above, the set of descriptors X is part of the design. A pragmatic choice for X used in the following is to consider all bags with n words from the feature set F as potential descriptors; e.g., for n = 3 these are all bags containing three words from F. The number n of words is a parameter which controls the quality of the approximation versus the complexity of the estimation.

Another way of looking at equation (18), especially suited for the choice of X considered here, is the following. Pr(C_j|d) is approximated by the expectation of Pr(C_j|x), where x consists of a sequence of n words drawn randomly from document d. For both interpretations the underlying assumption is that text documents are highly redundant with respect to the classification task and that any sequence of n words from the document is equally sufficient for classification. When classifying documents according to whether they are cooking recipes or not, for example, it is probably equally sufficient to know either of the sentences from the document. For n = |d|, Pr(C_j|d) equals Pr(C_j|x), but with decreasing n this simplifying assumption (like the independence assumption for the naive Bayes classifier) will be violated in practice. Nevertheless, this simplification is worth trying as a starting point.

In the following, the simplest case, namely n = 1, will be used and will lead to a TFIDF classifier like the one introduced in section 3.2.1. For n = 1, line (18) can be written as

    Pr(C_j|d) ~ sum_{w in F} Pr(C_j|w) * Pr(w|d)                           (19)

It remains to estimate the two probabilities from line (19). Pr(w|d) can be estimated from the representation of document d:

    Pr^(w|d) = TF(w, d) / sum_{w'} TF(w', d) = TF(w, d) / |d|              (20)

|d| denotes the number of words in document d. Pr(C_j|w), the remaining part of equation (19), is the probability that C_j is the correct category of d given that we only know the randomly drawn word w from d. Bayes' formula can be used to rewrite Pr(C_j|w):

    Pr(C_j|w) = Pr(w|C_j) * Pr(C_j)
                / sum_{C' in C} Pr(w|C') * Pr(C')                          (21)

As in the previous section, Pr(C_j) can be estimated from the fraction of the training documents that are assigned to class C_j:

    Pr^(C_j) = |C_j| / sum_{C' in C} |C'| = |C_j| / |D|                    (22)

Finally, Pr(w|C_j) can be estimated as

    Pr^(w|C_j) = (1/|C_j|) * sum_{d in C_j} Pr^(w|d)                       (23)

The resulting decision rule for PrTFIDF is

    H_PrTFIDF(d') = argmax_{C_j in C} sum_{w in F}
                    [ Pr(w|C_j) * Pr(C_j)
                      / sum_{C' in C} Pr(w|C') * Pr(C') ] * Pr(w|d')       (24)

4.2 The Connection between TFIDF and PrTFIDF

This section will show the relationship of the PrTFIDF classification rule to the TFIDF algorithm from section 3.2.1. In the following I will start with the decision rule for PrTFIDF and then transform it into the shape of a TFIDF classifier. From equation (24) we have

    H_PrTFIDF(d') = argmax_{C_j in C} sum_{w in F}
                    [ Pr(C_j) * Pr(w|C_j)
                      / sum_{C' in C} Pr(C') * Pr(w|C') ] * Pr(w|d')       (25)

The term sum_{C' in C} Pr(C') * Pr(w|C') in equation (25) can be re-expressed using a modified version of the inverse document frequency IDF(w). The definition of inverse document frequency as stated in section 3.2.1 was

    IDF(w) = log( |D| / DF(w) )                                            (26)

    DF(w) = sum_{d in D} [ 1 if d contains w, 0 otherwise ]                (27)

I now introduce a refined version of IDF(w) suggested by the PrTFIDF algorithm:

    IDF'(w) = sqrt( |D| / DF'(w) )                                         (28)

    DF'(w) = sum_{d in D} TF(w, d) / |d|                                   (29)

There are two differences between this definition of IDF(w) and the usual one. First, DF'(w) is not the
number of documents with an occurrence of word w, but rather the sum of the relative frequencies of w in each document. So IDF'(w) can make use of frequency information instead of just considering binary occurrence information. Nevertheless, the dynamics of DF(w) and DF'(w) are similar: the more often a word w occurs throughout the corpus, the higher DF(w) and DF'(w) will be. The dynamics differ only in case there is a small fraction of documents in which the word w occurs very frequently; then DF'(w) will rise faster than DF(w). The second difference is that the square root is used to dampen the effect of the document frequency instead of the logarithm. Nevertheless, both functions are similar in shape and reduce the impact of high document frequencies.

Replacing probabilities with their estimators, the expression sum_{C' in C} Pr(C') * Pr(w|C') can be reduced to a function of IDF'(w):

    sum_{C' in C} Pr^(C') * Pr^(w|C')                                      (30)

    = sum_{C' in C} (|C'|/|D|) * (1/|C'|) * sum_{d in C'} TF(w, d)/|d|     (31)

    = sum_{C' in C} (1/|D|) * sum_{d in C'} TF(w, d)/|d|                   (32)

    = sum_{C' in C} sum_{d in C'} TF(w, d) / ( |D| * |d| )                 (33)

    = sum_{d in D} TF(w, d) / ( |D| * |d| )                                (34)

    = DF'(w) / |D| = 1 / IDF'(w)^2                                         (35)

Using this, and again substituting probabilities with their estimators, the decision rule can be rewritten as

    H_PrTFIDF(d') = argmax_{C_j in C} sum_{w in F} (|C_j|/|D|) * IDF'(w)^2
                    * (1/|C_j|) * sum_{d in C_j} (TF(w, d)/|d|)
                    * (TF(w, d')/|d'|)                                     (36)

                  = argmax_{C_j in C} (|C_j|/|D|) * sum_{w in F} (1/|C_j|)
                    * sum_{d in C_j} ( TF(w, d) * IDF'(w) / |d| )
                    * ( TF(w, d') * IDF'(w) / |d'| )                       (37)

Extracting the prototype vector component and the document representation component, we get to the decision rule

    H_PrTFIDF(d') = argmax_{C_j in C} c_j . d' / |d'|                      (38)

    c_j = (|C_j|/|D|) * (1/|C_j|) * sum_{d in C_j} d / |d|                 (39)

    d^(i) = TF(w_i, d) * IDF'(w_i)                                         (40)

From the form of the decision rule in the previous lines it is easy to see that the PrTFIDF decision rule is equivalent to the TFIDF decision rule using the modified inverse document frequency weight IDF'(w), the number of words as document length normalization, and the inner product for measuring similarity. Furthermore, it suggests how to set the parameters alpha and beta: for each category, alpha_j = |C_j|/|D|, whereas beta = 0.

4.3 Implications of the Analysis

The analysis shows how and under which preconditions the TFIDF classifier fits into a probabilistic framework. The PrTFIDF classifier offers a new view on the vector space model and the TFIDF word weighting heuristic for text categorization and advances the theoretical understanding of their interactions. The analysis also suggests improvements to the TFIDF algorithm, namely that the following changes should lead to a better classifier. PrTFIDF is an implementation of TFIDF incorporating these changes:

- incorporation of prior probabilities Pr(C_j),
- use of IDF'(w) for word weighting instead of IDF(w),
- use of the number of words for document length normalization instead of the Euclidian length, and
- use of the inner product for computing similarity.

5 Experiments

The following experiments were performed to find out in how far the implications of the theoretical analysis lead to an improved classification algorithm in practice. The performances of PrTFIDF, TFIDF, and the naive Bayes classifier (BAYES) are compared on six categorization tasks.

5.1 Data Sets

5.1.1 Newsgroup Data

This data set consists of Usenet articles Ken Lang collected from 20 different newsgroups (table 1), with 1000 articles taken from each newsgroup.
    alt.atheism                 rec.sport.baseball
    comp.graphics               rec.sport.hockey
    comp.os.ms-windows.misc     sci.crypt
    comp.sys.ibm.pc.hardware    sci.electronics
    comp.sys.mac.hardware       sci.med
    comp.windows.x              sci.space
    misc.forsale                soc.religion.christian
    rec.autos                   talk.politics.guns
    rec.motorcycles             talk.politics.mideast
    talk.politics.misc          talk.religion.misc

Table 1: Usenet newsgroups used in the newsgroup data.

[Figure 2: Accuracy versus the number of training examples (670 to 13400) on the newsgroup data.]

                 PrTFIDF    BAYES    TFIDF
    Newsgroups     91.8      89.6     86.3
    "acq"          88.9      88.5     84.5
    "wheat"        93.9      94.8     90.9
    "crude"        90.2      95.5     85.4
    "earn"         90.5      90.9     90.6
    "cbond"        91.9      90.9     87.7

Table 2: Maximum accuracy in percentages.
This makes a total of 20000 documents in this collection. Except for a small fraction of the articles, each document belongs to exactly one newsgroup. The task is to learn which newsgroup an article was posted to. (About 4% of the articles were cross-posted among two of the newsgroups; in these cases, predicting either of the two newsgroups is counted as a correct prediction.) The results reported on this dataset are averaged over a number of random test/training splits, using binomial sign tests to estimate significance. In each experiment, 33% of the data was used for testing.

5.1.2 Reuters Data

The Reuters-22173 data was collected by the Carnegie Group from the Reuters newswire in 1987. Instead of averaging over all 135 categories, the following presents a more detailed analysis of five categories - namely the three most frequent categories, "earn", "acq", and "cbond", and two categories with special properties, "wheat" and "crude".

The "wheat" and the "crude" categories have very narrow definitions. Classifying according to whether a document contains the word "wheat" yields an accuracy of 99.7% for the "wheat" category. The category "acq" (corporate acquisitions), for example, does not have such an obvious definition. Its concept is more abstract, and a number of words are reasonable predictors.

In the following experiments, articles which appeared on April 7, 1987 or before are in the training set. Articles which appeared later are in the test set. This results in a corpus of 14,704 training examples and 6,746 test examples. Since the TFIDF classifier does not have a principled way of dealing with uneven class distributions, the data is subsampled randomly so that there is an equal number of positive and negative examples, to allow a fair comparison. The results presented here are averaged over a number of trials, and binomial sign tests are used to estimate significance.

[Figure 3: Accuracy versus the number of training examples on the Reuters category "acq" (subsampled).]

5.2 Experimental Results

Table 2 shows the maximum accuracy each learning method achieves. On the newsgroup data, PrTFIDF performs significantly better than BAYES, and BAYES is significantly better than TFIDF. Compared to TFIDF, PrTFIDF leads to a reduction of error of about 40%.
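The binomial sign test used for the significance estimates above can be sketched as follows (a minimal two-sided version over paired trials; the function name is illustrative, and ties are assumed to be discarded beforehand):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided binomial sign test: probability of a win/loss split at
    least this uneven under the null hypothesis that both classifiers
    are equally likely to win any single trial."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for the two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(9, 1))   # prints 0.021484375
```

A small p-value (e.g. below 0.05, as in the 9-to-1 split above) indicates that the accuracy difference between two classifiers is unlikely to be due to chance.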
[Figure 4: Accuracy versus the number of training examples (10 to 424) on the Reuters category "wheat" (subsampled).]

[Figure 5: Accuracy versus the number of training examples (10 to 776) on the Reuters category "crude" (subsampled).]

[Figure 6: Accuracy versus the number of training examples (10 to 5754) on the Reuters category "earn" (subsampled).]

[Figure 7: Accuracy versus the number of training examples (10 to 1906) on the Reuters category "cbond" (subsampled).]

PrTFIDF and BAYES outperform TFIDF on the Reuters categories "acq", "wheat", "crude", and "cbond" as well. Comparing PrTFIDF and BAYES, BAYES tends to work better on the tasks where certain single keywords have very high prediction accuracy - namely the tasks "wheat" and "crude". The opposite is true for the PrTFIDF classifier: it achieves comparable performance or performance gains over BAYES on the categories "acq" and "cbond" as well as on the newsgroup data. This behaviour is interesting, since it is plausible given the different simplifying assumptions PrTFIDF and BAYES make. All classifiers perform approximately the same on the category "earn".

Figures 2 to 7 show accuracy in relation to the number of training examples. As expected, the accuracy increases with the number of training examples. This holds for all learning methods and categorization tasks. Nevertheless, there are differences in how quickly the accuracy increases. In contrast to BAYES, PrTFIDF does particularly well in the newsgroup experiment (figure 2) for small numbers of training examples. The performance of BAYES approaches that of PrTFIDF for high numbers, but stays below TFIDF for small training sets. The accuracy of the TFIDF classifier increases less steeply with the number of training examples compared to the probabilistic methods.

For the Reuters category "acq", BAYES and PrTFIDF show nearly identical curves (figure 3). TFIDF is significantly below the two probabilistic methods over the whole spectrum. For the tasks "wheat" (figure 4), "crude" (figure 5), and "cbond" (figure 7), all classifiers perform similarly for small training sets, and the difference generally increases with an increasing number of training examples.
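To make the compared method concrete, the PrTFIDF training and classification steps derived in section 4 (equations (28), (29) and (38)-(40)) can be sketched as follows. This is a minimal sketch, not the experimental code; the toy corpus, function names, and whitespace tokenization are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_prtfidf(docs_by_class):
    """Build PrTFIDF prototype vectors (equations 28, 29, 39): length-
    normalized term frequencies, the refined IDF' weight, and the class
    prior |C_j|/|D| folded into each prototype."""
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    n_docs = len(all_docs)
    # DF'(w) = sum over documents of TF(w, d) / |d|   (equation 29)
    df_prime = defaultdict(float)
    for d in all_docs:
        for w, c in Counter(d).items():
            df_prime[w] += c / len(d)
    # IDF'(w) = sqrt(|D| / DF'(w))                    (equation 28)
    idf_prime = {w: math.sqrt(n_docs / df) for w, df in df_prime.items()}
    prototypes = {}
    for cj, docs in docs_by_class.items():
        proto = defaultdict(float)
        for d in docs:
            for w, c in Counter(d).items():
                proto[w] += (c / len(d)) * idf_prime[w]
        for w in proto:                      # (|C_j|/|D|) * (1/|C_j|)
            proto[w] *= (len(docs) / n_docs) / len(docs)
        prototypes[cj] = proto
    return prototypes, idf_prime

def classify_prtfidf(doc, prototypes, idf_prime):
    """Decision rule (equation 38): inner product of prototype and
    document vector, normalized by the number of words |d'|."""
    counts = Counter(doc)
    def score(proto):
        return sum(proto[w] * (c / len(doc)) * idf_prime.get(w, 0.0)
                   for w, c in counts.items() if w in proto)
    return max(prototypes, key=lambda cj: score(prototypes[cj]))

train = {"autos": [["car", "engine", "road"], ["engine", "car", "dealer"]],
         "space": [["orbit", "shuttle", "launch"], ["orbit", "moon", "launch"]]}
protos, idf_p = train_prtfidf(train)
print(classify_prtfidf(["car", "road", "orbit"], protos, idf_p))  # prints autos
```

Note that, unlike the Rocchio rule of section 3.2.1, no parameters need to be tuned: the prior and the IDF' weights follow directly from the derivation.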
6 Conclusions

This paper shows the relationship between text classifiers using the vector space model with TFIDF word weighting and probabilistic classifiers. It presents a probabilistic analysis of a particular TFIDF classifier and describes the algorithm using the same basic techniques from statistical pattern recognition that are used in probabilistic classifiers like BAYES. The analysis offers a theoretical explanation for the TFIDF word weighting heuristic in combination with the vector space retrieval model for text categorization and gives insight into the underlying assumptions.

Conclusions drawn from the analysis lead to the PrTFIDF classifier, which eliminates the inefficient parameter tuning and design choices of the TFIDF method. This makes the PrTFIDF classifier easy to use, and empirical results on six classification tasks support its applicability to real-world classification problems. Although the TFIDF method showed reasonable accuracy on all classification tasks, the two probabilistic methods BAYES and PrTFIDF showed performance improvements of up to a 40% reduction of the error rate on five of the six tasks. These empirical results suggest that a probabilistically founded modelling is preferable to the heuristic TFIDF modelling. The probabilistic methods are preferable from a theoretical viewpoint, too, since a probabilistic framework allows the clear statement and easier understanding of the simplifying assumptions made. The relaxation as well as the combination of those assumptions provide promising starting points for future research.

Acknowledgements

I would like to thank Tom Mitchell for his inspiring comments on this work. Many thanks also to Sebastian Thrun, Phoebe Sengers, Sean Slattery, Ralf Klinkenberg, and Peter Brockhausen for their suggestions regarding this paper, and to Ken Lang for the dataset and parts of the code used in the experiments. This research is supported by ARPA under grant number F33615-93-1-1330 at Carnegie Mellon University.

References

[Bookstein, 1982] A. Bookstein, "Explanation and Generalization of Vector Models in Information Retrieval", in G. Salton, H. Schneider: Research and Development in Information Retrieval, Berlin, 1982.

[Buckley et al., 1994] C. Buckley, G. Salton, J. Allan, "The Effect of Adding Relevance Information in a Relevance Feedback Environment", International ACM SIGIR Conference, pages 292-300, 1994.

[Cooper, 1991] W. Cooper, "Some Inconsistencies and Misnomers in Probabilistic Information Retrieval", International ACM SIGIR Conference, pages 57-61, 1991.

[Fuhr, 1989] N. Fuhr, "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25(1), pages 55-72, 1989.

[Hayes et al., 1988] P. Hayes, L. Knecht, M. Cellio, "A News Story Categorization System", Second Conference on Applied Natural Language Processing, pages 9-17, 1988.

[Joachims et al., 1997] T. Joachims, D. Freitag, T. Mitchell, "WebWatcher: A Tour Guide for the World Wide Web", International Joint Conference on Artificial Intelligence (IJCAI), 1997.

[Lang, 1995] K. Lang, "NewsWeeder: Learning to Filter Netnews", International Conference on Machine Learning, 1995.

[Rocchio, 1971] J. Rocchio, "Relevance Feedback in Information Retrieval", in Salton: The SMART Retrieval System: Experiments in Automatic Document Processing, Chapter 14, pages 313-323, Prentice-Hall, 1971.

[Salton, 1991] G. Salton, "Developments in Automatic Text Retrieval", Science, Vol. 253, pages 974-979, 1991.

[Salton, Buckley, 1988] G. Salton, C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, pages 513-523, 1988.

[Vapnik, 1982] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer, 1982.

[Wang et al., 1992] Z. Wang, S. Wong, Y. Yao, "An Analysis of Vector Space Models Based on Computational Geometry", International ACM SIGIR Conference, 1992.

[Wong, Yao, 1989] S. Wong, Y. Yao, "A Note on Inverse Document Frequency Weighting Scheme", Technical Report 89-990, Department of Computer Science, Cornell University, 1989.

[Wu, Salton, 1981] H. Wu, G. Salton, "A Comparison of Search Term Weighting: Term Relevance vs. Inverse Document Frequency", Technical Report 81-457, Department of Computer Science, Cornell University, 1981.